Existing models approximate Optimal Transport
Existing models of cooperative communication are approximations of OT. We demonstrate this point by expressing representatives of three broad classes of models as OT. The first class of models [Shafto2008, Shafto2014, shafto2012learning] is based on the classic Theory of Mind recursion, and computes exact answers in the limit of full recursion. The second class includes models that compute only the first step of the recursion [goodman2013knowledge, Eaves2016b, Eaves2016c] and approximate the OT solution with this probability distribution. The third class includes models that compute the first step and then select the data that maximize the probability of the hypothesis [hadfield2016cooperative, ho2016showing, ho2018effectively, fisac2017pragmatic], approximating the complete transport plan with a single data point for each hypothesis. These models have characteristic strengths and limitations, which the literature has yet to fully explore. After unifying these approaches as OT, we derive and contrast their consequences, and expose new, as yet unexplored algorithms and computational tools through which we may understand communication.
Full recursive reasoning is optimal transport
Cooperative models that build on the classic Theory of Mind recursion include cooperative inference [YangYGWVS18, wang2018generalizing] and pedagogical reasoning [Shafto2008, Shafto2014, shafto2012learning]. In this section, we briefly review the work on cooperative inference and illustrate how Bayesian inference models fit into our unifying OT framework. The core of cooperative inference between two agents is that the teacher's selection of data depends on what the learner is likely to infer, and vice versa. Let $P_{L_0}(h)$ be the learner's prior of hypothesis $h$, $P_{T_0}(d)$ be the teacher's prior of selecting data $d$, $P_T(d|h)$ be the teacher's posterior of selecting $d$ to convey $h$, and $P_L(h|d)$ be the learner's posterior for $h$ given $d$. Cooperative inference emphasizes that the agents' optimal communication plans $P_T(d|h)$ and $P_L(h|d)$ should satisfy the following system of interrelated equations for any $d$ and $h$, where $P_L(d)$ and $P_T(h)$ are the normalizing constants:
$$P_L(h|d) = \frac{P_T(d|h)\,P_{L_0}(h)}{P_L(d)}, \qquad P_T(d|h) = \frac{P_L(h|d)\,P_{T_0}(d)}{P_T(h)}.$$
Extending [YangYGWVS18]'s results on uniform priors, we show that (all proofs are included in the Appendix): the optimal communication plans $P_T(d|h)$ and $P_L(h|d)$ of a cooperative inference problem with arbitrary priors $P_{T_0}$ and $P_{L_0}$ can be obtained through Sinkhorn scaling. As a direct consequence, cooperative inference is a special case of the unifying OT framework with $\beta = 1$. Let $M$ be the joint distribution, $P_{T_0}$ be the teacher's prior and $P_{L_0}$ be the learner's prior. According to the proof of Proposition Document, after cooperative inference, the teacher's posterior selection matrix is the limit of $(P_{T_0}, P_{L_0})$-SK scaling of $M$. On the other hand, under the unifying OT framework, the optimal teaching plan is the limit of $(P_{T_0}, P_{L_0})$-SK scaling of $M$ based on Eq eq:ot_teaching, when the teacher's expense of selecting $d_i$ to convey $h_j$ is taken proportional to $-\log M_{ij}$. Symmetrically, one may check that the same holds for the learner's plan $P_L(h|d)$.
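The cooperative inference plans above can be computed directly by Sinkhorn scaling. The following is a minimal pure-Python sketch, not the paper's implementation: the matrix values are hypothetical, and the convention that rows index data and columns index hypotheses (with priors written `r` and `c`) is our labeling.

```python
def sinkhorn(M, r, c, iters=500):
    """(r, c)-Sinkhorn scaling: alternately rescale the rows of M to the
    marginal r (teacher's prior) and the columns to c (learner's prior)."""
    P = [row[:] for row in M]
    for _ in range(iters):
        for i in range(len(P)):                      # teacher's step
            s = sum(P[i])
            P[i] = [x * r[i] / s for x in P[i]]
        for j in range(len(c)):                      # learner's step
            s = sum(P[i][j] for i in range(len(P)))
            for i in range(len(P)):
                P[i][j] *= c[j] / s
    return P

# Hypothetical 3x3 common ground; rows = data d_i, columns = hypotheses h_j.
M = [[0.4, 0.3, 0.3],
     [0.2, 0.5, 0.3],
     [0.1, 0.2, 0.7]]
uniform = [1/3, 1/3, 1/3]
P = sinkhorn(M, uniform, uniform)

# Diagonal rescalings preserve cross-product ratios, e.g. on the top-left block:
cpr = lambda A: A[0][0] * A[1][1] / (A[0][1] * A[1][0])
```

Under uniform priors, row-normalizing the limit gives the learner's plan $P_L(h|d)$ and column-normalizing gives the teacher's plan $P_T(d|h)$; the invariance of cross-product ratios under the scaling is what ties the limit back to the common ground $M$.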
One-step approximate inference
Direct implementation of the recursive Theory of Mind above requires repeated computation of the normalizing constant for Bayesian inference. This is computationally challenging for large-scale problems and has been argued to be algorithmically implausible as a model of human cognition. For these reasons, models including Rational Speech Act (RSA) theory [goodman2013knowledge] and Bayesian Teaching [Eaves2016b, Eaves2016c] model cooperation as a single step of the recursion. To simplify exposition, we focus on the RSA model. RSA models the communication between a speaker and a listener, formalizing the cooperation that underpins pragmatic language [goodman2013knowledge, grice1975logic, levinson2000presumptive, clark1996using]. A pragmatic speaker selects an utterance optimally to inform a naive listener about a world state, whereas a pragmatic listener interprets an utterance rationally and infers the state using one-step Bayesian inference. This represents a communicative process in which a speaker-listener pair can be viewed as a teacher-learner pair, with world states and utterances being hypotheses and data points, respectively. RSA distinguishes among three levels of inference: a naive listener, a pragmatic speaker and a pragmatic listener. A naive listener interprets an utterance according to its literal meaning. That is, given a joint distribution $M$, the naive listener's inference of $h_j$ given $d_i$ is the $(i,j)$-th element of $L_0$, which is obtained by row normalization of $M$. A pragmatic speaker selects an utterance $d_i$ to convey the state $h_j$ such that $d_i$ maximizes utility. In particular, he picks $d_i$ to convey $h_j$ by soft-max optimizing expected utility,
$$P_T(d_i|h_j) \propto \exp\big(\beta\, U(d_i; h_j)\big),$$
where the utility is given by $U(d_i; h_j) = \log P_{L_0}(h_j|d_i) - c(d_i)$, which minimizes the surprisal of a naive listener when inferring $h_j$ given $d_i$, with an utterance cost $c(d_i)$. This formulation is the same as one step of the SK iteration in the OT framework (see Eq eq:sk_distance and Eq eq:teacher_cost) with $\beta = 1$, as in [goodman2013knowledge]. Next, a pragmatic listener reasons about the pragmatic speaker and infers the hypothesis using Bayes rule, $P_L(h_j|d_i) \propto P_T(d_i|h_j)\, P_{L_0}(h_j)$, where $P_T(d_i|h_j)$ represents the listener's reasoning about the speaker's data selection and $P_{L_0}(h_j)$ is the learner's prior. This is again one step of the recursion in the OT framework. As described above, teaching and learning plans in RSA are one-step approximations of the OT plans. Although limited recursion and optimization are realistic assumptions in psychology [goodman2013knowledge], in many cases such approximations are far from optimal. For example, world states can often be organized from most abstract to least abstract, which yields an upper triangular joint distribution matrix. A fully recursive model such as cooperative inference would output a diagonal matrix as the optimal plan [YangYGWVS18], which achieves the highest efficiency, whereas the cooperative index of the one-step approximation is much lower. Furthermore, one-step approximation plans are much more sensitive to each agent's estimation of the other agent. For instance, a pragmatic speaker's teaching plan is tailored to a naive listener; in contrast, the optimal plan obtained through full recursion is stable for any listener derived from the same common ground, as in Remark Document.
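The one-step recursion can be written as a single pair of normalization passes. A small pure-Python sketch, assuming a soft-max parameter of one, zero utterance cost, uniform priors, and a hypothetical upper triangular lexicon (rows are utterances, columns are world states):

```python
def normalize_rows(A):
    return [[x / sum(row) for x in row] for row in A]

def normalize_cols(A):
    n = len(A)
    sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
    return [[A[i][j] / sums[j] for j in range(n)] for i in range(n)]

# Hypothetical upper triangular lexicon: states ordered from abstract to specific.
M = [[1.0, 1.0, 1.0],
     [0.0, 1.0, 1.0],
     [0.0, 0.0, 1.0]]

L0 = normalize_rows(M)   # naive listener: literal interpretation
S1 = normalize_cols(L0)  # pragmatic speaker: soft-max utility, no cost
L1 = normalize_rows(S1)  # pragmatic listener: Bayes rule with uniform prior
```

The pair (S1, L1) is exactly one SK iteration; iterating it to convergence recovers the fully recursive plan, which for a triangular lexicon is the diagonal, whereas the one-step listener `L1` still leaves probability mass off the diagonal.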
Single-step argmax approximation
This idea arises in cooperative inverse reinforcement learning, where instead of selecting actions probabilistically, the maximum probability action is always selected [hadfield2016cooperative, ho2016showing, ho2018effectively, fisac2017pragmatic]. In particular, [fisac2017pragmatic] introduces pragmatic-pedagogic value alignment, a framework grounded in empirically validated cognitive models of pedagogical teaching and pragmatic learning. Pragmatic value alignment formalizes the cooperation between a human and a robot who act collaboratively with the goal of achieving the best possible outcome according to an objective. The true objective, however, is known only to the human. The human performs pedagogical actions to teach the true objective to the robot. After observing the human's action, the robot, who is pragmatic, updates his beliefs and performs an action that maximizes expected utility. The human, observing this action, can then update their beliefs about the robot's current beliefs and choose a new pedagogic action. Denote actions by $a$ and objectives by $\theta$. When the human performs an action they act as a teacher, and when the robot performs an action the roles are reversed. In particular, the pedagogic human selects an action to teach the objective according to Eq eq:utility, where the utility captures the human's best expected outcome. As described in Section Document, this is equivalent to a single step of the recursion in the OT framework. The robot interprets the human's action rationally and updates his prior beliefs about the true objective using Bayes rule as in Eq eq:Bayes. Then, acting as a teacher, the robot chooses the action that maximizes the human's expected utility using the argmax function, where $a_R$ denotes the robot's actions and $a_H$ denotes the human's actions.
Unlike in human teaching, where plans are chosen proportionally to a probability distribution, here the robot chooses a deterministic action using the argmax function. As described above, pragmatic-pedagogic value alignment is modeled by computing a single step of OT and selecting the action that maximizes the outcome. Unlike cooperative inference, which tends to select the leading diagonal of the common ground $M$ (Proposition Document), pragmatic-pedagogic value alignment selects the maximal element in each column of $M$, which is not even guaranteed to form a plan that distinguishes every hypothesis. As a consequence, a drawback of such an argmax method is that for large hypothesis spaces, multiple hypotheses may reach the argmax on the same data point, which leads to low communication efficiency. Further, the analysis in Section Document shows that deterministic methods such as argmax are much less robust to perturbations.
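The collision problem can be seen in a toy example with hypothetical numbers: two objectives share the same maximizing action, so a deterministic argmax plan cannot distinguish them, while a probability-matching plan still can.

```python
# Rows = actions/data, columns = objectives/hypotheses; hypothetical plan matrix.
P_T = [[0.60, 0.55, 0.10],
       [0.30, 0.35, 0.20],
       [0.10, 0.10, 0.70]]

def argmax_plan(P):
    """Deterministic plan: for each column (objective), pick the single
    row (action) with maximal probability."""
    n = len(P)
    return [max(range(n), key=lambda i: P[i][j]) for j in range(n)]

plan = argmax_plan(P_T)
```

Here the first two objectives both map to action 0, so a learner observing that action cannot identify which objective was intended; sampling actions in proportion to the columns of `P_T` avoids this degeneracy.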
Analyzing models of cooperative communication
With prior models unified as instances of Optimal Transport via Sinkhorn scaling, we analyze the properties of these models. We focus on two of the most important aspects: understanding the models from a statistical perspective, and understanding them in the context of realistic assumptions about common ground.
Full recursive reasoning is statistically and information theoretically optimal
Having demonstrated the equivalence of SK scaling and full recursive reasoning (Proposition Document), strong statistical justifications of fully Bayesian recursive reasoning follow immediately, since SK scaling is optimal in the senses of entropy minimization and likelihood maximization [csiszar1975divergence, darroch1972generalized, brown1993order]. Sinkhorn scaling solves entropy minimization with marginal constraints. Let $M$ be a joint distribution matrix over $\mathcal{D}$ and $\mathcal{H}$. Denote the set of all possible joint distribution matrices with marginals $r$ and $c$ by $\Pi(r, c)$ (all couplings). Consider the question of finding the approximation matrix $M^*$ of $M$ in $\Pi(r, c)$ that minimizes its relative entropy with $M$, i.e. $M^* = \arg\min_{P \in \Pi(r, c)} D_{KL}(P \| M)$, where
$$D_{KL}(P \| M) = \sum_{i,j} P_{ij} \log \frac{P_{ij}}{M_{ij}}.$$
It is proved, for example in [csiszar1989geometric, franklin1989scaling], that the $(r, c)$-Sinkhorn scaling of $M$ converges to $M^*$ if the limit exists. We may therefore directly interpret the result of fully Bayesian recursive reasoning as the communication plan with minimum discrimination information for pairs of interacting agents. In addition, Sinkhorn scaling also arises naturally in maximum likelihood estimation. Let $\hat{P}$ be the empirical distribution of $N$ i.i.d. samples from a true underlying distribution, which belongs to a model family. Then the log likelihood of this sample set over a distribution $P$ in the model family is given by $N \sum_{i,j} \hat{P}_{ij} \log P_{ij}$, where $N$ is the sample size. Comparing with Eq eq:KL, it is clear that maximizing the log likelihood (and so the likelihood) over a given family of $P$ is equivalent to minimizing $D_{KL}(\hat{P} \| P)$. Both [darroch1972generalized] and [csiszar1989geometric] show that when the model is in the exponential family, the maximum likelihood estimate can be obtained through SK scaling with the empirical marginals.
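The entropy-minimization characterization can be probed numerically: among couplings with the same marginals (generated here, as a hypothetical construction, by SK-scaling random positive matrices), the SK limit of the joint itself attains the smallest relative entropy with the joint. A pure-Python sketch:

```python
import math, random

def sinkhorn(M, r, c, iters=500):
    """(r, c)-Sinkhorn scaling of a positive matrix M."""
    n = len(M)
    P = [row[:] for row in M]
    for _ in range(iters):
        for i in range(n):
            s = sum(P[i])
            P[i] = [x * r[i] / s for x in P[i]]
        for j in range(n):
            s = sum(P[i][j] for i in range(n))
            for i in range(n):
                P[i][j] *= c[j] / s
    return P

def kl(P, M):
    """Relative entropy D_KL(P || M) for positive matrices."""
    return sum(P[i][j] * math.log(P[i][j] / M[i][j])
               for i in range(len(P)) for j in range(len(P)))

# Hypothetical joint distribution (entries sum to 1).
M = [[0.15, 0.10, 0.08],
     [0.07, 0.17, 0.09],
     [0.05, 0.07, 0.22]]
u = [1/3, 1/3, 1/3]
M_star = sinkhorn(M, u, u)          # KL projection of M onto couplings(u, u)

random.seed(0)
others = [sinkhorn([[random.uniform(0.1, 1.0) for _ in range(3)]
                    for _ in range(3)], u, u) for _ in range(20)]
```

Every matrix in `others` satisfies the same marginal constraints, yet none beats `M_star` in relative entropy with `M`, matching the characterization of the SK limit as the minimum-discrimination-information coupling.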
Understanding greedy choice
As a preliminary step toward analyzing common ground, we explore the effect on the optimal plans as $\beta$ varies, showing that as $\beta \to \infty$ the solution converges to the leading diagonals of $M$, that as $\beta \to 0$ the solution goes to a uniform matrix, and, more generally, analyzing the variations in the distribution over all possible optimal plans caused by the choice of $\beta$. To simplify notation, we assume uniform priors on $\mathcal{D}$ and $\mathcal{H}$ for our discussions. For a given joint distribution $M$, consider the OT problem for the teacher (similarly for the learner). Recall that, as in Eq eq:ot_teaching, the optimal teaching plan is the limit of SK iteration of $M^{(\beta)}$ (the matrix obtained from $M$ by raising each element to the power of $\beta$). Therefore, to study the dynamics of regularized OT solutions, we may focus on $M^{(\beta)}$ and its SK limit $\hat{M}^{(\beta)}$, which are cross-ratio equivalent (Section Document). One extreme is when $\beta$ gets close to zero. If $\beta \to 0$, then $M_{ij}^{\beta} \to 1$ for any nonzero element $M_{ij}$ of $M$. Thus $M^{(\beta)}$ converges to a matrix filled with ones on the nonzero entries of $M$, and $\hat{M}^{(\beta)}$ converges to a uniform matrix if $M$ has no vanishing entries. It is shown in [wang2018generalizing] that the cooperative index (Section Document) attains its lower bound on uniform matrices. Hence $\hat{M}^{(\beta)}$ reaches the lowest communicative efficiency as $\beta$ goes to zero. The other extreme is when $\beta$ gets close to infinity. In this case, we show that $\hat{M}^{(\beta)}$ concentrates around the leading diagonals of $M$ as $\beta \to \infty$. This indicates that as $\beta \to \infty$, the number of diagonals of $\hat{M}^{(\beta)}$ decreases. Therefore the cooperative index increases as $\beta \to \infty$, since the cooperative index of a matrix is bounded below by the reciprocal of its number of positive diagonals [wang2018generalizing]. In particular, if $M$ has only one leading diagonal, $\hat{M}^{(\beta)}$ converges to a doubly stochastic matrix with only one positive diagonal, i.e. a permutation matrix. In this case the cooperative index equals one, which indicates the highest communication efficiency. As pointed out in Section Document, the OT planning, which picks the best diagonals, is notably different from the argmax selection.
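Both extremes can be checked numerically. In the sketch below (pure Python, hypothetical matrix, uniform priors, with the soft-max parameter written as `beta` following our notation), SK scaling of the element-wise power of $M$ approaches a uniform plan for a small exponent and a permutation supported on the unique leading diagonal for a large one:

```python
def sinkhorn(M, iters=500):
    """SK scaling to uniform marginals: each row and column sums to 1/n."""
    n = len(M)
    P = [row[:] for row in M]
    for _ in range(iters):
        for i in range(n):
            s = sum(P[i])
            P[i] = [x / (n * s) for x in P[i]]
        for j in range(n):
            s = sum(P[i][j] for i in range(n))
            for i in range(n):
                P[i][j] /= n * s
    return P

def power(M, beta):
    """Element-wise power M^(beta)."""
    return [[x ** beta for x in row] for row in M]

# Hypothetical common ground whose unique leading diagonal is the identity.
M = [[0.5, 0.2, 0.3],
     [0.2, 0.6, 0.2],
     [0.3, 0.2, 0.5]]

P_low  = sinkhorn(power(M, 0.01))  # exponent -> 0: near-uniform plan
P_high = sinkhorn(power(M, 30.0))  # exponent -> infinity: near-permutation plan
```

`P_low` has all entries near $1/9$ (lowest cooperative index), while `P_high` is nearly a permutation matrix on the identity diagonal (cooperative index near one).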
In general, the magnitude of $\beta$ causes variations in the distribution over all possible optimal plans. Let $A$ be either the joint distribution $M$ or an agent's planning matrix derived from $M$. Notice that the product of the elements on a diagonal of $A$ is proportional to the probability of sampling that diagonal from all of $A$'s diagonals. Then the cross-product ratio between two diagonals $D_1$ and $D_2$ is precisely the ratio between the probabilities of sampling $D_1$ and $D_2$. Proposition Document shows that the optimal plan of an agent is concentrated on diagonals of $M$. Thus, up to normalization, each $A^{(\beta)}$ represents a distribution over all possible optimal plans (diagonals). $A^{(1)}$ constitutes the true distribution over optimal plans derived from $M$, since $A$ is cross-product-ratio equivalent to $M$. $A^{(\beta)}$ with $\beta \neq 1$, in contrast, represents a distribution that either exaggerates or suppresses the cross-product ratios of $M$, depending on whether $\beta$ is greater or less than one.
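The exaggeration and suppression effect follows from a simple identity: raising a matrix element-wise to a power raises every cross-product ratio to that same power. A two-line check with hypothetical values:

```python
def cpr(A, i1, i2, j1, j2):
    """Cross-product ratio of the entries picked out by rows i1, i2
    and columns j1, j2."""
    return (A[i1][j1] * A[i2][j2]) / (A[i1][j2] * A[i2][j1])

A = [[0.5, 0.2],
     [0.1, 0.6]]
beta = 2.0
A_beta = [[x ** beta for x in row] for row in A]
# cpr(A_beta) == cpr(A) ** beta: ratios above one are exaggerated when
# the exponent exceeds one, and pushed toward one when it is below one.
```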
Analyzing sensitivity to common ground
In this section, we investigate the sensitivity of the OT framework under perturbations of the common ground among agents. This ensures robustness of the inference when the agents' beliefs differ, which shows the viability of our model in practice. First, the robustness of OT planning with a fixed regularizer is considered. In this case, optimal plans are obtained through SK scaling of an initial matrix $M$ with given marginal conditions $r$ and $c$. This can be viewed as a map, denoted by $SK$, from $(M, r, c)$ to the SK limit. [wang2018generalizing] explored the sensitivity of $SK$ to perturbations of the elements of $M$. They pointed out that $SK$ is continuous in $M$. In particular, they demonstrated that $SK$ is robust to any amount of off-diagonal perturbation of $M$. SK scaling is also continuous in its scalars. Let $r_\epsilon$ and $c_\epsilon$ be vectors obtained by varying the elements of $r$ and $c$ by at most $\epsilon$, where $\epsilon$ quantifies the amount of perturbation. Distances between vectors or matrices are measured by the $L^\infty$ norm (the maximum element-wise difference), e.g. $\|r - r_\epsilon\|_\infty \leq \epsilon$. We prove that $SK$ is continuous in $M$, $r$ and $c$, and thus the following holds: for any joint distribution $M$ and positive marginals $r$ and $c$, if $SK(M, r, c)$ and $SK(M_\epsilon, r_\epsilon, c_\epsilon)$ exist, then $SK(M_\epsilon, r_\epsilon, c_\epsilon) \to SK(M, r, c)$ as $\epsilon \to 0$. Continuity of $SK$ implies that small perturbations of the joint and marginal distributions yield close solutions for optimal plans. Thus cooperative communicative actions based on the unifying OT framework are stable under variations in the agents' estimations of the common ground. Moreover, when restricted to positive joint distributions $M$, [luise2018] shows that $SK$ is in fact smooth in $r$ and $c$. Building on their proof technique, we further extend the smoothness of $SK$ to $M$ (the general result on non-negative joint distributions is stated and proved in Appendix LABEL:apd:_smooth). Therefore, the following holds: let $\mathcal{M}$ be the set of positive initial matrices, and $\mathcal{R}$ and $\mathcal{C}$ be the sets of all positive marginal distributions over $\mathcal{D}$ and $\mathcal{H}$ respectively; then $SK$ is $C^\infty$ on $\mathcal{M} \times \mathcal{R} \times \mathcal{C}$. Theorem Document guarantees that the optimal plans obtained through SK scaling are infinitely differentiable.
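The continuity claim can be probed numerically: the SK limits of a matrix and a slightly perturbed copy stay close. The matrix below is hypothetical, and the bound checked at the end is a deliberately loose sanity bound rather than the theorem's constant:

```python
def sinkhorn(M, r, c, iters=500):
    """(r, c)-Sinkhorn scaling of a positive matrix M."""
    n = len(M)
    P = [row[:] for row in M]
    for _ in range(iters):
        for i in range(n):
            s = sum(P[i])
            P[i] = [x * r[i] / s for x in P[i]]
        for j in range(n):
            s = sum(P[i][j] for i in range(n))
            for i in range(n):
                P[i][j] *= c[j] / s
    return P

M = [[0.4, 0.3, 0.3],
     [0.2, 0.5, 0.3],
     [0.1, 0.2, 0.7]]
eps = 1e-3
M_eps = [[x + eps for x in row] for row in M]   # perturbed joint distribution
r = [1/3, 1/3, 1/3]
r_eps = [1/3 + eps, 1/3, 1/3 - eps]             # perturbed marginal

P = sinkhorn(M, r, r)
P_eps = sinkhorn(M_eps, r_eps, r)
gap = max(abs(P[i][j] - P_eps[i][j]) for i in range(3) for j in range(3))
```

With both the joint and one marginal perturbed by $10^{-3}$, the element-wise gap between the two limits remains of comparable order, consistent with continuity of the map $SK$.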
In particular, we may explicitly derive the gradient of $SK$ with respect to both the marginals and the joint distribution. (Closed forms of the gradients are included in Appendix LABEL:apd:gradient.) An advantage of having these closed forms is that a fully recursive agent can quickly reconstruct a better cooperative plan using gradient descent methods once he realizes the deviation from the previously assumed common ground. Importantly, the choice of $\beta$ affects the sensitivity to violations of common ground. Without loss of generality, assuming uniform priors on $\mathcal{D}$ and $\mathcal{H}$, consider the case where the teacher has the accurate $M$ and the learner's estimation $M_\epsilon$ of the joint distribution contains an additive deviation $\epsilon$ on one element of $M$. When the deviation occurs on the leading diagonal, the optimal plan for the learner is the same as if they had the precise $M$, since the location of the leading diagonal is unchanged. However, problems may occur when the deviation occurs on an element contained only in non-leading diagonals. Intuitively, if the deviation is large enough, the rank of the diagonals in $M_\epsilon$, which determines the learner's optimal plan, will change. This will cause a difference between the two agents' optimal plans, which reduces the communication efficiency. Formally, let $D_1$ be a leading diagonal and $D_2$ be a non-leading diagonal of $M$, and let $m_1$ and $m_2$ be the products of their elements respectively. Further, let $D_1^\epsilon$ and $D_2^\epsilon$ be the corresponding diagonals in $M_\epsilon$, where the deviation falls on an element $a$ of $D_2$ that is not contained in $D_1$. Then $m_1^\epsilon = m_1$ and $m_2^\epsilon = m_2 (a + \epsilon)/a$. If $m_2 (a + \epsilon)/a > m_1$, then $D_2^\epsilon$ will become the leading diagonal of $M_\epsilon$, and hence the learner's optimal plan will change. In light of this, we have: the stability of a joint distribution $M$ is $\delta(M) = \min a\,(m_1/m_2 - 1)$, where the minimum is taken over all non-leading diagonals $D_2$ and all $D_2$-entries $a$ contained only in non-leading diagonals. The analysis in the previous paragraph shows that when the deviation $\epsilon < \delta(M)$, the leading diagonal of $M_\epsilon$ is unchanged no matter where the deviation arises. In this case, the learner may safely pick a sufficiently large $\beta$. Yet when $\epsilon > \delta(M)$, a large value of $\beta$ will decrease the probability of a mutually agreed upon solution.
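The stability threshold can be computed directly by enumerating diagonals (permutations). A pure-Python sketch under uniform priors with a hypothetical $M$; the names `stability` and `diagonals` are our labels:

```python
from itertools import permutations
import math

def diagonals(M):
    """All (permutation, product-of-entries) pairs of a square matrix."""
    n = len(M)
    return [(p, math.prod(M[i][p[i]] for i in range(n)))
            for p in permutations(range(n))]

def stability(M):
    """min of a * (m1/m2 - 1) over non-leading diagonals D2 and entries a
    of D2 that do not lie on the leading diagonal."""
    diags = diagonals(M)
    lead, m1 = max(diags, key=lambda t: t[1])
    on_lead = {(i, lead[i]) for i in range(len(M))}
    return min(M[i][p[i]] * (m1 / m2 - 1)
               for p, m2 in diags if p != lead
               for i in range(len(M)) if (i, p[i]) not in on_lead)

# Hypothetical common ground; its leading diagonal is the identity.
M = [[0.5, 0.2, 0.3],
     [0.2, 0.6, 0.2],
     [0.3, 0.2, 0.5]]
delta = stability(M)
```

For this matrix, an additive deviation below `delta` on any single entry leaves the leading diagonal unchanged, while a larger deviation on, say, the (0, 2) entry promotes the anti-diagonal-like permutation (2, 1, 0) to leading, changing the learner's optimal plan.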
In the absence of strong constraints on potential violations of common ground, i.e. strong constraints on the maximum value of the deviation $\epsilon$ and on the number of such deviations, $\beta = 1$ is recommended, as it preserves cross-product ratios. With probability one, a random $n \times n$ matrix contains exactly one leading diagonal. Assume the deviation appears uniformly on each entry of $M$. Then with probability $1 - 1/n$, it appears on an element contained only in non-leading diagonals. Thus, for large $n$, the deviation occurs almost surely at locations that could change the leading diagonal, and hence represents deviation from the true optimal plan. Assuming independence, the vast majority of deviations would be of the unhelpful variety, thus decreasing the probability of agreement between agents about the leading diagonal. When $\beta > 1$, the consequences are more severe, because the exaggeration of differences between diagonals pushes an agent's estimate of the optimal plan further away from the true optimal plan. Moreover, an additive deviation can not only shuffle the rank of existing diagonals, but also introduce new diagonals; $\beta > 1$ is much more sensitive to such a deviation than $\beta = 1$ (see further discussion in Appendix LABEL:apd:_add_epsilon). Thus, belief transport is most stable to violations of common ground when agents match, rather than maximize, probabilities. The above discussion also suggests that the argmax approximation method described in Section Document is much more sensitive to small perturbations. Similar to leading diagonals, the location of the argmax in a row or column varies non-continuously with a deviation. This may cause dramatic differences in the agents' action plans, which leads to a low cooperative index. Therefore, argmax approaches do not in general yield optimal behavior.
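The last point can be illustrated with a hypothetical column of a plan matrix: a deviation smaller than the gap between the top two entries flips the argmax choice entirely, while the matched probabilities barely move.

```python
# Teacher's distribution over three data points for one hypothesis (hypothetical).
col = [0.50, 0.49, 0.01]
col_eps = [0.50, 0.51, 0.01]    # small additive deviation on the second entry

argmax = max(range(3), key=lambda i: col[i])
argmax_eps = max(range(3), key=lambda i: col_eps[i])

# Probability matching: renormalize and compare element-wise.
norm = [x / sum(col) for x in col]
norm_eps = [x / sum(col_eps) for x in col_eps]
shift = max(abs(a - b) for a, b in zip(norm, norm_eps))
```

The deterministic plan jumps from data point 0 to data point 1, a maximal change in behavior, whereas the matched plan moves by at most about one percentage point.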
Discussion and Conclusions
Computational-level and rational analyses of cognition hinge on assumptions about the structure for which the mind is optimized. When these analyses focus on properties of the world, such as natural scenes or natural categories, these assumptions are hard or impossible to independently validate, which has led to questions about the utility of the approach. When analyzing cooperative communication, the relevant structure is other people's belief updating and choices, domains for which we have strong independent theory. Given models derived from the literature, we show it is possible to unify existing algorithmic models, derive statistical interpretations, and derive a priori constraints on models by analyzing their robustness to important, open theoretical problems. Moreover, in doing so, we expose a new algorithmic-level perspective on the implementation of cooperative communication and on recovery from violations of common ground through gradient descent. Our results clarify why, how, and under what conditions cooperative communication may facilitate learning despite violations of common ground. Why can cooperation facilitate learning? Recursive reasoning about others' mental states and actions is precisely Sinkhorn scaling, which computes maximum likelihood plans for optimal transport of beliefs from one agent to another. How does cooperation succeed despite violations of common ground? Sinkhorn scaling is a continuous function, which implies that small differences in the inputs yield bounded differences in the outputs. Moreover, the smoothness property additionally guarantees the ability to recover from deviations in an online fashion, further increasing robustness to violations of common ground. Under what conditions is cooperative communication robust? Cooperative communication is robust to such violations precisely when the plans are based on probability matching, or are at least close enough not to magnify the consequences of violations too much.
Researchers in cultural anthropology and cognitive development argue that people have evolved a specialized cultural niche and associated learning mechanisms that enable rapid accumulation of knowledge across ontogeny and phylogeny. We provide support for these claims. Specifically, cooperative communication, through the ability to reason recursively about changes in beliefs in response to choices, is a specialized adaptation for learning from other agents. Moreover, this adaptation enables effective transmission of beliefs, and hence accumulation of knowledge, through the computation of maximum likelihood plans that are robust to violations of common ground. Thus, one may theoretically transmit beliefs between agents whose beliefs are quite different, such as parents and children, speakers and listeners, teachers and learners, or members of different cultural groups, as is necessary to explain rapid accumulation of knowledge. Formation and maintenance of common ground remains a formidable challenge. We have shown that cooperative communication, viewed as Optimal Transport computed through Sinkhorn scaling, has mathematical properties (it is a continuous, even smooth, map) that explain how cooperative communication could succeed in theory. Yet in practice cooperative communication remains challenging. When communication is between teachers and learners or robots and humans, the hypotheses may be organized differently, or the true hypothesis may not be in the hypothesis space. In education, this is because the goal is often inducing conceptual change or introducing new concepts. In robotics, hypothesis spaces are designed for computational simplicity rather than fidelity to humans', and are unlikely to align cleanly or completely. These violations go beyond simple perturbations and instead involve mismatches between the hypotheses themselves, which violate the continuity necessary to ensure robustness.
Recent empirical results raise questions about the replicability of science across the behavioral sciences [open2015estimating]. Proposed improvements in the design and analysis of experiments are an important step toward addressing these issues. Equally important is the development of stronger, more principled approaches to theory development. While mechanisms like preregistration certainly reduce post hoc experimental and analytic degrees of freedom, they do not address the problem of how to justify hypotheses in the first place, and therefore only slow the rate of post hoc hypothesizing. Our analysis shows that it is possible to derive strong a priori predictions from first principles. Our results focus on cooperative communication, but may be extensible to Theory of Mind and other domains of reasoning that can be construed as recursive reasoning about possible plans. Moreover, the Optimal Transport framework, which simply models problems of moving distributions, includes Bayesian inference as a special case, suggesting that this approach may be much more widely relevant to modeling cognition.