# A Rate-Distortion view of human pragmatic reasoning

What computational principles underlie human pragmatic reasoning? A prominent approach to pragmatics is the Rational Speech Act (RSA) framework, which formulates pragmatic reasoning as probabilistic speakers and listeners recursively reasoning about each other. While RSA enjoys broad empirical support, it is not yet clear whether the dynamics of such recursive reasoning may be governed by a general optimization principle. Here, we present a novel analysis of the RSA framework that addresses this question. First, we show that RSA recursion implements an alternating maximization for optimizing a tradeoff between expected utility and communicative effort. On that basis, we study the dynamics of RSA recursion and disconfirm the conjecture that expected utility is guaranteed to improve with recursion depth. Second, we show that RSA can be grounded in Rate-Distortion theory, while maintaining a similar ability to account for human behavior and avoiding a bias of RSA toward random utterance production. This work furthers the mathematical understanding of RSA models, and suggests that general information-theoretic principles may give rise to human pragmatic reasoning.


## 1 Introduction

The ability to reason about meaning in the local context of social interactions is a fundamental aspect of human language. For example, pragmatic reasoning is key to inferring non-literal meanings of utterances, to the efficient use of linguistic ambiguity (Peloquin et al., 2019), and to driving language evolution and change (Sperber & Origgi, 2010; Traugott, 2012). Thus, understanding the computational principles that may give rise to pragmatic reasoning is important for studying the forces that shape language and cognition.

A prominent computational approach to pragmatics is the Rational Speech Act (RSA) framework (Frank & Goodman, 2012; Goodman & Frank, 2016). RSA formulates pragmatic reasoning as probabilistic speakers and listeners recursively reasoning about each other’s state of mind with the goal of cooperatively gaining communicative utility. This framework enjoys broad empirical support across a number of psycholinguistic phenomena, such as scalar implicature, irony, metaphor, and hyperbole (for a review, see Goodman & Frank, 2016). In many cases, RSA’s recursive reasoning is assumed to terminate after one or two iterations, although several studies have explored and motivated deeper recursions (e.g., Camerer et al., 2004; Franke & Degen, 2016; Bergen et al., 2016; Levy, 2018; Frank et al., 2018). However, much remains unknown about the dynamics of RSA recursion and whether it can be characterized by a well-motivated optimization principle. It has been conjectured that RSA dynamics is guaranteed to increase expected utility (e.g., Yuan et al., 2018; Peloquin et al., 2019), but these explorations have relied on numeric simulations, leaving open key questions about the dynamics of RSA models.

In this work we present a set of analytic results, demonstrated by model simulations, that extend the theoretical understanding of the RSA framework and ground it in Rate–Distortion (RD) theory (Shannon, 1948), the subfield of information theory that characterizes efficient compression under limited resources, and that has recently been applied to other aspects of language (Zaslavsky et al., 2018; Hahn et al., 2020) and cognition (Sims, 2016, 2018; Gershman, 2020). Specifically, our main contributions are: (1) We show that RSA recursion is an instance of the alternating maximization algorithm (Csiszár & Shields, 2004). However, this optimization does not maximize expected utility as previously conjectured, but rather a tradeoff between expected utility and communicative effort. (2) We show that the RSA tradeoff can be generalized to a type of RD tradeoff, which yields a slightly modified model of pragmatic reasoning. We refer to this model as RD-RSA. (3) Building on these results, we study the dynamics of RSA and RD-RSA and compare their predictions. We find that RD-RSA is similar to RSA in its ability to account for human behavior, while avoiding a bias of RSA toward non-informative random utterance production. Taken together, these results suggest that human pragmatic reasoning may be understood in terms of RD theory.

### The Rational Speech Act framework

We begin by reviewing the formal setup of the RSA framework, which forms the basis for our analysis. The RSA framework provides a class of models of pragmatic reasoning, based on a reference game involving a speaker and a listener. The game is defined by a set of meanings (referents) $\mathcal{M}$, a set of utterances $\mathcal{U}$, and a lexicon $l(m,u)$ that determines the literal meanings of each utterance (see Figure 1a for an example). Given a meaning $m$ drawn from a prior distribution $P(m)$, the speaker communicates $m$ to the listener by producing an utterance $u$ according to a production distribution $S(u|m)$. Upon hearing this utterance, the listener interprets it according to an inference distribution $L(m|u)$. RSA recursively relates the speaker and listener by assuming that the speaker reasons about the listener’s inferences, and that the listener is Bayesian with respect to the speaker’s distribution. This recursion is typically initialized with a literal listener $L_0$ that makes inferences based on the literal meaning of utterances; that is, $L_0(m|u) \propto l(m,u)P(m)$. The pragmatic speaker is bounded-rational with respect to a utility function $V_L(m,u)$, typically defined by

$$V_L(m,u) = \log L(m|u) - C(u), \tag{1}$$

where $C(u)$ specifies the cost of $u$. The speaker and listener at recursion depth $t \geq 1$ are defined by

$$S_t(u|m) \propto e^{\alpha V_{t-1}(m,u)}, \tag{2}$$

$$L_t(m|u) \propto S_t(u|m)\,P(m), \tag{3}$$

where $V_{t-1}$ is simplified notation for $V_{L_{t-1}}$, and $\alpha$ controls the degree to which the speaker is rational with respect to maximizing utility. The dynamics of this recursive process are illustrated in Figure 1b. We also consider the speaker’s and listener’s distributions in the limit $t \to \infty$.
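The recursion in equations (2) and (3) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function names `normalize` and `rsa`, the `[meaning, utterance]` array layout, and the 3×3 binary lexicon are assumptions made for this example, and the literal listener is assumed to be proportional to the lexicon times the prior.

```python
import numpy as np

def normalize(x, axis):
    """Normalize a non-negative array along `axis`; all-zero slices stay zero."""
    total = x.sum(axis=axis, keepdims=True)
    return np.divide(x, total, out=np.zeros_like(x), where=total > 0)

def rsa(lexicon, prior, alpha=1.0, cost=None, depth=2):
    """Run `depth` RSA iterations; arrays are indexed [meaning, utterance].

    Returns (speakers, listeners), where speakers[t-1] is S_t and
    listeners[t] is L_t, with listeners[0] the literal listener.
    """
    lexicon = np.asarray(lexicon, dtype=float)
    prior = np.asarray(prior, dtype=float)
    cost = np.zeros(lexicon.shape[1]) if cost is None else np.asarray(cost, dtype=float)
    # Literal listener (assumed form): L_0(m|u) proportional to l(m,u) P(m).
    listener = normalize(lexicon * prior[:, None], axis=0)
    speakers, listeners = [], [listener]
    for _ in range(depth):
        # Eq. (2): exp(alpha * (log L - C)) = L**alpha * exp(-alpha * C),
        # written this way so that zeros in L need no special handling.
        speaker = normalize(listener**alpha * np.exp(-alpha * cost), axis=1)
        # Eq. (3): Bayesian listener, normalized over meanings.
        listener = normalize(speaker * prior[:, None], axis=0)
        speakers.append(speaker)
        listeners.append(listener)
    return speakers, listeners

# Hypothetical binary lexicon: rows are meanings, columns are utterances.
lexicon = [[1, 0, 0],
           [1, 1, 0],
           [0, 1, 1]]
prior = np.ones(3) / 3
speakers, listeners = rsa(lexicon, prior, alpha=1.0, depth=2)
print(np.round(listeners[0], 3))  # literal listener L_0
print(np.round(listeners[1], 3))  # pragmatic listener L_1
```

In this toy game the literal listener is indifferent between the second and third meanings on hearing the second utterance (0.5 each), whereas the depth-1 listener assigns probability 0.6 to the second meaning, because a speaker who meant the third meaning had a more specific utterance available.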

## 2 Understanding RSA dynamics as Alternating–Maximization

Our first theoretical result is that the RSA recursion optimizes a tradeoff between maximizing expected utility,

$$\mathbb{E}_S[V_L] = \sum_{m,u} P(m)\,S(u|m)\,V_L(m,u), \tag{4}$$

and maximizing the conditional entropy of the speaker’s production distribution,

$$H_S(U|M) = -\sum_{m,u} P(m)\,S(u|m)\log S(u|m). \tag{5}$$

More precisely, we show that for any $\alpha > 0$, RSA’s pragmatic interlocutors jointly maximize

$$G_\alpha[S,L] = H_S(U|M) + \alpha\,\mathbb{E}_S[V_L]. \tag{6}$$

In this view, $\alpha$ does not trade off against recursion depth, as is widely understood (Frank et al., 2018). Rather, the value of $\alpha$ determines the tradeoff between one conception of communicative effort, captured by $H_S(U|M)$, and the expected utility that is optimized by RSA recursion.

Importantly, this result holds without assuming that $S$ and $L$ satisfy equations (2) and (3) respectively. Instead, the optimization is taken over all valid speaker and listener distributions, and the RSA equations emerge as characterizing the optimal agents. The main idea of the proof is to take the derivatives of $G_\alpha$ with respect to $S$ and $L$, and equate these derivatives to zero to obtain conditions for optimality. The formal statement and proof of this claim are given in the Supplementary Material (SM, Proposition 1). Here, we discuss the implications of this result and demonstrate it numerically.

First, optimizing $G_\alpha$ can be interpreted as a type of least-effort principle (Zipf, 1949). Maximizing $\mathbb{E}_S[V_L]$ amounts to minimizing the expected listener’s surprisal and the cost of utterances. Maximizing $H_S(U|M)$ amounts to minimizing communicative effort, measured by the deviation from a random production of utterances. These two terms compete: attaining the maximal expected utility requires a very precise selection of utterances, leading to low entropy (high effort), while attaining minimal communicative effort requires a random production distribution, leading to low expected utility. Therefore, least-effort optimization may give rise to human pragmatic reasoning as captured by RSA.

Second, Proposition 1 in the SM implies that the RSA recursion implements an instance of the alternating maximization algorithm (Csiszár & Shields, 2004). That is, given a listener $L_{t-1}$ it holds that

$$S_t = \arg\max_{S} G_\alpha[S, L_{t-1}], \tag{7}$$

and given a speaker $S_t$ it holds that

$$L_t = \arg\max_{L} G_\alpha[S_t, L]. \tag{8}$$

In particular, this means that $G_\alpha$ is monotonically non-decreasing with recursion depth. That is, for every $t \geq 1$,

$$G_\alpha[S_t, L_{t-1}] \leq G_\alpha[S_t, L_t] \leq G_\alpha[S_{t+1}, L_t]. \tag{9}$$

This process is demonstrated by the model simulation of Figure 2. Because $G_\alpha$ is bounded from above, it follows that RSA iterations are guaranteed to converge (to be precise, the value of $G_\alpha$ is guaranteed to converge to a local optimum; for stronger convergence conditions see Csiszár & Tusnády, 1984), and the fixed points of this recursion are stationary points of $G_\alpha$.
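The monotonicity in equation (9) can be checked numerically. The sketch below is illustrative only: the graded 3×3 lexicon, the value of $\alpha$, the zero utterance cost, and the helper names `normalize` and `objective_G` are assumptions for this example, not the simulation behind Figure 2.

```python
import numpy as np

def normalize(x, axis):
    return x / x.sum(axis=axis, keepdims=True)

def objective_G(S, L, prior, alpha):
    """G_alpha[S, L] = H_S(U|M) + alpha * E_S[log L], assuming zero cost."""
    joint = prior[:, None] * S
    entropy = -np.sum(joint * np.log(S))
    utility = np.sum(joint * np.log(L))
    return entropy + alpha * utility

# Hypothetical graded lexicon (all entries positive, so every log is finite).
lexicon = np.array([[1.0, 0.2, 0.1],
                    [0.3, 1.0, 0.2],
                    [0.1, 0.3, 1.0]])
prior = np.ones(3) / 3
alpha = 0.8

L = normalize(lexicon * prior[:, None], axis=0)    # literal listener (assumed form)
trace = []
for t in range(1, 11):
    S = normalize(L**alpha, axis=1)                # speaker step, eq. (7)
    trace.append(objective_G(S, L, prior, alpha))  # G[S_t, L_{t-1}]
    L = normalize(S * prior[:, None], axis=0)      # listener step, eq. (8)
    trace.append(objective_G(S, L, prior, alpha))  # G[S_t, L_t]

print(np.round(trace, 4))
print("non-decreasing:", bool(np.all(np.diff(trace) >= -1e-10)))
```

The recorded values interleave the two half-steps, so consecutive entries correspond exactly to the chain of inequalities in equation (9); the small tolerance only absorbs floating-point rounding near convergence.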

This result disconfirms a known conjecture about RSA dynamics. Because the RSA speaker is guided by (soft) optimization of utterance utility, the intuition is widely held that RSA recursion locally maximizes expected utility (e.g., Yuan et al., 2018; Peloquin et al., 2019). However, our analysis reveals that RSA recursion optimizes $G_\alpha$, and this does not imply that the expected utility is maximized. In fact, it is possible to show that the listener maximizes the expected utility, but not the speaker. Intuitively, this holds because $H_S(U|M)$ depends only on $S$, and therefore in practice the listener’s update step (8) maximizes only the expected utility, whereas the speaker’s update step (7) trades it off with communicative effort, which may result in lower expected utility. To see this more formally, fix $t$ and note that

$$
\begin{aligned}
\mathbb{E}_{S_t}[V_t] &= \frac{1}{\alpha}\Big(G_\alpha[S_t, L_t] - H_{S_t}(U|M)\Big) && (10)\\
&= \frac{1}{\alpha}\Big(\max_{L} G_\alpha[S_t, L] - H_{S_t}(U|M)\Big) && (11)\\
&= \max_{L}\, \mathbb{E}_{S_t}[V_L], && (12)
\end{aligned}
$$

which implies in particular that $\mathbb{E}_{S_t}[V_t] \geq \mathbb{E}_{S_t}[V_{t-1}]$ at every recursion depth $t$. However, the expected utility may decrease due to the speaker update step, that is, it is possible that $\mathbb{E}_{S_{t+1}}[V_t] < \mathbb{E}_{S_t}[V_t]$.
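The asymmetry between the two update steps can be explored with a short script that separates the change in expected utility caused by each step. This is a sketch under stated assumptions: the graded lexicon, zero utterance cost, the two values of $\alpha$, and the function names are hypothetical. Only the listener-step inequality is guaranteed by the argument above; the sign of the speaker-step change is left for inspection.

```python
import numpy as np

def normalize(x, axis):
    return x / x.sum(axis=axis, keepdims=True)

def expected_utility(S, L, prior):
    """E_S[V_L] with zero cost, using the convention 0 log 0 = 0."""
    joint = prior[:, None] * S
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(L[mask])))

def utility_changes(lexicon, prior, alpha, depth=10):
    """Return per-iteration utility changes from listener and speaker steps."""
    L = normalize(lexicon * prior[:, None], axis=0)   # literal listener (assumed form)
    S = normalize(L**alpha, axis=1)                   # S_1
    listener_steps, speaker_steps = [], []
    for _ in range(depth):
        before = expected_utility(S, L, prior)        # E_{S_t}[V_{t-1}]
        L = normalize(S * prior[:, None], axis=0)     # listener update, eq. (8)
        after = expected_utility(S, L, prior)         # E_{S_t}[V_t]
        S = normalize(L**alpha, axis=1)               # speaker update, eq. (7)
        listener_steps.append(after - before)
        speaker_steps.append(expected_utility(S, L, prior) - after)
    return np.array(listener_steps), np.array(speaker_steps)

# Hypothetical graded lexicon and uniform prior.
lexicon = np.array([[1.0, 0.2, 0.1],
                    [0.3, 1.0, 0.2],
                    [0.1, 0.3, 1.0]])
prior = np.ones(3) / 3
for alpha in (0.5, 2.0):
    listener_steps, speaker_steps = utility_changes(lexicon, prior, alpha)
    print(alpha, bool(listener_steps.min() >= -1e-10), np.round(speaker_steps[:3], 4))
```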

This observation is exemplified in Figure 3. To this end, we considered a standard RSA model with three uniformly distributed meanings and three possible utterances. The lexicon in this example is a graded version of the lexicon shown in Figure 1a (while the lexicon is often taken to be binary, graded lexica have also been notably considered, e.g., Yuan et al., 2018), as can be seen by the structure of the literal listener (Figure 3c, $L_0$). For simplicity, we take $C(u) = 0$. Consistent with our theoretical analysis, RSA iterations always improve $G_\alpha$ (Figure 3a), but expected utility may increase (Figure 3b, blue trajectory), or decrease (red trajectory), depending on $\alpha$. Our analysis also implies that in cases where the initial state has maximal $H_S(U|M)$ given hard lexical constraints (i.e., structural zeros in the lexicon arising when some messages do not satisfy the truth conditions of some utterances), the expected utility will not decrease. We speculate that the possibility of RSA iteration decreasing expected utility has not previously been identified in numeric simulations because RSA initializations are typically already high in speaker conditional entropy.

In the following sections, we build on this new interpretation of RSA in order to ground it in a fundamental information-theoretic principle, and to study analytically several properties of these models, including the influence of $\alpha$ and the asymptotic behavior of their dynamics.

## 3 Grounding RSA in Rate–Distortion theory

Returning to the general communication setup of RSA, the speaker can be seen as a probabilistic encoder and the listener as a probabilistic decoder. From an information-theoretic perspective, optimal encoder-decoder pairs in this setup are characterized by Rate–Distortion (RD) theory (Shannon, 1948), the subfield of information theory that concerns efficient source coding with respect to a fitness function (or distortion) between a target message (speaker’s meaning) and a reconstructed message (listener’s interpretation). In this view, the speaker and listener should jointly optimize the tradeoff between maximizing the expected utility and minimizing the number of bits required for communication. The latter is captured by minimizing the mutual information between speaker meanings and utterances,

$$I_S(M;U) = H_S(U) - H_S(U|M), \tag{13}$$

where $H_S(U)$ is the entropy of the marginal distribution of speaker utterances, $S(u) = \sum_m S(u|m)P(m)$. In other words, from an RD perspective, the speaker and listener should jointly minimize the tradeoff

$$F_\alpha[S,L] = I_S(M;U) - \alpha\,\mathbb{E}_S[V_L]. \tag{14}$$

This type of RD tradeoff is closely related to the RSA tradeoff. This can be seen by plugging equations (13) and (6) into (14), which gives

$$F_\alpha[S,L] = H_S(U) - G_\alpha[S,L]. \tag{15}$$

We therefore refer to the optimization of $F_\alpha$ as RD-RSA.
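For completeness, the substitution behind equation (15) can be written out step by step. It uses only the definitions in equations (6), (13), and (14):

$$
\begin{aligned}
F_\alpha[S,L] &= I_S(M;U) - \alpha\,\mathbb{E}_S[V_L] \\
&= H_S(U) - H_S(U|M) - \alpha\,\mathbb{E}_S[V_L] \\
&= H_S(U) - \big(H_S(U|M) + \alpha\,\mathbb{E}_S[V_L]\big) \\
&= H_S(U) - G_\alpha[S,L].
\end{aligned}
$$

Because $H_S(U)$ depends on the speaker but not on the listener, the two objectives lead to the same listener update for a fixed speaker and can differ only in the speaker update.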

The key difference between RSA and RD-RSA is the utterance entropy term $H_S(U)$. RSA optimization, i.e., maximizing $G_\alpha$, is equivalent to minimizing $F_\alpha$ while also maximizing $H_S(U)$. Minimizing $F_\alpha$ alone yields a modified model of pragmatic reasoning. Following a derivation similar to that of RSA as alternating maximization, we show that RD-RSA predicts the following recursive reasoning process (SM, Proposition 2):

$$S_t(u|m) \propto S_{t-1}(u)\exp\big(\alpha V_{t-1}(m,u)\big), \tag{16}$$

$$S_t(u) = \sum_m S_t(u|m)\,P(m), \tag{17}$$

$$L_t(m|u) \propto S_t(u|m)\,P(m). \tag{18}$$

These update equations also implement an alternating optimization algorithm, this time with respect to $F_\alpha$. The optimal RD-RSA listener is Bayesian (18), exactly as in RSA. However, the optimal pragmatic speaker (16) differs from the RSA speaker (2) in that it weights the soft-max utility term by the marginal utterance probability $S_{t-1}(u)$. One might be inclined to think that this adjustment is simply a special instance of RSA with cost function $C(u) = -\frac{1}{\alpha}\log S(u)$. We wish to emphasize that this is not the case: $S(u)$ is not pre-determined, as a cost function would be in RSA, but rather changes with each iteration as the speaker reasons about the listener.
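The RD-RSA updates in equations (16)-(18) can be sketched in the same style as the RSA recursion. The example below is a minimal illustration, not the authors' implementation: the graded lexicon, zero utterance cost, the uniform initial marginal $S_0(u)$, and the names `rd_rsa` and `objective_F` are assumptions made here.

```python
import numpy as np

def normalize(x, axis):
    return x / x.sum(axis=axis, keepdims=True)

def objective_F(S, L, prior, alpha):
    """F_alpha[S, L] = I_S(M;U) - alpha * E_S[log L], assuming zero cost."""
    joint = prior[:, None] * S
    marginal = joint.sum(axis=0)                       # S(u), eq. (17)
    mutual_info = np.sum(joint * np.log(S / marginal))
    utility = np.sum(joint * np.log(L))
    return mutual_info - alpha * utility

def rd_rsa(lexicon, prior, alpha, depth=10):
    """Iterate eqs. (16)-(18); return final speaker, listener, and F per iteration."""
    n_u = lexicon.shape[1]
    L = normalize(lexicon * prior[:, None], axis=0)    # literal listener (assumed form)
    marginal = np.ones(n_u) / n_u                      # assumed initial S_0(u): uniform
    trace = []
    for _ in range(depth):
        S = normalize(marginal * L**alpha, axis=1)     # eq. (16)
        marginal = prior @ S                           # eq. (17)
        L = normalize(S * prior[:, None], axis=0)      # eq. (18)
        trace.append(objective_F(S, L, prior, alpha))
    return S, L, np.array(trace)

# Hypothetical graded lexicon and uniform prior.
lexicon = np.array([[1.0, 0.2, 0.1],
                    [0.3, 1.0, 0.2],
                    [0.1, 0.3, 1.0]])
prior = np.ones(3) / 3
S, L, trace = rd_rsa(lexicon, prior, alpha=1.2)
print(np.round(trace, 4))
print("non-increasing:", bool(np.all(np.diff(trace) <= 1e-10)))
```

The only difference from the RSA speaker step is the multiplication by the current marginal, which is recomputed on every iteration rather than fixed in advance, in line with the point that it does not act as a pre-determined cost.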

This analysis shows that with a small adjustment, RSA can be grounded in RD theory. While RD-RSA is closely related to RSA, the theoretical motivation and precise predictions of these two principles are different. Next, we compare several properties of RSA and RD-RSA in order to gain insight into which principle might better characterize human pragmatic reasoning.

## 4 Properties of RSA and RD-RSA

Building on the results presented above, we study the asymptotic tendencies of RSA and RD-RSA and evaluate their predictions on existing experimental data.

### Asymptotic behavior and criticality of α=1

To gain a better understanding of the dynamics of RSA and RD-RSA recursion, we analyze their asymptotic behavior. That is, we focus on the set of optimal speaker-listener pairs as a function of $\alpha$. For RSA, these are the fixed points of equations (7) and (8) (or equivalently, (2) and (3)), and for RD-RSA these are the fixed points of equations (16)-(18). This reveals the general tendencies of the model dynamics, and shows how the value of $\alpha$ may influence the model predictions. Therefore, this asymptotic analysis is useful even if human pragmatic reasoning, to the extent that it resembles this recursive process, might be confined to a small number of iterations. As before, the formal proofs are provided in the SM (Section 3), and here we discuss these results while focusing on the main conclusions.

We begin by considering a basic RSA setup in which the speaker’s meanings are distributed uniformly, $P(m) = 1/|\mathcal{M}|$, and the number of unique utterances is the same as the number of meanings. In addition, we first assume a graded lexicon $l(m,u) > 0$, where the values defining the applicability of utterances to meanings can be arbitrarily small but not zero. We prove in the SM that in both RSA and RD-RSA, $\alpha = 1$ is a critical point at which the optimization dynamics changes its direction. That is, there are two regimes: $\alpha < 1$, in which the non-informative solution (i.e., maximal $H_S(U|M)$ in RSA, and $I_S(M;U) = 0$ in RD-RSA) is optimal; and $\alpha > 1$, in which the maximal-utility solution (i.e., maximal $\mathbb{E}_S[V_L]$) is optimal. This transition can be clearly seen for RSA in Figure 3b, and similar simulations of RD-RSA exhibit the same behavior (not shown). Our analysis in the SM also shows that RSA and RD-RSA differ at $\alpha = 1$. At this point, in RD-RSA, but not in RSA, all fixed points are globally optimal.
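The two regimes can be illustrated by running the RSA recursion for many iterations on a graded lexicon and comparing the speaker's conditional entropy on either side of the critical point. The lexicon, the values 0.5 and 2.0, the iteration count, and the function names below are assumptions chosen for illustration; the sketch does not reproduce the paper's figures.

```python
import numpy as np

def normalize(x, axis):
    """Normalize along `axis`; all-zero slices (from underflow) stay zero."""
    total = x.sum(axis=axis, keepdims=True)
    return np.divide(x, total, out=np.zeros_like(x), where=total > 0)

def conditional_entropy(S, prior):
    """H_S(U|M), using the convention 0 log 0 = 0."""
    joint = prior[:, None] * S
    mask = joint > 0
    return float(-np.sum(joint[mask] * np.log(S[mask])))

def rsa_limit_speaker(lexicon, prior, alpha, depth=100):
    """Approximate the limiting RSA speaker by iterating eqs. (2) and (3)."""
    L = normalize(lexicon * prior[:, None], axis=0)  # literal listener (assumed form)
    for _ in range(depth):
        S = normalize(L**alpha, axis=1)
        L = normalize(S * prior[:, None], axis=0)
    return S

# Hypothetical graded lexicon (no structural zeros) and uniform prior.
lexicon = np.array([[1.0, 0.2, 0.1],
                    [0.3, 1.0, 0.2],
                    [0.1, 0.3, 1.0]])
prior = np.ones(3) / 3

H_low = conditional_entropy(rsa_limit_speaker(lexicon, prior, alpha=0.5), prior)
H_high = conditional_entropy(rsa_limit_speaker(lexicon, prior, alpha=2.0), prior)
print("alpha=0.5:", round(H_low, 6), "(log 3 =", round(float(np.log(3)), 6), ")")
print("alpha=2.0:", round(H_high, 6))
```

For $\alpha = 0.5$ the speaker flattens toward the uniform, non-informative distribution, so its conditional entropy approaches $\log 3$; for $\alpha = 2$ the speaker sharpens and its conditional entropy is lower.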

Next, we consider the same basic setup but allow structural zeros in the lexicon. In this case, if there exists a maximal-utility solution (e.g., a bijection from meanings to utterances) that does not violate the lexicon, which is often the case, then it remains a global optimum for $\alpha > 1$, as in the case of a graded lexicon. On the other hand, the non-informative solution typically violates the lexicon, leading to a different behavior in the regime of $\alpha < 1$. We demonstrate this by the model simulations shown in Figure 4 (top), using the binary lexicon of Figure 1a. As expected, for $\alpha > 1$ both RSA and RD-RSA converge to the maximal-utility point (blue and green trajectories), and for $\alpha = 1$ RD-RSA converges immediately, while RSA still converges to the maximal-utility point (purple trajectories). For $\alpha < 1$, the models cannot reach the non-informative solution because it violates the lexicon (red trajectories). In this case, RD-RSA’s trajectory moves toward the non-informative solution but converges at a solution with $I_S(M;U) > 0$. RSA’s trajectory moves in the other direction, but importantly it converges at a solution with lower expected utility compared to those of $\alpha \geq 1$.

Finally, we note that considering a non-trivial cost $C(u)$, or a two-positional cost $C(m,u)$, may substantially change the dynamics of the models. We present in the SM a preliminary analysis of the influence of the cost function on the behavior of the models, and leave to future work a more comprehensive analysis of this case.

### Comparison with human behavior

We have seen thus far that even though RD-RSA is closely related to RSA, it may generate different predictions. This raises the question of how well RD-RSA can account for human behavior compared to RSA. To address this, we consider data from an online reference game experiment conducted by Vogel et al. (2014). Each trial presents a set of target objects and a set of possible speaker utterances that conform to the lexicon shown in Figure 1a. (Several different stimulus types were presented during the experiment, but the structure of the lexicon was consistent across trials. Note that we only consider data from the Complex condition.) Participants were then given a speaker utterance, and were asked to indicate which of the target objects they think the speaker refers to. As explained in Figure 1, this experimental setup invites a complex pattern of pragmatic reasoning. We estimated an empirical human listener from the responses recorded by Vogel et al., and compared that to the pragmatic listeners predicted by RSA and RD-RSA with the lexicon of Figure 1a. Following Frank et al. (2018), who have previously presented an RSA account of these data, we assume a uniform prior over meanings and no utterance cost. This setting corresponds to the model simulations of Figure 4 (top), discussed earlier. Figure 4 (bottom) shows the correlation between the model predictions and the behavioral data as a function of recursion depth. It can be seen that the predictions of both RSA and RD-RSA improve in the first few iterations, and then deteriorate with depth (except for $\alpha = 1$ in RD-RSA, which converges immediately). This is consistent with prior work on RSA (e.g., Frank et al., 2018). While the best-fitting value of $\alpha$ and depth differ between RSA (depth 1) and RD-RSA (depth 5), the resulting correlations are similar, and so are the predicted listeners (see SM Section 4). Therefore, RD-RSA is comparable to RSA in its ability to account for human behavior in this task.

### Conditional entropy or mutual information?

Last, we demonstrate an important implication of the key difference between RSA and RD-RSA, that is, maximizing $H_S(U|M)$ rather than minimizing $I_S(M;U)$. As noted before, this difference boils down to RSA maximizing $H_S(U)$ while also optimizing the RD-RSA objective. This implies that the RSA speaker, as opposed to the RD-RSA speaker, is biased toward random utterance productions. To demonstrate this, we ran simulations similar to those of Figure 4, but now with an extra utterance that can be applied to all meanings (e.g., friend). Figure 5 shows the optimal speakers in this case for two values of $\alpha$. For $\alpha < 1$ there is strong pressure to minimize communicative effort. In this case, the RSA speaker uses all utterances almost uniformly, which does not convey much information to the listener, while the RD-RSA speaker follows an intuitively simpler production distribution that exclusively uses the extra utterance. For $\alpha > 1$ there is strong pressure to maximize expected utility. In this case, the RSA speaker uses the extra utterance to randomize the description of one referent, because this increases entropy without changing the expected utility. In RD-RSA this speaker is also optimal, but so is the speaker that uses a single unique utterance for each referent (shown in the bottom right of Figure 5). Therefore, RD-RSA does not predict a cognitive bias toward randomness.

## 5 Discussion

Pragmatic reasoning is a crucial aspect of human language, often understood within the Rational Speech Act (RSA) framework. While this framework has been remarkably successful in explaining a wide range of psycholinguistic phenomena, it has not been cast in terms of a general optimization principle, leaving open the question of whether such a principle exists for characterizing human pragmatic behavior. Here, we have addressed this open question by presenting a novel information-theoretic analysis of the RSA framework. We have shown that RSA’s recursive reasoning can be derived from least-effort optimization, in contrast to a widely held view of RSA as implementing a heuristic process for maximizing utility. Furthermore, we have shown that with a small adjustment, RSA can be grounded in a more fundamental optimization principle, RD-RSA, which is based on Shannon’s Rate–Distortion (RD) theory.

We believe that RD-RSA is particularly noteworthy for several reasons. First, as our results suggest, RD-RSA avoids RSA’s bias toward non-informative random productions, while maintaining a similar ability to account for human behavior. Our analysis has been based only on data from Vogel et al. (2014), and so an important direction for future work is to further test the predictions of RD-RSA on additional experimental data.

Second, RD-RSA not only suggests a general information-theoretic principle that may give rise to human pragmatic reasoning, but also provides a potential theoretical link between pragmatics and several other aspects of language (Zaslavsky et al., 2018; Hahn et al., 2020) and cognition (Sims, 2016, 2018; Gershman, 2020) to which RD theory has recently been applied successfully. We also note that RSA’s optimization principle (6) suggests interesting theoretical links to other related frameworks, such as the recent Optimal Transport approach to cooperative communication (Wang et al., 2019), although the latter applies only to a single iteration of pragmatic reasoning in the RSA framework (but see Yuan et al., 2018, for a special case).

Finally, we argue that RD-RSA addresses a major concern about the applicability of information theory to pragmatics. As noted by Sperber and Wilson: “there is a gap between the semantic representations of sentences and the thoughts actually communicated by utterances. This gap is filled not by more coding, but by inference” (Sperber & Wilson, 1986, p. 9). RD theory lies at the intersection of information theory and statistical inference, and thus it may capture both aspects of coding and inference in pragmatics, and perhaps in language more generally.

## Acknowledgments

N.Z. was supported by a BCS Fellowship in Computation; J.H. was supported by the NIH under award number T32NS105587 and an NSF Graduate Research Fellowship; R.P.L. was supported by NSF grant BCS-1456081, a Google Faculty Research Award, and Elemental Cognition.

## References

• Bergen, L., Levy, R., & Goodman, N. (2016). Pragmatic reasoning through semantic inference. Semantics and Pragmatics, 9(20).
• Camerer, C. F., Ho, T., & Chong, J. K. (2004). A cognitive hierarchy model of games. The Quarterly Journal of Economics, 119(3), 861–898.
• Csiszár, I., & Shields, P. (2004). Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory, 1(4), 417–528.
• Csiszár, I., & Tusnády, G. E. (1984). Information geometry and alternating minimization procedures. Statistics and Decisions, Supplemental Issue 1, 205–237.
• Frank, M. C., Emilsson, A. G., Peloquin, B., Goodman, N. D., & Potts, C. (2018). Rational Speech Act models of pragmatic reasoning in reference games. PsyArXiv.
• Frank, M. C., & Goodman, N. D. (2012). Predicting pragmatic reasoning in language games. Science, 336(6084), 998.
• Franke, M., & Degen, J. (2016). Reasoning in reference games: Individual- vs. population-level probabilistic modeling. PLOS ONE, 11(5), 1–25.
• Gershman, S. J. (2020). Origin of perseveration in the trade-off between reward and complexity. bioRxiv. doi:10.1101/2020.01.16.903476
• Goodman, N. D., & Frank, M. C. (2016). Pragmatic language interpretation as probabilistic inference. Trends in Cognitive Sciences, 20(11), 818–829.
• Hahn, M., Jurafsky, D., & Futrell, R. (2020). Universals of word order reflect optimization of grammars for efficient communication. PNAS, 117(5), 2347–2353.
• Levy, R. P. (2018). Communicative efficiency, uniform information density, and the Rational Speech Act theory. In T. Rogers, M. Rau, X. Zhu, & C. Kalish (Eds.), Proceedings of the 40th Annual Meeting of the Cognitive Science Society (pp. 684–689). Austin, TX: Cognitive Science Society.
• Peloquin, B. N., Goodman, N. D., & Frank, M. C. (2019). The interactions of rational, pragmatic agents lead to efficient language structure and use. In A. Goel, C. Seifert, & C. Freksa (Eds.), Proceedings of the 41st Annual Meeting of the Cognitive Science Society (pp. 912–917). Montreal, QB: Cognitive Science Society.
• Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27.
• Sims, C. R. (2016). Rate–distortion theory and human perception. Cognition, 152, 181–198.
• Sims, C. R. (2018). Efficient coding explains the universal law of generalization in human perception. Science, 360(6389), 652–656.
• Sperber, D., & Origgi, G. (2010). A pragmatic perspective on the evolution of language. In R. K. Larson, V. Déprez, & H. Yamakido (Eds.), The Evolution of Human Language: Biolinguistic Perspectives (pp. 124–132). Cambridge Univ. Press. doi:10.1017/CBO9780511817755.009
• Sperber, D., & Wilson, D. (1986). Relevance: Communication and Cognition. Cambridge, MA: Harvard Univ. Press.
• Traugott, E. C. (2012). Pragmatics and language change. In K. Allan & K. M. Jaszczolt (Eds.), The Cambridge Handbook of Pragmatics (pp. 549–566). Cambridge University Press. doi:10.1017/CBO9781139022453.030
• Vogel, A., Emilsson, A. G., Frank, M. C., Jurafsky, D., & Potts, C. (2014). Learning to reason pragmatically with cognitive limitations. In P. Bello, M. Guarini, M. McShane, & B. Scassellati (Eds.), Proceedings of the 36th Annual Meeting of the Cognitive Science Society (pp. 3055–3060). Austin, TX: Cognitive Science Society.
• Wang, P., Wang, J., Paranamana, P., & Shafto, P. (2019). A mathematical theory of cooperative communication. arXiv preprint arXiv:1910.02822v1. https://arxiv.org/pdf/1910.02822v1.pdf
• Yuan, A., Monroe, W., Bai, Y., & Kushman, N. (2018). Understanding the Rational Speech Act model. In T. Rogers, M. Rau, X. Zhu, & C. Kalish (Eds.), Proceedings of the 40th Annual Conference of the Cognitive Science Society (pp. 2759–2764). Austin, TX: Cognitive Science Society.
• Zaslavsky, N., Kemp, C., Regier, T., & Tishby, N. (2018). Efficient compression in color naming and its evolution. PNAS, 115(31), 7937–7942.
• Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley (Reading MA).

## 1 Understanding RSA dynamics as Alternating Maximization

Here we prove that RSA recursion implements an alternating maximization algorithm for optimizing

 Gα[S,L]=HS(U|M)+αES[VL]. (19)

Before doing so, we introduce several required definitions and notations. First, we formally define an RSA reference game by a tuple $(\mathcal{M}, \mathcal{U}, P, l, C)$, where $\mathcal{M}$ is a finite set of meanings (referents), $\mathcal{U}$ is a finite set of utterances, $P(m)$ is a prior distribution over $\mathcal{M}$, $l$ is a lexicon, and $C$ is a cost function. We formally define a lexicon by a mapping $l:\mathcal{M}\times\mathcal{U}\to[0,1]$, such that $l(m,u)>0$ if $u$ can be applied to $m$ and $l(m,u)=0$ otherwise. Unless stated otherwise, we assume that $C:\mathcal{U}\to\mathbb{R}_+$, namely that $C(u)$ is a non-negative utterance cost function. More generally, $C$ could also be a two-positional cost function, i.e., $C:\mathcal{M}\times\mathcal{U}\to\mathbb{R}_+$.

Next, we define the set of all speaker and listener distributions that do not violate a given lexicon $l$. Denote by $\Delta(\mathcal{U})$ the simplex of probability distributions over $\mathcal{U}$, and by $\Delta(\mathcal{U})^{\mathcal{M}}$ the set of all conditional probability distributions of $U$ given $M$. Similarly, denote by $\Delta(\mathcal{M})^{\mathcal{U}}$ the set of all conditional probability distributions of $M$ given $U$. The set of all possible speakers that do not violate the lexicon is then

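$$\mathcal{S}_l = \left\{ S\in\Delta(\mathcal{U})^{\mathcal{M}} : S(u|m)=0 \text{ if } l(m,u)=0 \right\}, \tag{20}$$

and the set of all possible listeners that do not violate the lexicon is

$$\mathcal{L}_l = \left\{ L\in\Delta(\mathcal{M})^{\mathcal{U}} : L(m|u)=0 \text{ if } l(m,u)=0 \right\}. \tag{21}$$

It is easy to verify that $\mathcal{S}_l$ and $\mathcal{L}_l$ are convex sets.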

###### Proposition 1 (RSA optimization).

Let . The following statements hold for RSA:

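• RSA recursion implements an alternating maximization: for all $t\ge 1$, for a fixed $L_{t-1}$ it holds that

$$S_t = \arg\max_{S\in\Delta(\mathcal{U})^{\mathcal{M}}} G_\alpha[S, L_{t-1}], \tag{22}$$

and for a fixed $S_t$ it holds that

$$L_t = \arg\max_{L\in\Delta(\mathcal{M})^{\mathcal{U}}} G_\alpha[S_t, L], \tag{23}$$

where $S_t$ and $L_t$ are RSA’s speaker and listener distributions at recursion depth $t$.

• If $L_{t-1}\in\mathcal{L}_l$ then $S_t\in\mathcal{S}_l$, and if $S_t\in\mathcal{S}_l$ then $L_t\in\mathcal{L}_l$. That is, RSA iterations do not violate the hard lexicon constraints.

• The fixed points of the RSA recursion are stationary points of $G_\alpha$.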

###### Proof.

First, fix $L_{t-1}$ and note that the function $g(S)=G_\alpha[S,L_{t-1}]$ is concave in $S$. To find a maximizer for $g$ over $\Delta(\mathcal{U})^{\mathcal{M}}$ we define the Lagrangian

$$\mathcal{L}[S;\lambda] = g(S) - \sum_m \lambda(m)\sum_u S(u|m),$$

where $\lambda(m)$ are the normalization Lagrange multipliers. (We omit the non-negativity constraints because these constraints are inactive.) Note that if for some $m$ and $u$ it holds that $S(u|m)>0$ and $L_{t-1}(m|u)=0$, then $g(S)=-\infty$. Therefore, at the maximum, it necessarily holds that if $L_{t-1}(m|u)=0$ then also $S(u|m)=0$ (following the convention that $0\log 0=0$). In particular, this implies that if $L_{t-1}\in\mathcal{L}_l$ then $S_t\in\mathcal{S}_l$. That is, if $L_{t-1}$ does not violate the lexicon, then maximizing $g$ is guaranteed to give a speaker that also does not violate the lexicon. Taking the derivative of $\mathcal{L}$ with respect to $S(u|m)$, for every $m$ and $u$ such that $L_{t-1}(m|u)>0$, gives

$$\frac{\partial\mathcal{L}}{\partial S(u|m)} = P(m)\left[-\log S(u|m) - 1 + \alpha V_{t-1}(m,u)\right] - \lambda(m).$$

Equating these derivatives to zero gives RSA’s speaker (equation (2) in the main text), as a necessary condition for optimality. Because $g$ is concave, this is also a sufficient condition for this step.

Next, fix $S_t$ and consider $G_\alpha[S_t,L]$ as a function of $L$. This function is concave in $L$. To find a maximizer for it over $\Delta(\mathcal{M})^{\mathcal{U}}$, we define as before the corresponding Lagrangian and take its derivative with respect to $L(m|u)$. This gives

$$\frac{\partial\mathcal{L}}{\partial L(m|u)} = \alpha P(m)S_t(u|m)\frac{1}{L(m|u)} - \lambda(u).$$

Equating this derivative to zero gives RSA’s Bayesian listener (equation (3) in the main text) as a necessary condition for optimality. Because the function is concave in $L$, this is also a sufficient condition for this step. It is also easy to verify that if $S_t(u|m)=0$ then $L_t(m|u)=0$, and therefore if $S_t\in\mathcal{S}_l$ then $L_t\in\mathcal{L}_l$.

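Finally, at a fixed point both the derivatives with respect to $S$ and to $L$ are zero. Because these are also the derivatives of $G_\alpha$ over $\Delta(\mathcal{U})^{\mathcal{M}}\times\Delta(\mathcal{M})^{\mathcal{U}}$, it holds that the fixed point is a stationary point of $G_\alpha$. Note that $G_\alpha$ is not jointly concave in $S$ and in $L$, and therefore the fixed point is not necessarily a global maximum. ∎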
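The alternating structure in Proposition 1 can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the recursion is initialized with a literal listener $L_0(m|u)\propto l(m,u)P(m)$, and the lexicon, prior, cost, and $\alpha$ values are hypothetical toy choices. It records $G_\alpha$ after every half-step, which should never decrease.

```python
import math

def literal_listener(lexicon, prior):
    """Assumed initialization: L0(m|u) proportional to l(m,u) * P(m). Returns L[u][m]."""
    n_m, n_u = len(lexicon), len(lexicon[0])
    L = []
    for u in range(n_u):
        w = [lexicon[m][u] * prior[m] for m in range(n_m)]
        z = sum(w)
        L.append([x / z for x in w])
    return L

def speaker_update(L_prev, cost, alpha):
    """S(u|m) proportional to exp(alpha * (log L_prev(m|u) - C(u))). Returns S[m][u]."""
    n_u, n_m = len(L_prev), len(L_prev[0])
    S = []
    for m in range(n_m):
        w = [L_prev[u][m] ** alpha * math.exp(-alpha * cost[u]) for u in range(n_u)]
        z = sum(w)
        S.append([x / z for x in w])
    return S

def listener_update(S, prior):
    """Bayesian listener: L(m|u) proportional to S(u|m) * P(m). Returns L[u][m]."""
    n_m, n_u = len(S), len(S[0])
    L = []
    for u in range(n_u):
        w = [S[m][u] * prior[m] for m in range(n_m)]
        z = sum(w)
        L.append([x / z for x in w])
    return L

def G(S, L, prior, cost, alpha):
    """G_alpha[S, L] = H_S(U|M) + alpha * E_S[log L(m|u) - C(u)]."""
    total = 0.0
    for m, p in enumerate(prior):
        for u, s in enumerate(S[m]):
            if s > 0:  # convention: 0 log 0 = 0
                total += p * s * (-math.log(s) + alpha * (math.log(L[u][m]) - cost[u]))
    return total

# Hypothetical toy game: rows are meanings, columns are utterances.
lexicon = [[1, 0, 0],
           [1, 1, 0],
           [0, 1, 1]]
prior = [1 / 3, 1 / 3, 1 / 3]
cost = [0.0, 0.0, 0.0]
alpha = 2.0

L = literal_listener(lexicon, prior)
trace = []
for t in range(1, 6):
    S = speaker_update(L, cost, alpha)
    trace.append(G(S, L, prior, cost, alpha))  # G[S_t, L_{t-1}]
    L = listener_update(S, prior)
    trace.append(G(S, L, prior, cost, alpha))  # G[S_t, L_t]
```

In this sketch `trace` interleaves the values after speaker and listener updates, and the final `S` and `L` keep zeros wherever the lexicon has zeros, in line with the second statement of the proposition.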

## 2 Derivation of RD-RSA

In this section we derive the RD-RSA update equations from the minimization of

$$F_\alpha[S,L] = I_S(M;U) - \alpha\,\mathbb{E}_S[V_L]. \tag{24}$$

This can be seen as a type of Rate–Distortion (RD) optimization problem with a variable distortion measure between meanings and utterances.

###### Proposition 2 (RD-RSA).

Let and . Given , and are stationary points of if and only if they satisfy the following self-consistent conditions:

$$\begin{aligned}
S(u|m) &\propto S(u)\exp\left(\alpha V_L(m,u)\right) && (25)\\
S(u) &= \sum_m S(u|m)P(m) && (26)\\
L(m|u) &= \frac{S(u|m)P(m)}{S(u)} && (27)
\end{aligned}$$

###### Proof.

The main idea of the proof is to take the derivatives of $F_\alpha$ w.r.t. $S(u|m)$, $S(u)$, and $L(m|u)$, and equate these derivatives to zero, which gives the RD-RSA equations (25)-(27) as necessary conditions for optimality. This derivation is similar to the derivation in the proof of Proposition 1, and therefore we do not repeat it here. ∎

Note that $F_\alpha$ is convex in each of $S(u|m)$, $S(u)$, and $L(m|u)$ separately, although it is not jointly convex in these three distributions. Therefore, similar to the proof of Proposition 1, it holds that $F_\alpha$ can be optimized via an alternating minimization algorithm that iteratively updates equations (25)-(27), as described in the main text. However, because $F_\alpha$ is not jointly convex in these variables, this iterative algorithm will not necessarily converge to a global optimum.
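As an illustration of this alternating minimization, here is a minimal sketch of the RD-RSA updates (25)-(27). The initialization (a literal listener and a uniform utterance marginal) and all numerical values are assumptions made for this example, not taken from the paper; a strictly positive lexicon is used so that no division by zero can occur.

```python
import math

def rd_rsa_sweep(q, L, prior, cost, alpha):
    """One sweep of equations (25)-(27). q[u] is S(u); L[u][m] is L(m|u)."""
    n_u, n_m = len(L), len(L[0])
    # (25): S(u|m) proportional to S(u) * exp(alpha * (log L(m|u) - C(u)))
    S = []
    for m in range(n_m):
        w = [q[u] * L[u][m] ** alpha * math.exp(-alpha * cost[u]) for u in range(n_u)]
        z = sum(w)
        S.append([x / z for x in w])
    # (26): utterance marginal
    q_new = [sum(S[m][u] * prior[m] for m in range(n_m)) for u in range(n_u)]
    # (27): Bayesian listener
    L_new = [[S[m][u] * prior[m] / q_new[u] for m in range(n_m)] for u in range(n_u)]
    return S, q_new, L_new

def F(S, q, L, prior, cost, alpha):
    """F_alpha = I_S(M;U) - alpha * E_S[log L(m|u) - C(u)], with q the marginal of S."""
    total = 0.0
    for m, p in enumerate(prior):
        for u, s in enumerate(S[m]):
            if s > 0:  # convention: 0 log 0 = 0
                rate = math.log(s / q[u])
                utility = math.log(L[u][m]) - cost[u]
                total += p * s * (rate - alpha * utility)
    return total

# Hypothetical graded lexicon with no zeros (rows: meanings, columns: utterances).
lexicon = [[0.9, 0.3, 0.1],
           [0.4, 0.8, 0.2],
           [0.1, 0.3, 0.9]]
prior = [0.5, 0.3, 0.2]
cost = [0.0, 0.1, 0.2]
alpha = 1.5

# Assumed initialization: literal listener and uniform utterance marginal.
L = []
for u in range(3):
    w = [lexicon[m][u] * prior[m] for m in range(3)]
    z = sum(w)
    L.append([x / z for x in w])
q = [1 / 3, 1 / 3, 1 / 3]

values = []
for _ in range(20):
    S, q, L = rd_rsa_sweep(q, L, prior, cost, alpha)
    values.append(F(S, q, L, prior, cost, alpha))
```

Each sweep minimizes $F_\alpha$ over one distribution at a time, so the entries of `values` should be non-increasing. The value they settle at depends on the initialization, consistent with the lack of joint convexity noted above.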

## 3 Asymptotic behavior and the criticality of α=1

In this section we analyze the asymptotic behavior of RSA and RD-RSA dynamics. In both cases, we focus mainly on the basic RSA setup discussed in the main text. We also present preliminary analysis of the influence of the cost function.

### 3.1 RSA

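Denote by $G^*_\alpha$ the maximal value of $G_\alpha$ given $\alpha$, and let $S^*_\alpha$ and $L^*_\alpha$ be optimal speaker and listener distributions that attain $G^*_\alpha$. That is, $G^*_\alpha = G_\alpha[S^*_\alpha, L^*_\alpha]$. The following proposition characterizes $G^*_\alpha$, $S^*_\alpha$, and $L^*_\alpha$, as a function of $\alpha$, in a basic RSA setup.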

###### Proposition 3 (Asymptotic behavior of RSA).

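Let $C(u)$ be a constant function, $P(m)$ be the uniform distribution over $\mathcal{M}$, and assume $|\mathcal{M}|=|\mathcal{U}|$. In addition, assume a graded lexicon with no structural zeros. Then the following statements hold:

1. For $\alpha\le 1$, $S^*_\alpha(u|m)=\frac{1}{|\mathcal{U}|}$ and $L^*_\alpha(m|u)=P(m)$.

2. For $\alpha\ge 1$, $S^*_\alpha$ and $L^*_\alpha$ are deterministic distributions defined by a bijection from $\mathcal{M}$ to $\mathcal{U}$.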

###### Proof.

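We prove these claims by first deriving an upper bound on $G_\alpha$ and then showing that the given $S^*_\alpha$ and $L^*_\alpha$ attain this bound in the two regimes of $\alpha$. Assume w.l.o.g. that $C(u)=0$, and let $S(m|u)$ be the posterior distribution with respect to $S(u|m)$ and $P(m)$. For any $S$ and $L$ it holds that

$$\begin{aligned}
G_\alpha[S,L] &\le H_S(U|M) + \alpha\,\mathbb{E}_S[\log S(m|u)] && (28)\\
&= H_S(U|M) + \alpha\,\mathbb{E}_S\!\left[\log\frac{S(u|m)P(m)}{S(u)}\right] && (29)\\
&= (\alpha-1)I_S(M;U) + H_S(U) - \alpha H(M). && (30)
\end{aligned}$$

Equation (28) follows from the fact that $\mathbb{E}_p[\log q]\le\mathbb{E}_p[\log p]$ for any two distributions $p$ and $q$, and specifically for $S(m|u)$ and $L(m|u)$. Equation (29) follows from substituting Bayes’ rule, and (30) from the definition of entropy and the identity $I_S(M;U)=H_S(U)-H_S(U|M)$.

For $\alpha\le 1$, this bound is maximal when $I_S(M;U)=0$ and $H_S(U)=\log|\mathcal{U}|$. Therefore, in this regime $G_\alpha[S,L]\le\log|\mathcal{U}|-\alpha H(M)$, and it is easy to verify that $S^*_\alpha(u|m)=\frac{1}{|\mathcal{U}|}$ and $L^*_\alpha(m|u)=P(m)$ attain this bound (simply substitute these distributions in the definition of $G_\alpha$). Specifically, when $P(m)$ is uniform, it holds that $H(M)=\log|\mathcal{M}|=\log|\mathcal{U}|$, and therefore $G^*_\alpha=(1-\alpha)\log|\mathcal{M}|$. For $\alpha\ge 1$ it holds that $(\alpha-1)I_S(M;U)\le(\alpha-1)H(M)$, and therefore $G_\alpha[S,L]\le H_S(U)-H(M)\le\log|\mathcal{U}|-H(M)$. When $P(m)$ is uniform, this bound becomes $0$. Let $\phi:\mathcal{M}\to\mathcal{U}$ be a bijection, and set $S^*_\alpha(u|m)=\delta_{u,\phi(m)}$ and $L^*_\alpha(m|u)=\delta_{m,\phi^{-1}(u)}$. In this case, $H_{S^*_\alpha}(U|M)=0$ and $\mathbb{E}_{S^*_\alpha}[\log L^*_\alpha(m|u)]=0$, following the convention that $0\log 0=0$. Therefore, these distributions attain the bound for $\alpha\ge 1$. Putting everything together gives $G^*_\alpha=\max\{(1-\alpha)\log|\mathcal{M}|,\,0\}$, which concludes the proof. ∎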
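The two regimes can be illustrated with a small numerical check. This is a sketch under the assumptions of Proposition 3 (zero cost, uniform prior, equally many meanings and utterances); the game size, the $\alpha$ values, and the random sampling scheme are arbitrary choices for illustration. It evaluates $G_\alpha$ for the non-informative pair, for a bijective pair, and for random speakers paired with their Bayesian listeners.

```python
import math
import random

def G(S, L, prior, alpha):
    """G_alpha[S, L] with zero cost; S[m][u] is S(u|m), L[u][m] is L(m|u)."""
    total = 0.0
    for m, p in enumerate(prior):
        for u, s in enumerate(S[m]):
            if s > 0:  # convention: 0 log 0 = 0
                total += p * s * (-math.log(s) + alpha * math.log(L[u][m]))
    return total

def bayes_listener(S, prior):
    """L(m|u) proportional to S(u|m) * P(m), the best listener for a given speaker."""
    n_m, n_u = len(S), len(S[0])
    L = []
    for u in range(n_u):
        w = [S[m][u] * prior[m] for m in range(n_m)]
        z = sum(w)
        L.append([x / z for x in w])
    return L

n = 3  # |M| = |U| = 3 (illustrative size)
prior = [1.0 / n] * n
uniform_S = [[1.0 / n] * n for _ in range(n)]   # S(u|m) = 1/|U|
prior_L = [list(prior) for _ in range(n)]        # L(m|u) = P(m)
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # bijection

random.seed(0)
report = {}
for alpha in (0.5, 1.0, 2.0):
    optimum = max((1 - alpha) * math.log(n), 0.0)
    best_random = -math.inf
    for _ in range(200):
        S = []
        for m in range(n):
            w = [random.random() + 1e-3 for _ in range(n)]
            z = sum(w)
            S.append([x / z for x in w])
        best_random = max(best_random, G(S, bayes_listener(S, prior), prior, alpha))
    report[alpha] = {
        "uniform": G(uniform_S, prior_L, prior, alpha),
        "bijection": G(identity, identity, prior, alpha),
        "optimum": optimum,
        "best_random": best_random,
    }
```

In `report`, the non-informative pair gives $(1-\alpha)\log|\mathcal{M}|$ and the bijective pair gives $0$. The larger of the two matches `optimum`, and none of the sampled speakers exceeds it.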

Proposition 3 shows that in the basic RSA setup that corresponds to its assumptions, there is only one critical value $\alpha_c=1$, which determines the global optimum of $G_\alpha$ and the asymptotic tendency of the RSA dynamics. However, when $C(u)$ is not a constant function there could be multiple critical values $\alpha_c$. We next show that in this case the first critical value $\alpha_c$, at which the non-informative solution loses its optimality, is $\alpha_c\ge 1$. To see this, notice that adding the utterance cost to the bound in (30) gives

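$$\begin{aligned}
G_\alpha[S,L] &\le (\alpha-1)I_S(M;U) + H_S(U) - \alpha H(M) - \alpha\,\mathbb{E}_S[C(U)] && (31)\\
&= (\alpha-1)I_S(M;U) - D[S(u)\,\|\,Q_\alpha(u)] + \log Z_\alpha - \alpha H(M) && (32)
\end{aligned}$$

where $Q_\alpha(u)$ is the maximum entropy distribution over $\mathcal{U}$ with respect to $C(u)$ defined by

$$Q_\alpha(u) = \frac{e^{-\alpha C(u)}}{Z_\alpha}, \qquad Z_\alpha = \sum_u e^{-\alpha C(u)},$$

and $D[\cdot\,\|\,\cdot]$ is the Kullback-Leibler (KL) divergence. For $\alpha\le 1$, the first two terms in (32) are non-positive and therefore (32) can be further bounded from above, yielding

$$G_\alpha[S,L] \le \log Z_\alpha - \alpha H(M). \tag{33}$$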

It is easy to verify that this upper bound is attained by $S^*_\alpha(u|m)=Q_\alpha(u)$ and $L^*_\alpha(m|u)=P(m)$ (as before, to see this substitute these distributions in $G_\alpha$). In this case, $S^*_\alpha$ changes continuously for $\alpha\in[0,1]$, even though these changes do not convey any information to the listener. In other words, in this regime, the RSA model predicts that a pragmatic speaker will not try to convey any information to the listener ($I_S(M;U)=0$), but will rather seek the minimal deviation from random utterance production that reduces the expected utterance cost to a tolerable degree, determined by $\alpha$. This is another demonstration of RSA’s bias toward random utterance production. Finally, we note that more generally, if the cost function is two-positional, that is $C(m,u)$, then it is possible that $\alpha_c<1$.
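The substitution mentioned above can also be carried out numerically. The sketch below uses a hypothetical prior and cost vector (neither is from the paper), builds the maximum entropy distribution $Q_\alpha$, and compares $G_\alpha$ at the non-informative pair $S(u|m)=Q_\alpha(u)$, $L(m|u)=P(m)$ against the bound (33).

```python
import math

def max_entropy_distribution(cost, alpha):
    """Q_alpha(u) = exp(-alpha * C(u)) / Z_alpha. Returns (Q, Z)."""
    w = [math.exp(-alpha * c) for c in cost]
    Z = sum(w)
    return [x / Z for x in w], Z

def G(S, L, prior, cost, alpha):
    """G_alpha[S, L] = H_S(U|M) + alpha * E_S[log L(m|u) - C(u)]."""
    total = 0.0
    for m, p in enumerate(prior):
        for u, s in enumerate(S[m]):
            if s > 0:  # convention: 0 log 0 = 0
                total += p * s * (-math.log(s) + alpha * (math.log(L[u][m]) - cost[u]))
    return total

# Hypothetical values, chosen only for illustration.
prior = [0.5, 0.3, 0.2]
cost = [0.0, 0.5, 1.0]
H_M = -sum(p * math.log(p) for p in prior)

rows = []
for alpha in (0.25, 0.5, 1.0):
    Q, Z = max_entropy_distribution(cost, alpha)
    S = [list(Q) for _ in prior]      # S(u|m) = Q_alpha(u): independent of m
    L = [list(prior) for _ in cost]   # L(m|u) = P(m): listener learns nothing
    rows.append({
        "alpha": alpha,
        "Q": Q,
        "value": G(S, L, prior, cost, alpha),
        "bound": math.log(Z) - alpha * H_M,
    })
```

For each $\alpha$ in this example, `value` and `bound` agree up to floating-point error. The entries of `Q` shift toward the cheapest utterance as $\alpha$ grows, even though the speaker remains independent of the meaning.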

### 3.2 RD-RSA

Next, we characterize the asymptotic behavior of RD-RSA in the basic setup discussed in the main text. Denote by $F^*_\alpha$ the minimal value of $F_\alpha$ given $\alpha$, and let $S^*_\alpha$ and $L^*_\alpha$ be optimal speaker and listener distributions that attain $F^*_\alpha$.

###### Proposition 4.

Let $C(u)$ be a constant function, then the following statements hold for RD-RSA:

1. For $\alpha<1$, $S^*_\alpha\in\arg\min_S I_S(M;U)$.

2. For $\alpha>1$, $S^*_\alpha\in\arg\max_S I_S(M;U)$.

3. For $\alpha=1$, $F^*_\alpha=H(M)$ and all stationary points are optimal.

###### Proof.

The idea of the proof is similar to the proof of Proposition 3. Here, however, we derive a lower bound for $F_\alpha$. For this, we take similar steps as in (28)-(29) but adapt them to $F_\alpha$ by replacing the conditional entropy in the first term by $I_S(M;U)$ and changing the sign of the second term. This gives the lower bound

$$F_\alpha[S,L] \ge (1-\alpha)I_S(M;U) + \alpha H(M). \tag{34}$$

For $\alpha<1$, this lower bound is minimal when $I_S(M;U)$ is minimal, i.e. when $I_S(M;U)=0$, which is attained by a non-informative speaker, e.g. $S(u|m)=\frac{1}{|\mathcal{U}|}$. For $\alpha>1$, this lower bound is minimal when $I_S(M;U)$ is maximal. Therefore, the optimum in this regime is given by $S^*_\alpha\in\arg\max_S I_S(M;U)$. If there exists a bijection $\phi:\mathcal{M}\to\mathcal{U}$ that does not violate the lexicon, then $S^*_\alpha(u|m)=\delta_{u,\phi(m)}$ and $L^*_\alpha(m|u)=\delta_{m,\phi^{-1}(u)}$ attain this bound. Finally, for $\alpha=1$, any fixed point $(S^*,L^*)$ of the RD-RSA equations gives

$$\mathbb{E}_{S^*}[\log L^*(m|u)] = \mathbb{E}_{S^*}[\log S^*(m|u)] = -H_{S^*}(M|U), \tag{35}$$

and therefore $F_1[S^*,L^*]=I_{S^*}(M;U)+H_{S^*}(M|U)=H(M)$. This means that all fixed points are equally good in this regime. Note also that the lower bound (34) in this case becomes $H(M)$. ∎
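For reference, the steps behind the lower bound (34) can be spelled out as follows. This is a sketch that assumes $C(u)=0$ (which only shifts $F_\alpha$ by a constant when the cost is constant) and $\alpha\ge 0$, with $S(m|u)$ denoting the posterior induced by $S(u|m)$ and $P(m)$, as in the proof of Proposition 3:

$$\begin{aligned}
F_\alpha[S,L] &= I_S(M;U) - \alpha\,\mathbb{E}_S[\log L(m|u)]\\
&\ge I_S(M;U) - \alpha\,\mathbb{E}_S[\log S(m|u)]\\
&= I_S(M;U) + \alpha H_S(M|U)\\
&= (1-\alpha)I_S(M;U) + \alpha H(M).
\end{aligned}$$

The inequality uses the same fact as (28), and the last step uses $H_S(M|U)=H(M)-I_S(M;U)$. Equality holds when $L$ is the Bayesian posterior of $S$, which is the case at any fixed point of the RD-RSA equations.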

## 4 Comparison with human behavior

In the main text we have shown that both RSA and RD-RSA produce listener distributions that are highly correlated with the empirical human listener estimated from the experimental data of Vogel et al. (2014). Here we supplement that evaluation with the figure below, which shows that the best RSA listener and the best RD-RSA listener are indeed very similar to each other and to the empirically estimated human listener.