From partners to populations: A hierarchical Bayesian account of coordination and convention

04/12/2021 ∙ by Robert D. Hawkins, et al. ∙ Princeton University

Languages are powerful solutions to coordination problems: they provide stable, shared expectations about how the words we say correspond to the beliefs and intentions in our heads. Yet language use in a variable and non-stationary social environment requires linguistic representations to be flexible: old words acquire new ad hoc or partner-specific meanings on the fly. In this paper, we introduce a hierarchical Bayesian theory of convention formation that aims to reconcile the long-standing tension between these two basic observations. More specifically, we argue that the central computational problem of communication is not simply transmission, as in classical formulations, but learning and adaptation over multiple timescales. Under our account, rapid learning within dyadic interactions allows for coordination on partner-specific common ground, while social conventions are stable priors that have been abstracted away from interactions with multiple partners. We present new empirical data alongside simulations showing how our model provides a cognitive foundation for explaining several phenomena that have posed a challenge for previous accounts: (1) the convergence to more efficient referring expressions across repeated interaction with the same partner, (2) the gradual transfer of partner-specific common ground to novel partners, and (3) the influence of communicative context on which conventions eventually form.


1 Convention formation as Hierarchical Bayesian inference

In this section, we propose a unified computational account of ad hoc coordination and convention formation that aims to address these three empirical puzzles. We begin from first principles: What is the core computational problem that must be solved to achieve successful communication? Classically, this problem has been formulated in terms of coding and compression (Shannon48). An intended meaning in the speaker's mind must be encoded as a signal that is recoverable by the receiver after passing through a noisy transmission channel. This literal transmission problem has since been enriched to account for pragmatics – the ability of speakers and listeners to use context and social knowledge to draw inferences beyond the literal meaning of messages (sperber1986relevance). We take the Rational Speech Act framework (RSA; frank_predicting_2012; goodman_pragmatic_2016; FrankeJager16_ProbabilisticPragmatics) as representative of this current synthesis, formalizing communication as recursive social inference in a probabilistic model. In the next section, we review this basic framework and then raise two fundamental computational problems facing it that motivate our proposal.

1.1 RSA models of communication with static meaning

In our referential communication setting¹, the RSA framework defines a pragmatic speaker, denoted by $S$, who must choose an utterance $u$ that will allow their partner to choose a particular target object $o^*$ from the current communicative context $C$. They attempt to satisfy the Gricean Maxims (Grice75_LogicConversation) by selecting utterances according to a utility function that balances informativity to an imagined listener $L_0$ against the cost of producing an utterance. Specifically, $S$ chooses from a "soft-max" distribution concentrating probability mass on the utterance that maximizes utility, to an extent modulated by a parameter $\alpha$. As $\alpha \rightarrow \infty$, the speaker increasingly chooses the utterance with maximal utility:

$$S(u \mid o^*) \propto \exp\{\alpha \cdot [\,\ln L_0(o^* \mid u) - \beta \cdot \textrm{cost}(u)\,]\} \qquad (1)$$

¹For concreteness, we restrict our scope to reference in a discrete context of objects, but the same formulation applies to more general spaces of meanings.

where $\textrm{cost}(u)$ is a function giving the cost of producing $u$, under the assumption that longer utterances are more costly. The speaker thus has two free parameters: the soft-max optimality $\alpha$ and the weight $\beta$, which controls the relative importance of informativity and parsimony in the speaker's production.

The imagined literal listener $L_0$ in Eq. 1 is assumed to identify the target using a lexical meaning function capturing the literal semantics of the utterance $u$. That is, the probability of the literal listener choosing object $o$ is proportional to the meaning of $u$ under a static lexical meaning function $\mathcal{L}$:

$$L_0(o \mid u) \propto \mathcal{L}(u, o) \cdot P(o)$$

Throughout this paper, we will take $\mathcal{L}$ to be a traditional truth-conditional function evaluating whether a given object is in the extension of the utterance²:

$$\mathcal{L}(u, o) = \begin{cases} 1 & \textrm{if } u \textrm{ is true of } o \\ 0 & \textrm{otherwise} \end{cases}$$

²Note that the normalization constant may be exactly zero for some possible lexicons – for instance, if a given utterance is literally false of all objects in context – in which case these distributions are not well-defined. See Appendix A for technical details of how we address this problem.

However, there are many alternative representational choices compatible with our core model, including fuzzier, continuous semantics (degen2020redundancy) or vector embeddings learned by a neural network (potts2019case; see Appendix B for examples), which may be more appropriate for scaling the model to larger spaces of words and referents. We return to these possibilities in the General Discussion.
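To make Eq. 1 concrete, here is a minimal Python sketch of the literal listener and pragmatic speaker. This is our own illustration, not the paper's WebPPL implementation; all names and parameter values are assumptions chosen for clarity.

```python
import numpy as np

def literal_listener(lexicon, prior=None):
    """L0(o|u) ∝ L(u,o) * P(o), normalized over objects for each utterance."""
    n_utt, n_obj = lexicon.shape
    prior = np.ones(n_obj) / n_obj if prior is None else prior
    scores = lexicon * prior
    return scores / scores.sum(axis=1, keepdims=True)

def speaker(lexicon, cost, alpha=4.0, beta=1.0):
    """S(u|o) ∝ exp{alpha * [ln L0(o|u) - beta * cost(u)]}  (Eq. 1)."""
    L0 = literal_listener(lexicon)
    with np.errstate(divide="ignore"):      # log(0) -> -inf for false utterances
        utility = np.log(L0) - beta * cost[:, None]
    scores = np.exp(alpha * utility)
    return scores / scores.sum(axis=0, keepdims=True)   # normalize over utterances

# two utterances x two objects; u1 is true of o1 only, u2 of o2 only
lexicon = np.array([[1, 0], [0, 1]])
print(speaker(lexicon, cost=np.array([1.0, 1.0])))      # each target gets its true label
```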

1.2 Two fundamental problems for static meaning

This basic framework and its extensions have accounted for a variety of important phenomena in pragmatic language use (e.g., Scontras_problang; KaoWuBergenGoodman14_NonliteralNumberWords; TesslerGoodman16_Generics; LassiterGoodman15_AdjectivalVagueness). Yet it retains a key assumption from classical models: that the speaker and listener must share the same literal "protocol" for encoding and decoding messages. In this section, we highlight two under-appreciated challenges of communication that complicate this assumption.

The first challenge is variability in linguistic meaning throughout a language community. Different listeners may recover systematically different meanings from the same message, and different speakers may encode the same message in different ways. For example, doctors may fluently communicate with one another about medical conditions using specialized terminology that is meaningless to a patient. The words may not be in the patient's lexicon, and even common words may be used in non-standard ways. That is, being fluent speakers of the same language does not ensure perfect overlap for the relevant meanings that need to be transmitted in every context: different partners may simply be using different functions $\mathcal{L}$.

The second challenge is the non-stationarity of the world. Agents are continually presented with new thoughts, feelings, and entities, which they may not already have efficient conventions to talk about. For example, when new technology is developed, the community of developers and early adopters must find ways of referring to the new concepts they are working on (e.g. e-mailing, the Internet). Or, when researchers design a new experiment with multiple conditions, they must find ways of talking about their own ad hoc abstractions, often converging on idiosyncratic names that can be used seamlessly in meetings. That is, any literal protocol that we may write down at one time would be quickly outdated at a later time (see lazaridou2021pitfalls, for a demonstration of the related problems posed by non-stationarity for large neural language models). We must have some ability to extend our language on the fly as needed.

1.3 A hierarchical model of dynamic meaning

Rather than assuming a monolithic, universally shared language, we argue that agents solve the core problems posed by variability and non-stationarity by attempting to continually, adaptively infer the system of meaning used by each partner, in context. When all agents are continually learning in this way, we will show that they are not only able to locally coordinate on ad hoc meanings with specific partners but also able to abstract away linguistic conventions that are expected to be shared across an entire community. We introduce our model in three steps, corresponding to three core capacities: hierarchical uncertainty about meaning, online partner-specific learning, and inductive generalization.

C1: Hierarchical uncertainty about meaning

When an agent encounters a communication partner, they must call upon some representation of what they expect different signals will mean to that partner. We therefore replace the static function $\mathcal{L}$ with a parameterized family of lexical meaning functions $\mathcal{L}_\theta$, indexed by $\theta$, where different values of $\theta$ yield different possible systems of meaning. To expose the dependence on a fixed system of meaning, Eq. 1 can be re-written to give behavior under a fixed value of $\theta$:

$$S(u \mid o^*, \theta) \propto \exp\{\alpha \cdot [\,\ln L_0(o^* \mid u, \theta) - \beta \cdot \textrm{cost}(u)\,]\}, \quad L_0(o \mid u, \theta) \propto \mathcal{L}_\theta(u, o) \cdot P(o) \qquad (2)$$

While we will remain agnostic for now to the exact functional form of $\mathcal{L}_\theta$ and the exact parameter space of $\theta$ (see the Inference details section below), there are two computational desiderata we emphasize.

First, given the challenge of variability raised in the previous section, these expectations ought to be sensitive to the overall statistics of the population. An agent should know that there is tighter consensus about the meaning of dog than the meaning of, say, specialized medical terms like sclerotic aorta (Clark98_CommunalLexicons), and conversely, should expect more consensus around how to refer to familiar concepts than new or ambiguous concepts. This desideratum – representing population variability – motivates a probabilistic formulation. Instead of holding a single static function $\mathcal{L}$, which an agent assumes is shared perfectly in common ground (i.e. one for the whole population), we assume each agent maintains uncertainty over the exact meaning of words as used by different partners. In a Bayesian framework, this uncertainty is specified by a prior probability distribution $P(\theta)$ over possible values of $\theta$. For example, imagine that under some possible values of $\theta$, the term "sclerotic aorta" has truth conditions related to a specific condition of the heart, but under other values of $\theta$, it does not: a well-trained doctor approaching a stranger should not assume their partner is using either but should assign some probability to each case. The introduction of uncertainty over a partner's literal semantics has previously been explored in the context of one-shot pragmatic reasoning, where it was termed lexical uncertainty (bergen_pragmatic_2016), and in the context of iterated dyadic interactions (SmithGoodmanFrank13_RecursivePragmaticReasoningNIPS).

Figure 2: Schematic of hierarchical Bayesian model. At the highest level, denoted by $\Theta$, is a representation of aspects of meanings expected to be shared across all partners. These global conventions serve as a prior for the systems of meanings used by specific partners, $\theta_k$. These partner-specific representations give rise in turn to predictions about their language use $P(D_k \mid \theta_k)$, where $D_k$ represents observations in a communicative interaction with partner $k$. By inverting this model, agents can adapt to local partner-specific conventions and update their beliefs about global conventions.

Second, this representation should also, in principle, be sensitive to the social identity of the partner: a cardiologist should have different expectations about a long-time colleague than about a new patient. This desideratum – sensitivity to partner-specific meanings – motivates a hierarchical model, where uncertainty is represented by a multi-level prior. At the highest level of the hierarchy is community-level uncertainty $P(\Theta)$, where $\Theta$ represents an abstract "overhypothesis" about the overall distribution of possible partners. $\Theta$ then parameterizes the agent's partner-specific uncertainty $P(\theta_k \mid \Theta)$, where $\theta_k$ represents the specific system of meaning used by partner $k$ (see Fig. 2). We focus for simplicity on this basic two-layer hierarchy, but the model can be straightforwardly extended to represent uncertainty at intermediate layers of social structure, including whether partners belong to distinct sub-communities (e.g. represented by discrete latent variables) or vary along latent dimensions (e.g. represented by a topic mixture). We return to these possible extensions in the General Discussion.

To integrate lexical uncertainty into our speaker and listener models, we assume they each act in a way that is expected to be successful on average, under likely values of $\theta$ (SmithGoodmanFrank13_RecursivePragmaticReasoningNIPS). In other words, they sample actions by marginalizing over their own beliefs $P(\theta)$ about the different meanings their partner may be using:

$$\begin{aligned} S(u \mid o^*) &\propto \exp\{\alpha_S \cdot \mathbb{E}_{P(\theta)}[\,\ln L_0(o^* \mid u, \theta) - \beta \cdot \textrm{cost}(u)\,]\} \\ L(o \mid u) &\propto \exp\{\alpha_L \cdot \mathbb{E}_{P(\theta)}[\,\ln S(u \mid o, \theta)\,]\} \end{aligned} \qquad (3)$$

where $\alpha_S$ and $\alpha_L$ control the speaker's and listener's soft-max optimality, respectively³.

³We denote $S$ and $L$ without a subscript because they are the only speaker and listener models we use in simulations throughout the paper – the subscripted definitions are internal constructs used to define these models – but in the terminology of the RSA framework they represent speaker- and listener-level pragmatic agents with lexical uncertainty. We found that higher levels of recursion were not strictly necessary to derive the phenomena of interest, but higher-level lexical uncertainty models may be obtained by replacing the internal agents in the listener equation and in the speaker's utility definition with standard RSA definitions of higher-level agents (e.g., zaslavsky2020rate).
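A sketch of the marginalization in Eq. 3, in Python. We enumerate candidate lexicons explicitly and average log-informativity under the agent's beliefs; the names and parameter values below are ours, chosen only for illustration.

```python
import numpy as np
from itertools import product

def literal_listener(lex):
    return lex / lex.sum(axis=1, keepdims=True)        # uniform object prior

def marginal_speaker(beliefs, lexicons, cost, alpha_s=4.0, beta=1.0):
    """S(u|o) ∝ exp{alpha_s * (E_theta[ln L0(o|u,theta)] - beta*cost(u))}  (Eq. 3)."""
    exp_log_L0 = sum(p * np.log(np.maximum(literal_listener(lex), 1e-10))
                     for p, lex in zip(beliefs, lexicons))
    scores = np.exp(alpha_s * (exp_log_L0 - beta * cost[:, None]))
    return scores / scores.sum(axis=0, keepdims=True)

# all four 'one object per utterance' lexicons for 2 utterances x 2 objects
lexicons = [np.eye(2)[list(r)] for r in product(range(2), repeat=2)]
beliefs = np.ones(len(lexicons)) / len(lexicons)       # flat prior: no conventions yet
print(marginal_speaker(beliefs, lexicons, np.zeros(2)))  # uniform: all utterances tie
```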

C2: Partner-specific online learning

The formulation in Eq. 3 derives how agents ought to act under uncertainty about the lexicon being used by their partner, $P(\theta_k)$. But how do beliefs about their partner change over time? Although an agent may begin with significant uncertainty about the system of meaning their partner is using in the current context, further interactions provide useful information for reducing that uncertainty and therefore improving the success of communication. In other words, ad hoc convention formation may be re-cast as an inference problem. Given observations $D_k$ from interactions with partner $k$, an agent can update their beliefs about their partner's latent system of meaning $\theta_k$ following Bayes' rule:

$$P(\Theta, \theta_k \mid D_k) \propto P(D_k \mid \theta_k) \cdot P(\theta_k \mid \Theta) \, P(\Theta) \qquad (4)$$

This joint inference decomposes the partner-specific learning problem into two terms, a prior term $P(\theta_k \mid \Theta)\,P(\Theta)$ and a likelihood term $P(D_k \mid \theta_k)$. The prior term captures the idea that, in the absence of strong evidence of partner-specific language use, the agent ought to regularize toward their background knowledge of conventions: the aspects of meaning that all partners are expected to share in common. The likelihood term represents predictions about how a partner would use language in context under different underlying systems of meaning (as specified in the Referential Feedback section below).

Importantly, the posterior obtained in Eq. 4 allows agents to explicitly maintain partner-specific expectations, as used in Eq. 3, by marginalizing over community-level uncertainty:

$$P(\theta_k \mid D_k) = \int_\Theta P(\Theta, \theta_k \mid D_k) \, d\Theta \qquad (5)$$

This posterior can be viewed as the “idiolect” that has been fine-tuned to account for partner-specific common ground from previous interactions. We will show that when agents learn about their partner in this way, and adjust their own production or comprehension accordingly (i.e. Eq. 3), they are able to coordinate on stable ad hoc conventions.
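The update in Eq. 4 can be sketched by exhaustive enumeration when the lexicon space is small. The helper names below are ours, the prior is flat rather than hierarchical, and the likelihood uses the fixed-lexicon speaker of Eq. 2, so this illustrates only the partner-specific term:

```python
import numpy as np
from itertools import product

def literal_listener(lex):
    return lex / lex.sum(axis=1, keepdims=True)

def speaker_given_lex(lex, alpha=4.0):
    """S(u|o, theta) with zero cost, per Eq. 2."""
    scores = np.exp(alpha * np.log(np.maximum(literal_listener(lex), 1e-10)))
    return scores / scores.sum(axis=0, keepdims=True)

def update_beliefs(prior, lexicons, observations):
    """P(theta|D) ∝ P(theta) * prod_t S(u_t | o*_t, theta)  (speaker-role data)."""
    log_post = np.log(prior)
    for target, utterance in observations:
        log_post += np.log([max(speaker_given_lex(lex)[utterance, target], 1e-10)
                            for lex in lexicons])
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

lexicons = [np.eye(2)[list(r)] for r in product(range(2), repeat=2)]
prior = np.ones(4) / 4
# the partner (as speaker) used utterance 0 for object 0 on two trials;
# pragmatic reasoning favors the lexicon where utterance 1 means the *other* object
print(update_beliefs(prior, lexicons, [(0, 0), (0, 0)]))
```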

C3: Inductive generalization to new partners

The posterior in Eq. 4 also provides an inductive pathway for partner-specific data to inform beliefs about community-wide conventions. Agents update their beliefs about $\Theta$, using data accumulated from different partners, by marginalizing over beliefs about specific partners:

$$P(\Theta \mid D) \propto P(\Theta) \prod_{k=1}^{K} \int_{\theta_k} P(D_k \mid \theta_k) \, P(\theta_k \mid \Theta) \, d\theta_k \qquad (6)$$

where $D = \{D_1, \dots, D_K\}$ collects the data from each partner and $K$ is the number of partners previously encountered. Intuitively, when multiple partners are inferred to use similar systems of meaning, beliefs about $\Theta$ shift to represent this abstracted knowledge: it becomes more likely that a novel partner in one's community will share it as well. Note that this population-level posterior over $\Theta$ not only represents what the agent has learned about the central tendency of the group's conventions, but also the spread or variability, capturing the notion that some word meanings may be more widespread than others.

The updated beliefs about $\Theta$ should then guide the prior expectations an agent brings into subsequent interactions with strangers. This transfer is sometimes referred to as "sharing of strength" or "partial pooling" because pooled data is smoothly integrated with domain-specific knowledge. This property has been key to explaining how the human mind solves a range of other difficult inductive problems in the domains of concept learning (KempPerforsTenenbaum07_HBM; tenenbaum_how_2011), causal learning (KempPerforsTenenbaum07_HBM; KempGoodmanTenenbaum10_LearningToLearn), motor control (berniker2008estimating), and speech perception (kleinschmidt2015robust). A key consequence of such transfer in hierarchical models is the "blessing of abstraction" (GoodmanUllmanTenenbaum11_TheoryOfCausality), where it is possible under certain conditions for beliefs about the community's conventions in general to outpace beliefs about the idiosyncrasies of individual partners (gershman2017blessing). We return to this property in the context of language acquisition in the General Discussion.
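To convey the shape of Eq. 6 computationally, here is a toy sketch on a discrete grid. It compresses the model heavily: the community-level variable is reduced to a single probability that one utterance-object mapping holds community-wide, partner data are reduced to success/failure counts, and the 0.99/0.01 response probabilities are arbitrary stand-ins for the likelihood. Treat it as an intuition pump, not the paper's computation.

```python
import numpy as np

thetas = np.linspace(0.05, 0.95, 19)          # candidate community-level values
log_post = np.zeros_like(thetas)              # flat prior P(Theta)

# each partner k contributes D_k: counts of consistent vs inconsistent uses
partner_data = [(4, 0), (3, 1), (4, 0)]       # three partners, mostly consistent
for successes, failures in partner_data:
    # integrate over the partner's lexicon: mapping holds vs does not, given Theta
    lik = thetas * (0.99 ** successes * 0.01 ** failures) \
        + (1 - thetas) * (0.01 ** successes * 0.99 ** failures)
    log_post += np.log(lik)

post = np.exp(log_post - log_post.max())
post /= post.sum()
print(thetas[np.argmax(post)])                # mass shifts toward a shared convention
```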

1.4 Further challenges for convention formation

The formulation in the previous section presents the core of our theory. Here, we highlight several additional features of our model, which address more specific challenges raised by prior work on communication and which we will encounter in the simulations reported in the remainder of the paper. Our organization of these details is motivated by the analysis of spike_minimal_2017, who distilled three common problems that all accounts of convention must address: (1) the availability of referential feedback, (2) a form of information loss or forgetting, and (3) a systematic bias against ambiguity. Finally, we explain practical details of how we perform inference in this model.

Referential feedback

Learning and adaptation depend on the availability and quality of observations throughout a communicative interaction. If the speaker has no way of assessing the listener's understanding, or if the listener has no way of comparing their interpretation against the speaker's intentions, however indirectly, they can only continue to rely on their prior, and there is no ground for conventions to form (KraussWeinheimer66_Tangrams; HupetChantraine92_CollaborationOrRepitition; GarrodFayLeeOberlanderMacLeod07_GraphicalSymbolSystems). So, what data should each agent use to update their beliefs at a particular point in an interaction?

In principle, we expect that $D_k$ reflects all relevant sources of information that may expose an agent's understanding or misunderstanding, including verbal and non-verbal backchannels (mmhmm, nodding), clarification questions, and actions taken in the world. In the more minimal setting of a reference game, we use the full feedback provided by the task, where the speaker's intended target and the listener's response are revealed at the end of each trial. Formally, this information can be written as a set of tuples $\{(o^*_t, u_t, o_t)\}_t$, where $o^*_t$ denotes the speaker's intended target, $u_t$ denotes the utterance they produced, and $o_t$ denotes the listener's response, on each previous trial $t$.

Now, to specify the likelihoods in Eq. 4 for our referential setting, we assume each agent should infer their partner's lexicon by conditioning on their partner's previous choices. The listener on a given trial should use the probability $S(u_t \mid o^*_t, \theta_k)$ that a speaker would produce $u_t$ to refer to the target $o^*_t$ under different values of $\theta_k$, and the speaker should likewise use the probability $L(o_t \mid u_t, \theta_k)$ that their partner would produce response $o_t$ after hearing utterance $u_t$. This symmetry, where each agent is attempting to learn from the other's behavior, creates a clear coordination problem⁴. In the case of an error, where the agent in the listener role hears the utterance $u_t$ and chooses an object $o_t$ other than the intended target $o^*_t$, they will receive feedback about the intended target and subsequently condition on the fact that the speaker chose $u_t$ to convey that target. Meanwhile, the agent in the speaker role will subsequently condition on the likelihood that the listener chose the object $o_t$ upon hearing their utterance. In other words, each agent will subsequently condition on slightly different data, leading to conflicting beliefs. Whether or not agents are able to resolve early misunderstandings through further interaction and eventually reach consensus depends on a number of factors.

⁴In some settings, agents in one role may be expected to take on more of the burden of adaptation, leading to an asymmetric division of labor (e.g., MorenoBaggio14_AsymmetrySignaling). This may be especially relevant in the presence of asymmetries in power, status, or capability. In principle, this could be reflected in differing values of the parameters $\alpha_S$ and $\alpha_L$, but we leave consideration of such asymmetries for future work.

Memory and forgetting

One important constraint is imposed by the basic cognitive mechanisms of memory and forgetting. It is unrealistic to expect that every example from every past interaction in the set of observations $D_k$ is equally accessible to the agent. Furthermore, this may be to the agent's advantage. Without any mechanism for forgetting, early errors may prevent coordination much later in an interaction, as each agent's lexical beliefs must explain all previous observations equally. Without the ability to discount earlier data points, agents may be prevented from ever reaching consensus with their partner (spike_minimal_2017).

Forgetting is typically incorporated into Bayesian models with a decay term in the likelihood function (anderson2000adaptive; angela2009sequential; fudenberg2014recency; kalm2018visual):

$$P(D_k \mid \theta_k) \propto \prod_{\tau=1}^{t} P(d_\tau \mid \theta_k)^{\exp\{-\delta (t - \tau)\}}$$

where $t$ indexes the most recent trial and decay increases further back through time. This decay term is motivated by the empirical power function of forgetting (wixted1991form), and can be interpreted as the expectation over a process where observations have some probability of dropping out of memory at each time step. Indeed, this likelihood can be derived by simply extending our hierarchical model down an additional layer within each partner to allow for the possibility that they are using slightly different lexicons at different points in time; assuming a degree of auto-correlation between neighboring time points yields this form of discounting. Alternatively, at the algorithmic level, decay can be viewed as a form of weighted importance sampling, where more recent observations are preferentially sampled (pearl2010online)⁵.

⁵While this simple decay model is sufficient for our simulations, it is missing important mechanistic distinctions between working memory and long-term memory; for example, explaining convention formation over longer timescales may require an explicit model of consolidation or source memory for context.
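A minimal sketch of this decay-weighted likelihood, assuming the exponential weighting written above; the function and parameter names are ours.

```python
import numpy as np

def decayed_log_likelihood(log_liks, delta=0.2):
    """log_liks[tau] = log P(d_tau | theta); the most recent observation is last."""
    t = len(log_liks) - 1
    weights = np.exp(-delta * (t - np.arange(len(log_liks))))  # 1.0 for trial t
    return float(np.dot(weights, log_liks))

# an early observation that contradicts theta is discounted by the time trial 5 arrives
print(decayed_log_likelihood(np.array([-4.6, -0.1, -0.1, -0.1, -0.1])))
```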

Bias against ambiguity

A third specific challenge is posed by ambiguity: if a speaker uses a label to refer to one target, it is consistent with the data for the listener to subsequently believe that the same expression may be acceptable for other targets as well. In our account, this problem is naturally solved by the principles of pragmatic reasoning instantiated in the RSA framework (Grice75_LogicConversation), which has been explicitly linked to mutual exclusivity in word learning (bloom2002children; FrankGoodmanTenenbaum09_Wurwur; SmithGoodmanFrank13_RecursivePragmaticReasoningNIPS; gulordava2020one; ohmerreinforcement). Gricean pragmatic reasoning critically allows agents to learn from the absence of evidence by reasoning about alternatives.

Pragmatic reasoning plays two distinct roles in our model. First, Gricean agents assume that their partner is using language in a cooperative manner and account for this when inferring their partner's language model. That is, we use these equations as the linking function in the likelihood $P(D_k \mid \theta_k)$, representing an agent's prediction about how a partner with meaning function $\mathcal{L}_{\theta_k}$ would actually behave in context (Eq. 4). The speaker model is used to learn from observations generated in the speaker role, and the listener model is used to learn from observations generated in the listener role. For example, upon hearing their partner use a particular utterance $u$ to refer to an object $o$, a pragmatic agent can not only infer that $u$ means $o$ in their partner's lexicon, but also that other utterances likely do not mean $o$: if they did, the speaker would have used them instead. Second, agents do not only make passive inferences from observation; they participate in the interaction by using language themselves. A Gricean agent's own production and comprehension is also guided by cooperative principles (Eq. 3).

Many minor variations on the basic RSA model have been explored in previous work, and it is worth highlighting three technical choices in our formulation. First, both agents are "action-oriented," in the sense that they behave proportionally to the utility of different actions, according to a soft-max normalization. This contrasts with some RSA applications, where the listener is instead assumed to be "belief-oriented," simply inferring the speaker's intended meaning without producing any action of their own (qing2015variations). Second, our instantiation of lexical uncertainty differs subtly from the one used by bergen_pragmatic_2016, which placed the integral over lexical uncertainty at a single level of recursion (specifically, within a pragmatic listener agent). Instead, we argue that it is more natural in an interactive, multi-agent setting for each agent to maintain uncertainty at the highest level, such that each agent is reasoning about their partner's lexicon regardless of what role they are currently playing.

Lastly, we must define how the agent behaves when an utterance is false of all objects in context under their lexicon (i.e. when the normalization constant is exactly 0). Previous proposals have assumed that the literal listener should choose uniformly in this case. However, this assumption has the unintuitive consequence that the speaker's utility of using an utterance known to be false of the target may be the same as an utterance known to be true. To address this concern, we always add a low-prior-probability 'null object' to the literal listener's context, for which all utterances evaluate to true. This alternative can be interpreted as a 'failure to refer' and effectively prevents the literal listener from assigning belief to a referent for which the utterance is literally false (this case is distinct from the case of a contradiction, which arises when defining the meaning of multi-word utterances in Section P1 below). See Appendix A for further details and discussion of our RSA implementation.

Inference details

While our simulations in the remainder of the paper each address different scenarios, we have aimed to hold as many details as possible constant throughout the paper. First, we must be concrete about the space of possible lexicons that parameterizes the lexical meaning function $\mathcal{L}_\theta$. For consistency with previous Bayesian models of word learning (e.g., XuTenenbaum07_WordLearningBayesian), we take the space of possible meanings for an utterance to be the set of nodes in a concept taxonomy. When targets of reference are conceptually distinct, as typically assumed in signaling games, the target space of utterance meanings reduces to the discrete space of individual objects. For this special case, with $m$ utterances and $n$ objects, the parameter space contains exactly $n^m$ possible values for $\theta$, corresponding to all possible mappings between utterances and individual objects. Each possible lexicon can therefore be written as a binary matrix where the rows correspond to utterances, and each row contains exactly one object. The truth-conditional function $\mathcal{L}_\theta(u, o)$ then simply checks whether the element in row $u$ matches object $o$. For example, there are four possible lexicons for two utterances and two objects:

$$\theta \in \left\{ \begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix} \right\}$$

Second, having defined the support of the parameter $\theta$, we can then define a simplicity prior $P(\theta) \propto \exp\{-|\theta|\}$ following FrankGoodmanTenenbaum09_Wurwur, where $|\theta|$ is the total size of each word's extension, summed across words in the vocabulary. Again, for traditional signaling games, this reduces to a uniform prior because all possible lexicons are the same size: each word's extension contains exactly one object. We can compactly write distributions over $\theta$ in terms of the same utterance-object matrix, where row $i$ represents the marginal distribution over possible meanings of utterance $u_i$. For example, the uninformative prior for two utterances and two objects can be written:

$$P(\theta) = \begin{pmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{pmatrix}$$

This prior becomes more important for P3, however, where we consider spaces of referents with more complex conceptual structure and a larger space of possible meanings. A single word may apply to multiple conceptually related referents or, conversely, may have an empty meaning, where it is effectively removed from the agent’s vocabulary. In this more general case, a simplicity prior generically favors smaller word meanings as well as a smaller effective vocabulary size, since the null meaning has the smallest extension.
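The following sketch enumerates this more general Boolean lexicon space and the simplicity prior over it; as above, the code and its names are our own illustration rather than the paper's implementation.

```python
import numpy as np
from itertools import product

def all_boolean_lexicons(n_utt, n_obj):
    """Every lexicon where each word's extension is any subset of the objects."""
    rows = list(product([0, 1], repeat=n_obj))
    return [np.array(lex) for lex in product(rows, repeat=n_utt)]

lexicons = all_boolean_lexicons(2, 2)                  # 16 lexicons for 2 words, 2 objects
sizes = np.array([lex.sum() for lex in lexicons])      # |theta|: summed extension sizes
prior = np.exp(-sizes)
prior /= prior.sum()
# smaller extensions (including the empty, effectively 'removed' word) get more mass
print(sizes.min(), prior[np.argmin(sizes)], sizes.max(), prior[np.argmax(sizes)])
```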

Finally, while the probabilistic model we have formulated in this section is theoretically well-motivated and mathematically well-defined, it has previously been challenging to actually derive predictions from it. Historically, interactive models have been challenging to study with closed-form analytical techniques and computationally expensive to study through simulation, likely contributing to the prevalence of simplified heuristics in prior work. Our work has been facilitated by recent advances in probabilistic inference techniques that have helped to overcome these obstacles. We have implemented our simulations in the probabilistic programming language WebPPL (GoodmanStuhlmuller14_DIPPL). All of our simulations iterate the following trial-level loop: (1) sample an utterance from the speaker's distribution, given the target object, (2) sample an object from the listener's object distribution, given the utterance produced by the speaker, (3) append the results to the list of observations, and (4) update both agents' posteriors, conditioning on these observations before continuing to the next trial.

To obtain the speaker and listener distributions (steps 1-2; Eq. 2), we always use exhaustive enumeration for exact inference. We would prefer to use enumeration to obtain posteriors over lexical meanings as well (step 4; Eq. 4), but as the space of possible lexicons grows, enumeration becomes intractable. For simulations related to P2 and P3, we therefore switch to Markov Chain Monte Carlo (MCMC) methods to obtain samples from each agent's posteriors, and approximate the expectations in Eq. 3 by summing over these samples. Because we are emphasizing a set of phenomena where our model makes qualitatively different predictions than previous models, our goal in this paper is to illustrate and evaluate these qualitative predictions rather than provide exact quantitative fits to empirical data. As such, we proceed by examining predictions for a regime of parameter values that helps distinguish our predictions from other accounts.
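To tie the pieces together, here is an end-to-end Python sketch of the trial-level loop above, using enumeration over the four candidate lexicons. It simplifies the full model in labeled ways: there is no hierarchical prior or memory decay, both agents condition on the same (target, utterance) data, and all names and parameter values are ours.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
LEX = [np.eye(2)[list(r)] for r in product(range(2), repeat=2)]  # 4 candidate lexicons
ALPHA, EPS = 6.0, 1e-10

def L0(lex):                              # literal listener with a uniform object prior
    return lex / lex.sum(axis=1, keepdims=True)

def S_theta(lex):                         # speaker with a fixed lexicon (Eq. 2), zero cost
    s = np.exp(ALPHA * np.log(np.maximum(L0(lex), EPS)))
    return s / s.sum(axis=0, keepdims=True)

def S(beliefs):                           # marginal speaker (Eq. 3)
    util = sum(p * np.log(np.maximum(L0(lex), EPS)) for p, lex in zip(beliefs, LEX))
    s = np.exp(ALPHA * util)
    return s / s.sum(axis=0, keepdims=True)

def L(beliefs):                           # marginal listener (Eq. 3)
    util = sum(p * np.log(np.maximum(S_theta(lex), EPS)) for p, lex in zip(beliefs, LEX))
    l = np.exp(ALPHA * util)
    return l / l.sum(axis=1, keepdims=True)

def posterior(data):                      # Eq. 4 by enumeration: flat prior, no decay
    logp = np.zeros(len(LEX))
    for target, u in data:
        logp += np.log([max(S_theta(lex)[u, target], EPS) for lex in LEX])
    p = np.exp(logp - logp.max())
    return p / p.sum()

beliefs = np.ones(4) / 4                  # both agents start from the flat prior
data = []
for trial in range(10):
    target = trial % 2
    u = rng.choice(2, p=S(beliefs)[:, target])      # (1) speaker samples an utterance
    o = rng.choice(2, p=L(beliefs)[u])              # (2) listener samples an object
    data.append((target, u))                        # (3) feedback appended (simplified)
    beliefs = posterior(data)                       # (4) both agents update identically
    print(f"trial {trial}: target={target} u={u} o={o} success={o == target}")
```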

2 Phenomenon #1: Ad hoc conventions become more efficient

We begin by considering the phenomenon of increasing efficiency in repeated reference games: speakers use detailed descriptions at the outset but converge to an increasingly compressed shorthand while remaining understandable to their partner. While this phenomenon has been extensively documented, to the point of serving as a proxy for measuring common ground, it has continued to pose a challenge for models of communication.

Figure 3: Path-dependence of conventions. The average trajectory of each agent's beliefs about the meaning of utterance $u_1$ is shown in blue and orange following all eight possible outcomes of the first trial in Simulation 1.1. For each of the two possible targets, the speaker could choose to produce either of the two utterances, and the listener could respond by choosing either of the two objects. In the cases where the listener chose correctly (marked with a checkmark), agents subsequently conditioned on the same data and rapidly converged on a system of meaning consistent with this feedback. For example, in the first row, when $u_1$ was successfully used to refer to the circle, both agents subsequently believe that $u_1$ means circle in their partner's lexicon. In the cases where the listener fails to choose the target, the agents subsequently condition on different data, and they converge on a convention that is determined by later choices (lines represent the trajectories of individual agents).

For example, one possibility is that speakers coordinate on meaning through priming mechanisms at lower levels of representation, as proposed by influential interactive alignment accounts (pickering2004toward; pickering2006alignment; garrod2009joint). While low-level priming may be at play in repeated reference tasks, especially when listeners engage in extensive dialogue or alternate roles, it is not clear why priming would cause descriptions to get shorter as opposed to aligning on the same initial description. Furthermore, priming alone cannot explain why speakers still converge to more efficient labels even when the listener is prevented from saying anything at all and only minimal feedback is provided showing that the listener is responding correctly (KraussWeinheimer66_Tangrams); conversely, speakers continue using longer descriptions when they receive non-verbal feedback that the listener is repeatedly making errors (see also hawkins2020characterizing). In these cases, there are no linguistic features available for priming or alignment mechanisms. Explaining when and why speakers believe that shorter descriptions will suffice requires a mechanism for coordination on meaning even given sparse, non-verbal feedback.

Another possibility is that speakers coordinate on meaning using some lexical update rule that makes utterances more likely to be produced after communicative successes and less likely after communicative failures, such as a variant on the Roth-Erev reinforcement learning rule (erev1998predicting) adopted by a variety of agent-based models (steels_self-organizing_1995; barr_establishing_2004; young_evolution_2015). While reinforcement is a powerful mechanism for allowing groups to reach consensus, it is not clear why a newly initialized speaker would prefer to produce longer utterances over shorter utterances, or, if this bias were built in, how simply reinforcing initially long descriptions could lead utterances to get shorter. In the rare cases where some form of reduction has been investigated in this family of models (e.g. the phenomenon of phonological erosion), the process has been hard-coded as a fixed probability of speakers dropping a random token at each point in time (beuls2013agent; steels2016agent).

Such random dropping, however, is an unsatisfying explanation for several reasons. First, it formalizes a reductive explanation of efficiency in terms of speaker-internal noise (or laziness). This idea dates back to the early literature on repeated reference games, and control experiments by HupetChantraine92_CollaborationOrRepitition were designed to test this possibility (see also GarrodFayLeeOberlanderMacLeod07_GraphicalSymbolSystems). Participants were asked to repeatedly refer to the same targets for a hypothetical partner to see later, such that any effects of familiarity or repetition on the part of the speaker were held constant relative to the interactive task. No evidence of reduction was found, and in some cases utterances actually grew longer. This accords with observations in multi-partner settings by wilkes-gibbs_coordinating_1992, which we explore further in P2: it is difficult to explain why a speaker who only shortened their descriptions due to a noise process would suddenly switch back to a longer utterance when their partner is exchanged. Whatever drives efficiency cannot be explained through speaker laziness; it must be a result of the interaction between partners.

In this section, we argue that our Bayesian account provides a rational explanation for increasing efficiency in terms of the inferences made by speakers across repeated interaction. Given that this phenomenon arises in purely dyadic settings, it also provides an opportunity to explore more basic properties of the first two capacities formalized in our model (representing uncertainty and partner-specific learning) before introducing hierarchical generalization in the next section. In brief, we show that increasing efficiency is a natural consequence of the speaker's tradeoff between informativity and parsimony (Eq. 3), given their inferences about the listener's language model. For novel, ambiguous objects like tangrams, where speakers do not expect strong referential conventions to be shared, longer initial descriptions are motivated by high initial uncertainty in the speaker's lexical prior $P(\theta)$. Proposing multiple descriptors is a rational hedge against the possibility that a particular utterance will be misinterpreted and give the listener a false belief. As the interaction goes on, the speaker obtains feedback from the listener's responses and updates their posterior beliefs accordingly. As uncertainty gradually decreases, they are able to achieve the same expected informativity with shorter, more efficient messages.

Figure 4: Pairs of agents learn to successfully coordinate on efficient ad hoc conventions over repeated interactions. (A) Agents converge on accurate communication systems in Simulation 1.1, where only single-word utterances are available, and (B) converge on shorter, more efficient conventions in Simulation 1.2, where multi-word utterances were available. Error bars are bootstrapped 95% CIs across 1000 trajectories, computed within each repetition block of two trials.

2.1 Simulation 1.1: Pure coordination

We build up to our explanation of increasing efficiency by first exploring a traditional signaling game scenario with only one-word utterances. This simulation tests the most fundamental competency for any model of ad hoc coordination: agents are able to coordinate on a communication system in the absence of shared priors. We consider the simplest possible reference game with two objects, $\{o_1, o_2\}$, where the speaker must choose between two one-word utterances, $u_1$ and $u_2$, with equal production cost.

We walk explicitly through the first step of the simulation to illustrate the model's dynamics (see Fig. 3). Suppose the target object presented to the speaker agent on the initial trial is $o_1$. Both utterances are equally likely to apply to either object under the uniform lexical prior, hence each utterance is expected to be equally (un)informative. The speaker's utility therefore reduces to sampling an utterance at random. Suppose $u_1$ is sampled. The listener then hears this utterance and selects an object according to their own expected utility under their uniform lexical prior, which also reduces to sampling an object at random. Suppose they choose $o_1$, a correct response. Both agents may use the resulting tuple $(o_1, u_1, o_1)$, depicted in the top row of Fig. 3, to update their beliefs about the lexicon their partner is using.

They then proceed to the next trial, where they use this updated posterior distribution, instead of their prior, to produce or interpret language. To examine how the dynamics of this updating process unfold over further rounds, we simulated 1000 such trajectories. The trial sequence was structured as a repeated reference game, containing 30 trials structured into 15 repetition blocks. The two objects appeared in a random order within each block, and agents swapped roles at the beginning of each block. We show representative behavior at fixed soft-max optimality and memory discounting parameter values, but find similar behavior in a wide regime of parameter values (see Appendix Fig. A1).

We highlight several key results from this simulation. First, and most fundamentally, the communicative success of the dyad rises over the course of interaction: the listener is able to more accurately select the intended target object (see Fig. 4A). Second, the initial symmetry between meanings in the prior is broken by initial choices, leading to arbitrary but stable mappings in future rounds. Because agents were initialized with the same priors in every trajectory, trajectories only diverged when different actions happened to be sampled. This can be seen by examining the path-dependence of subsequent beliefs based on the outcome of the initial trial in Fig. 3. Third, we observe the influence of mutual exclusivity via Gricean pragmatic reasoning: agents also make inferences about objects and utterances that were not chosen. For example, observing $u_1$ used to refer to $o_1$ provides evidence that $u_2$ likely does not mean $o_1$ (e.g. the third row of Fig. 3, where hearing $u_1$ refer to $o_2$ immediately led to the inference that $u_2$ likely refers to $o_1$).

2.2 Simulation 1.2: Increasing efficiency

Next, we show how our model explains speakers' gains in efficiency over multiple interactions. For efficiency to change at all, speakers must be able to produce utterances that vary in length. For this simulation, we therefore extend the model to allow for multi-word utterances by allowing speakers to combine multiple primitive utterances. Intuitively, human speakers form longer initial descriptions by combining a collection of simpler descriptions (e.g. "kind of an X, or maybe a Y with Z on top"). This raises the problem of how the meaning of a multi-word utterance is derived from its components. To capture the basic desideratum that an object should be more likely to be chosen by the literal listener when more components of the longer utterance apply to it, we adopt a standard conjunctive semantics⁶:

$$\mathcal{L}_\theta(u_i u_j, o) = \mathcal{L}_\theta(u_i, o) \wedge \mathcal{L}_\theta(u_j, o)$$

⁶One subtle consequence of a conjunctive Boolean semantics is the possibility of contradictions. For example, under a possible lexicon where $u_1$ means $o_1$ and $u_2$ means $o_2$, the multi-word utterance $u_1 u_2$ is not only false of a particular referent in the current context, it is false of all possible referents; it fails to refer, reflecting a so-called truth-gap (Strawson50_OnReferring; van1966singular). We assume such an utterance is uninterpretable and simply does not change the literal listener's beliefs. While this assumption is sufficient for our simulations, we regard this additional complexity as a limitation of classical Boolean semantics and show in Appendix B that switching to a more realistic continuous semantics with lexical values in the interval $[0, 1]$ (degen2020redundancy) may better capture the notion of redundancy that motivates speakers to initially produce longer utterances.
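A small sketch of this conjunctive semantics, in our own notation, with Boolean meanings represented as 0/1 vectors over objects:

```python
import numpy as np

def conjoin(meaning_a, meaning_b):
    """Meaning of the two-word utterance 'a b': elementwise AND of components."""
    return meaning_a * meaning_b

a = np.array([1, 0])    # a is true of o1 only
b = np.array([1, 1])    # b is true of both objects
print(conjoin(a, b))    # [1 0]: 'a b' picks out o1
print(conjoin(np.array([1, 0]), np.array([0, 1])))  # [0 0]: a contradiction; fails to refer
```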

Now, we consider a scenario with the same two objects as in Simulation 1.1, but give the speaker four primitive utterances instead of only two, and allow two-word utterances such as $u_1 u_2$. We established in the previous section that successful ad hoc conventions can emerge even in a state of pure uncertainty, but human participants in repeated reference games typically bring some prior expectations about language into the interaction. For example, a participant who hears 'ice skater' on the first round of the task in ClarkWilkesGibbs86_ReferringCollaborative may be more likely to select some objects than others while still having substantial uncertainty about the intended target (e.g. over the three of the twelve tangrams that have some resemblance to an ice skater). We thus initialize both agents with weak biases, with $u_1$ and $u_2$ slightly favoring $o_1$ and $u_3$ and $u_4$ slightly favoring $o_2$ (represented in compressed matrix form in Fig. 5):

$$P(\theta) = \begin{pmatrix} 0.55 & 0.45 \\ 0.55 & 0.45 \\ 0.45 & 0.55 \\ 0.45 & 0.55 \end{pmatrix}$$

Figure 5: Schematic of speaker for first trial of Simulation 1.2. The speaker begins with uncertainty about the meanings in the listener's lexicon (e.g. assigning 55% probability to the possibility that utterance $u_1$ means object $o_1$). A target is presented, and the speaker samples an utterance from the distribution $S(u \mid o_1)$. Finally, they observe the listener's response and update their beliefs. Due to the compositional semantics of the utterance $u_1 u_2$, the speaker becomes increasingly confident that both component primitives, $u_1$ and $u_2$, apply to object $o_1$ in their partner's lexicon.
Figure 6: Internal state of speaker in example trajectory from Simulation 1.2. Each term of the speaker's utility (Eq. 3) is shown throughout an interaction. When the speaker is initially uncertain about meanings (far left), the longer utterance $u_1 u_2$ has higher expected informativity (center-left) and therefore higher utility (center-right) than the shorter utterances $u_1$ and $u_2$, despite its higher cost (far-right). As the speaker observes several successful interactions, they update their beliefs and become more confident about the meanings of the component lexical items $u_1$ and $u_2$. As a result, more efficient single-word utterances gradually gain in utility as cost begins to dominate the utility. On trial 5, $u_1$ is sampled, breaking the symmetry between utterances.

As in Simulation 1.1, we simulated 1000 distinct trajectories of dyadic interaction between agents. Utterance cost was defined to be the number of 'words' in an utterance, so $\textrm{cost}(u_1) = 1$ and $\textrm{cost}(u_1 u_2) = 2$. As shown in Fig. 4B, our speaker agent initially prefers the longer utterance (higher mean utterance length on the first block) but rapidly converges to shorter utterances after several repetitions (lower mean length on the final block), qualitatively matching the curves measured in the empirical literature.

To illustrate in detail how our model derives this behavior, we walk step-by-step through a single trial (Fig. 5). Consider a speaker who wants to refer to object $o_1$. They expect their partner to be slightly more likely to interpret their language using a lexicon in which $u_1$ and $u_2$ apply to this object, due to their weak initial biases. However, there is still a reasonable chance (45%) that either $u_1$ or $u_2$ alone will be interpreted to mean $o_2$, giving their partner false beliefs.

To see why our speaker model initially prefers the longer utterance to hedge against this possibility, despite its higher production cost, consider the expected informativity of $u_1 u_2$ under different possible lexicons. The possibility with highest probability is that both $u_1$ and $u_2$ mean $o_1$ in the listener's lexicon ($0.55^2 \approx 0.30$), in which case the listener will correctly identify $o_1$ with high probability. The possibility that both mean $o_2$ in the listener's lexicon is only $0.45^2 \approx 0.20$, in which case the listener will erroneously select $o_2$. In the mixed cases, where one component means $o_1$ and the other means $o_2$ in the listener's lexicon ($2 \times 0.55 \times 0.45 \approx 0.50$), the utterance would be interpreted as a contradiction and the listener would not change their prior beliefs. Because the speaker's informativity is defined using the log probability of the listener's belief, the utility of giving the listener a false belief (i.e. placing near-zero probability on the target) is significantly worse than simply being uninformative (i.e. leaving the listener at their prior), and the longer utterance minimizes this harm.
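This hedging argument can be checked numerically. The sketch below uses the 55/45 biases from the text but is otherwise our own construction, with a small constant standing in for the near-zero belief produced by a false utterance:

```python
import numpy as np

EPS = 1e-10
p_match = 0.55                      # P(each of u1, u2 means the target o1)
cases = []                          # (probability of lexicon case, listener belief in o1)
for a_true in [1, 0]:
    for b_true in [1, 0]:
        p = (p_match if a_true else 1 - p_match) * (p_match if b_true else 1 - p_match)
        if a_true and b_true:   belief = 1.0    # conjunction picks out the target
        elif a_true or b_true:  belief = 0.5    # contradiction: listener keeps prior
        else:                   belief = EPS    # false of target: listener misled
        cases.append((p, belief))

E_conj = sum(p * np.log(b) for p, b in cases)
E_single = p_match * np.log(1.0) + (1 - p_match) * np.log(EPS)
print(E_conj, E_single)   # the conjunction avoids the catastrophic false-belief case
```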

Following the production of a conjunction, the speaker observes the listener's response (say, $o_1$). This allows both agents to become more confident that the component utterances $u_1$ and $u_2$ mean $o_1$ in their updated posterior over the listener's lexicon. This credit assignment to individual lexical items is a consequence of the compositional meaning of longer utterances in our simple grammar. The listener knows that a speaker for whom both $u_1$ and $u_2$ individually mean $o_1$ would have been more likely to say $u_1 u_2$ than a speaker for whom either component meant $o_2$; and similarly for the speaker reasoning about possible listeners. Consequently, the probability of both mappings increases.

Fig. 6 shows the trajectories of internal components of the speaker utility as the interaction continues. We assume for illustrative purposes in this example that $o_1$ continues to be the target on each trial and the same agent continues to be the speaker. As the posterior probability that the individual primitive utterances $u_1$ and $u_2$ independently mean $o_1$ increases (far left), the marginal gap in informativity between the conjunction and the shorter components gradually decreases (center-left). As a consequence, production cost increasingly dominates the utility (center-right). After several trials of observing a successful listener response given the conjunction, the informativity of the two shorter utterances reaches parity with the conjunction, and the cost makes the shorter utterances more attractive (yielding a situation now similar to the outset of Simulation 1.1). Once the speaker samples one of the shorter utterances (e.g. $u_1$), the symmetry collapses and that utterance remains most probable in future rounds, allowing for a stable and efficient ad hoc convention. Thus, increasing efficiency is derived as a rational consequence of uncertainty and partner-specific inference about the listener's lexicon. For these simulations we used a single representative parameter setting, but the qualitative reduction effect is found over a range of different parameters (see Appendix Fig. A2).

2.3 Discussion

The simulations presented in this section aimed to establish a rational explanation for feedback-sensitive increases in efficiency over the course of ad hoc convention formation. Speakers initially hedge their descriptions under uncertainty about the lexical meanings their partner is using, but are able to get away with less costly components of those descriptions as their uncertainty decreases. Our explanation recalls classic observations about hedges – expressions like sort of or like, and morphemes like -ish, that explicitly mark provisionality, such as a car, sort of silvery purple colored (Fraser10_Hedging; MedlockBriscoe07_HedgeClassification). BrennanClark96_ConceptualPactsConversation counted hedges across repetitions of a repeated reference game, finding a greater occurrence of hedges on early trials than later trials and a greater occurrence under more ambiguous contexts. While our model does not explicitly produce hedges, it is possible to understand this behavior as an explicit or implicit marker of the lexical uncertainty in our account. If hedges are explicit and intentional, this would require the speaker to reason about a listener that is to some degree aware of uncertainty or graded meanings. This is not the case with a classical truth-functional lexical representation, but is formalized naturally under soft semantics (see Appendix B).

We have already discussed why this phenomenon poses a challenge for the simple model-free reinforcement learning models in the literature – namely, that successful listener feedback would only reinforce long utterances, with no mechanism for shortening them. This observation is not intended to rule out the entire family of reinforcement learning approaches, however. It is plausible that more sophisticated model-based reinforcement learning algorithms are flexible enough to account for the phenomenon. For instance, hierarchical architectures that appropriately incorporate compositionality or incrementality into the speaker's production model may be able to reinforce component parts of longer utterances in the shared history (e.g., hawkins2019continual). Still, such an approach would have more in common with our proposal than with the model-free heuristics in the existing literature. We return to this question with respect to scalability in the General Discussion.

Finally, the theory of reduction explored in this section is consistent with recent analyses of exactly what gets reduced in a large corpus of repeated reference games (hawkins2020characterizing). These analyses found that entire modifying clauses are more likely to be dropped at once than expected by random corruption, and function words like determiners are mostly dropped as parts of larger noun phrases or prepositional phrases rather than omitted on their own. In other words, speakers apparently begin by combining multiple descriptive 'units' and collapse to one of these 'units', rather than dropping words at random and assuming the listener can recover the intended longer utterance, as predicted by a noisy channel model. This theoretical claim is further supported by early hand-tagged analyses by Carroll80_NamingHedges, which found that in three-quarters of transcripts from krauss_changes_1964, the conventions that participants eventually converged upon were prominent in some syntactic construction at the beginning, often as a head noun that was initially modified or qualified by other information.

While our account explains these observations as a result of structure and heterogeneity in the lexical prior, it remains an open question how to instantiate appropriately realistic priors in our computational model. Our simulation only considered two-word descriptions with homogeneous uncertainty over the components, and it is likely that the semantic components of real initial descriptions have more heterogeneous uncertainty: for example, the head noun may be chosen due to a higher prior probability of being understood by the listener than other components of the initial description, thus predicting asymmetries in reduction. Future work is needed to elicit these priors and evaluate predictions about more fine-grained patterns of reduction.

3 Phenomenon #2: Conventions gradually generalize to new partners in a community

How do we make the inferential leap from ad hoc conventions formed through interaction with a single partner to global conventions expected to be shared throughout a community? Grounding collective convention formation in the individual learning mechanisms explored in the previous section requires an explicit theory of generalization capturing how people transfer what they have learned from one partner to the next.

One influential theory is that speakers simply ignore the identity of different partners and update a single monolithic representation after every interaction (steels_self-organizing_1995; barr_establishing_2004; young_evolution_2015). We call this a complete-pooling theory because data from each partner is collapsed into an undifferentiated pool of evidence (gelman2006data). Complete-pooling models have been remarkably successful at predicting collective behavior on networks, but have typically been evaluated only in settings where anonymity is enforced. For example, centola_spontaneous_2015 asked how large networks of participants coordinated on conventional names for novel faces. On each trial, participants were paired with a random neighbor but were not informed of that neighbor's identity, or of the total number of different possible neighbors.

While complete-pooling may be appropriate for some everyday social interactions, such as coordinating with anonymous drivers on the highway, it is less tenable for everyday communicative settings. Knowledge about a partner's identity is both available and relevant for conversation (eckert_three_2012; davidson_nice_1986). Partner-specificity thus poses clear problems for complete-pooling theories but can be easily explained by another simple model, where agents maintain separate expectations about meaning for each partner. We call this a no-pooling model (note that this no-pooling model was compared with a complete-pooling model in SmithEtAl17_LanguageLearning). The problem with no-pooling is that agents are forced to start from scratch with each partner. Community-level expectations never get off the ground.

Our hierarchical partial-pooling account offers a compromise between these extremes. Unlike complete-pooling and no-pooling models, we propose that beliefs about language have hierarchical structure. That is, the meanings used by different partners are expected to be drawn from a shared community-wide distribution but are also allowed to differ from one another in systematic, partner-specific ways. This structure provides an inductive pathway for abstract population-level expectations to be distilled from partner-specific experience.

The key predictions distinguishing our model thus concern the pattern of generalization across partners. Experience with a single partner ought to be relatively uninformative about further partners, hence our partial-pooling account behaves much like a no-pooling model in predicting strong partner-specificity and discounting outliers (see dautriche2021, which explores this prediction in a developmental context). After interacting with multiple partners in a tight-knit community, however, speakers should become increasingly confident that labels are not simply idiosyncratic features of a particular partner's lexicon but are shared across the entire community, gradually transitioning to the behavior of a complete-pooling model. In this section, we test this novel prediction in a networked communication game and explicitly compare our model to pure complete-pooling and no-pooling variants.

3.1 Model predictions

Figure 7: In our simulations and behavioral experiment, participants were (A) placed in fully-connected networks of 4, and (B) paired in a round-robin schedule of repeated reference games with each neighbor.

We first compare the generalization behavior produced by each model by simulating the outcomes of interacting with multiple partners on a small network (see Fig. 7A). We used a round-robin scheme (Fig. 7B) to schedule four agents into a series of repeated reference games with their three neighbors, playing 8 successive trials with one partner before advancing to the next, for a total of 24 trials. These reference games used a set of two objects and four utterances as in Simulation 1.2; agents were randomized to roles when assigned to a new partner and swapped roles after each repetition block.

Unlike our previous simulations with a single partner, where hierarchical generalization was irrelevant, we must now specify the hyper-prior $P(\Theta)$ governing the overall distribution of partners (Eq. 4). Following KempPerforsTenenbaum07_HBM, we extend the uniform categorical prior over objects to a hierarchical Dirichlet-Multinomial model gelman_bayesian_2014, where the categorical prior over the partner-specific meaning of each utterance, $\phi \sim \textrm{Categorical}(\Theta)$, is not uniform, but given by a parameter $\Theta$ that is shared across the entire population. Because $\Theta$ is a vector of probabilities that must sum to 1, we assume it is drawn from a Dirichlet prior:

$$\Theta \sim \textrm{Dirichlet}(\alpha)$$

where $\alpha$ gives the concentration parameter encoding the agent’s beliefs, or “over-hypotheses,” about both the central tendency and the variability of lexicons in the population. The relative values of the entries of $\alpha$ correspond to inductive biases regarding the central tendency of lexicons, while the absolute magnitude of the scaling factor roughly corresponds to prior beliefs about the spread, where larger magnitudes correspond to more concentrated probability distributions. We assume the agent has uncertainty about the population-level central tendency by placing a hyper-prior on $\alpha$ (see cowans2004information), roughly corresponding to the weak initial preferences we used in our previous simulations.
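To make this generative structure concrete, here is a minimal sketch of the hierarchical prior; the numbers of objects, partners, and utterances and the value of the concentration parameter are illustrative placeholders, not the settings used in our simulations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_objects, n_partners, n_utterances = 2, 3, 4

# Concentration parameter: relative entries encode the population's central
# tendency; overall magnitude encodes how tightly partners cluster around it.
alpha = np.full(n_objects, 0.5)   # hypothetical value

for u in range(n_utterances):
    theta = rng.dirichlet(alpha)  # population-level distribution over meanings of u
    # each partner's meaning for u is a separate draw from the shared parameter
    phi = rng.choice(n_objects, size=n_partners, p=theta)
    print(f"utterance {u}: theta={theta.round(2)}, partner-specific meanings={phi}")
```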

Figure 8: Simulation results and empirical data for (A) speaker reduction, and (B) network convergence across three partners. Vertical lines mark boundaries where new partners were introduced, and the thin grey line represents beliefs about a novel partner at each point in time.

We may then define the no-pooling and complete-pooling models by lesioning this shared structure in different ways. The no-pooling model assumes an independent $\Theta_k$ for every partner $k$, rather than sharing a single population-level parameter. Conversely, the complete-pooling model assumes a single, shared $\phi$ rather than allowing different values for different partners. We simulated 48 networks for each model, holding the other parameters fixed (see Fig. A3 in the Appendix for an exploration of other parameter values).
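The two lesions can be expressed as changes to where partner-specific meanings come from; the following schematic continues the toy setup above and is not the simulation code itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def partial_pooling(alpha, n_partners):
    theta = rng.dirichlet(alpha)     # one shared population-level parameter
    return [rng.choice(len(alpha), p=theta) for _ in range(n_partners)]

def no_pooling(alpha, n_partners):
    # an independent population-level parameter for every partner:
    # nothing learned from one partner transfers to the next
    return [rng.choice(len(alpha), p=rng.dirichlet(alpha)) for _ in range(n_partners)]

def complete_pooling(alpha, n_partners):
    # a single shared meaning for all partners: no partner-specific deviation
    phi = rng.choice(len(alpha), p=rng.dirichlet(alpha))
    return [phi] * n_partners
```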

Speaker utterance length across partners

We begin by examining our model’s predictions about how a speaker’s referring expressions change with successive listeners. While it has been frequently observed that messages reduce in length across repetitions with a single partner krauss_changes_1964 and sharply revert back to longer utterances when a new partner is introduced wilkes-gibbs_coordinating_1992, the key prediction distinguishing our model concerns behavior across subsequent partner boundaries. Complete-pooling accounts predict no reversion in number of words when a new partner is introduced (Fig. 8A, first column). No-pooling accounts predict that roughly the same initial description length will re-occur with every subsequent interlocutor (Fig. 8A, second column).

Here we show that a partial pooling account predicts a more complex pattern of generalization. First, unlike the complete-pooling model, we find that the partial-pooling speaker model reverts or jumps back to a longer description at the first partner swap. This reversion is due to ambiguity about whether the behavior of the first partner was idiosyncratic or attributable to community-level conventions. In the absence of data from other partners, a partner-specific explanation is more parsimonious. Second, unlike a no-pooling model, after interacting with several partners, the model becomes more confident that one of the short labels is shared across the entire community, and is correspondingly more likely to begin a new interaction with it (Fig. 8A, third column).

It is possible, however, that these two predictions only distinguish our partial-pooling model at a few parameter values; the no-pooling and complete-pooling models could produce the same qualitative effects elsewhere in parameter space. To conduct a more systematic model comparison, then, we simulated 10 networks in each cell of a large grid manipulating the optimality, cost, and memory discounting parameters. We computed a “reversion” statistic (the magnitude of the change in utterance length immediately after a partner swap) and a “generalization” statistic (the magnitude of the change in utterance length from the initial trial with the agent’s first partner to the initial trial with the final partner) and conducted single-sample $t$-tests at each parameter value to compare these statistics with what would be expected due to random variation. We found that only the partial-pooling model consistently makes both predictions across a broad regime: the complete-pooling model fails to predict reversion nearly everywhere, while the no-pooling model fails to predict generalization nearly everywhere. Detailed results are shown in Fig. A4 in the Appendix.
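As a sketch of how these two statistics could be computed from simulated trajectories (the data layout, placeholder values, and trial indices below are hypothetical):

```python
import numpy as np
from scipy import stats

def reversion(lengths, swap_trial):
    """Change in mean utterance length immediately after a partner swap."""
    return lengths[:, swap_trial] - lengths[:, swap_trial - 1]

def generalization(lengths, first_trials):
    """Change from the first trial with the first partner to the first trial
    with the final partner."""
    return lengths[:, first_trials[-1]] - lengths[:, first_trials[0]]

# placeholder data: 10 networks x 24 trials of mean utterance lengths
lengths = np.random.default_rng(0).normal(5, 1, size=(10, 24))
print(stats.ttest_1samp(reversion(lengths, swap_trial=8), popmean=0))
print(stats.ttest_1samp(generalization(lengths, first_trials=[0, 8, 16]), popmean=0))
```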

Network convergence

Because all agents are simultaneously making inferences about the others, the network as a whole faces a coordination problem. For example, in the first block, agents 1 and 2 may coordinate on one label for a given object while agents 3 and 4 coordinate on another. Once they swap partners, they must negotiate this potential mismatch in usage. How does the network as a whole manage to coordinate?

We measured alignment by examining the intersection of utterances produced by speakers: if two agents produced overlapping utterances to refer to a given target (i.e. a non-empty intersection), we assigned a 1; otherwise we assigned a 0. At the beginning and end of each interaction, we calculated alignment between currently interacting agents (i.e. within a dyad) and those who were not interacting (i.e. across dyads), averaging across target objects. Alignment across dyads was initially near chance, reflecting the arbitrariness of which short label each dyad reduces to. Under a complete-pooling model (Fig. 8B, first column), agents sometimes persist with mis-calibrated expectations learned from previous partners rather than adapting to their new partner, and within-dyad alignment deteriorates. Under a no-pooling model (Fig. 8B, second column), convergence on subsequent blocks remains near chance, as conventions need to be re-negotiated from scratch. By contrast, under our partial-pooling model, alignment across dyads increases without affecting alignment within dyads, suggesting that hierarchical inference leads to emergent consensus (Fig. 8B, third column).
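A minimal sketch of this alignment measure, assuming a hypothetical data layout in which each speaker’s productions are stored as sets of utterances per target:

```python
from itertools import combinations

def network_alignment(productions, dyads):
    """productions[s][t] = set of utterances speaker s used for target t;
    dyads = frozensets of currently interacting pairs. Returns the mean
    binary overlap within dyads and across dyads."""
    pairs = list(combinations(productions, 2))
    within = [p for p in pairs if frozenset(p) in dyads]
    across = [p for p in pairs if frozenset(p) not in dyads]

    def score(ps):
        hits = sum(int(len(productions[a][t] & productions[b][t]) > 0)
                   for a, b in ps for t in productions[a])
        total = sum(len(productions[a]) for a, _ in ps)
        return hits / max(1, total)

    return score(within), score(across)

productions = {1: {"t1": {"mawa"}}, 2: {"t1": {"mawa"}},
               3: {"t1": {"kofi"}}, 4: {"t1": {"kofi"}}}
print(network_alignment(productions, dyads={frozenset((1, 2)), frozenset((3, 4))}))
# -> (1.0, 0.0): aligned within current dyads, not yet aligned across dyads
```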

3.2 Behavioral experiment

To evaluate the predictions derived in our simulations, we designed a natural-language communication experiment following roughly the same network design as our simulations. That is, instead of anonymizing partners, as in many previous empirical studies of convention formation (e.g. centola_spontaneous_2015), we divided the experiment into blocks of extended dyadic interactions with stable, identifiable partners (see fay_interactive_2010, garrod_conversation_1994 for similar designs). Each block was a full repeated reference game, where participants had to coordinate on ad hoc conventions for how to refer to novel objects with their partner. Our partial-pooling model predicted that these conventions would partially reset at partner boundaries, but that agents should be increasingly willing to transfer expectations from one partner to another.

Participants

We recruited 92 participants from Amazon Mechanical Turk to play a series of interactive, natural-language reference games using the framework described in Hawkins15_RealTimeWebExperiments.

Stimuli and procedure

Each participant was randomly assigned to one of 23 fully-connected networks with three other participants as their ‘neighbors’ (Fig. 7A). Each network was then randomly assigned one of three distinct “contexts” containing four abstract tangram stimuli taken from ClarkWilkesGibbs86_ReferringCollaborative. The experiment was structured as a series of three repeated reference games with different partners, using these same four stimuli as referents. Partner pairings were determined by a round-robin schedule (Fig. 7B). The trial sequence for each reference game was composed of four repetition blocks, where each target appeared once per block. Participants were randomly assigned to speaker and listener roles and swapped roles on each block. After completing sixteen trials with one partner, participants were introduced to their next partner and asked to play the game again. This process repeated until each participant had partnered with all three neighbors. Because some pairs within the network took longer than others, we sent participants to a temporary waiting room if their next partner was not ready.

Each trial proceeded as follows. First, one of the four tangrams in the context was highlighted as the target object for the speaker, who was instructed to use a chatbox to communicate the identity of this object to their partner, the listener (see Fig. 7C). The two participants could engage freely in dialogue through the chatbox, but the listener ultimately had to make a selection from the array. Finally, both participants in a pair were given full feedback on each trial about their partner’s choice and received bonus payment for each correct response. The order of the stimuli on the screen was randomized on every trial to prevent the use of spatial cues (e.g. ‘the one on the left’). The display also contained an avatar for the current partner, representing different partners with different colors (as shown in Fig. 7), to emphasize that they were speaking to the same partner for an extended period. On the waiting screen between partners, participants were shown the avatars of their previous partner and upcoming partner and told that they were about to interact with a new partner.

3.3 Results

We evaluated participants’ generalization behavior on the same three metrics we used in our simulations: accuracy, utterance length, and network convergence.

Network convergence

First, we examine the content of conventions and evaluate the extent to which alignment increased across the network over the three partner swaps. Specifically, we extend the same measure of alignment used in our simulations to natural language data by examining whether the intersection of words produced by different speakers was non-empty. We excluded a list of common stop words (e.g. ’the’, ’both’) to focus on the core conceptual content. While this pure overlap measure provides a relatively weak notion of similarity, a more continuous measure based on the size of the intersection or the string edit distance yielded similar results.
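As a concrete illustration of this measure, the following sketch tokenizes two utterances, filters a tiny placeholder stop-word list (standing in for the full list used in our analysis), and checks for a non-empty intersection:

```python
# placeholder stop-word list for illustration only
STOP_WORDS = {"the", "a", "an", "both", "one", "it", "is", "of"}

def content_words(utterance):
    """Lowercase, whitespace-tokenize, and drop stop words."""
    return {w for w in utterance.lower().split() if w not in STOP_WORDS}

def overlap(utt_a, utt_b):
    """1 if the speakers share at least one content word, else 0."""
    return int(bool(content_words(utt_a) & content_words(utt_b)))

print(overlap("the ice skater", "both arms up like a skater"))  # -> 1
```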

As in our simulations, the main comparison of interest was between currently interacting participants and participants who were not interacting: we predicted that within-pair alignment should stay consistently high while (tacit) alignment between non-interacting pairs would increase. We thus constructed a mixed-effects logistic regression including fixed effects of pair type (within vs. across), partner number, and their interaction. We included random intercepts at the tangram level and maximal random effects at the network level (i.e. intercept, both main effects, and the interaction). As predicted, we found a significant interaction (see Fig. 8B, bottom row). Although different pairs in a network may initially use different labels, these labels begin to align over subsequent interactions.

Speaker utterance length

Now we are in a position to evaluate the central prediction of our model. Our partial-pooling model predicts (1) gains in efficiency within interactions with each partner, (2) reversions to longer utterances at partner boundaries, and (3) gradual shortening of the initial utterance chosen with successive partners. As a measure of efficiency, we calculated the raw length (in words) of the utterance produced on each trial. Because the distribution of utterance lengths is heavy-tailed, we log-transformed these values. To test the first prediction, we constructed a linear mixed-effects regression predicting trial-level speaker utterance length (see Fig. 8A, bottom row). We included a fixed effect of repetition block within partner (1, 2, 3, 4), along with random intercepts and slopes for each participant and each tangram. We found that speakers reduced utterance length significantly over successive interactions with each individual partner.

To test the extent to which speakers revert to longer utterances at partner boundaries, we constructed another regression model. We coded the repetition blocks immediately before and after each partner swap and included this as a categorical fixed effect. Because partner roles were randomized for each game, the same participant did not always serve as listener in both blocks, so in addition to tangram-level intercepts, we included random slopes and intercepts at the network level (instead of the participant level). As predicted, we found that utterance length increased significantly at the two partner swaps.

Finally, to test whether efficiency improves for the very first interaction with each new partner, before any partner-specific information has been observed, we examined the simple effect of partner number on the trials immediately following each partner swap. We found that participants gradually decreased the length of their initial descriptions with each new partner in their network (see Fig. 8A, bottom row), suggesting that speakers bring increasingly well-calibrated expectations into interactions with novel neighbors. The partial-pooling model is the only model predicting all three of these effects.

3.4 Discussion

Our partial-pooling model, which follows from general principles of hierarchical Bayesian inference, suggests that conventions represent the shared structure that agents “abstract away” from partner-specific learning. In this section, we evaluated the extent to which our partial-pooling model captured human generalization behavior in a natural-language communication experiment on small networks. Unlike complete-pooling accounts, our model allows partner-specific common ground to override community-wide expectations given sufficient experience with a partner, or in the absence of strong conventions. Unlike no-pooling accounts, it results in networks that converge on more efficient and accurate expectations about novel partners.

While the interactive alignment account is not easily classified into either the complete-pooling or no-pooling class, it is also not straightforward for the priming mechanisms it proposes to explain these patterns of partner-specificity without being augmented with additional social information. If a particular semantic representation has been activated due to precedent in the preceding dialogue, then the identity of the speaker should not in principle alter its continued influence brennan2009partner. More sophisticated hierarchical memory retrieval accounts that represent different partners as different contexts (e.g. polyn2009context) may be consistent with partner-specificity, but evoking such an account presupposes that social information like partner identity is already a salient and relevant feature of the communicative environment, and thus no longer relies purely on “egocentric” priming and activation mechanisms. Indeed, a process-level account assuming socially-aware context reinstatement for partner-specific episodic memories, and slower consolidation of shared features into population-level expectations, may be one possible candidate for realizing our computational-level model (see General Discussion for more on process-level evidence).

4 Phenomenon #3:
Conventions are shaped by communicative context

Figure 9: Domain for context-sensitivity. (A) Targets are related to one another in a conceptual taxonomy. (B) Speakers choose between labels, where the label “niwa” has been selected. (C) Examples of fine and coarse contexts. In the fine context, the target (marked in black) must be disambiguated from a distractor (marked in grey) at the same subordinate-level branch of the taxonomy. In the coarse context, the closest distractor belongs to a different branch of the center-level of the taxonomy (i.e. a spotted circle) such that disambiguation at the sub-ordinate level is not required.

In the previous two sections, we examined a mechanism for rapid, partner-specific learning that allows agents to form stable but arbitrary ad hoc conventions with partners and gradually generalize them to their entire community. The final phenomenon we consider is the way that ad hoc conventions are shaped by the communicative context in which they form. This phenomenon is most immediately motivated by recent behavioral results finding that more informative words in the local context are significantly more likely to become conventionalized hawkins2020characterizing. However, our broader theoretical aim is to suggest that diachronic patterns in the long-term evolution of a community’s lexical conventions, as highlighted by the Optimal Semantic Expressivity (OSE) hypothesis frankblogpost, may be explained as a result of the synchronic processes at play when individual dyads coordinate on ad hoc meanings.

Briefly, when there is already a strong existing convention that is expected to be shared across the community, our model predicts that speakers will use it. New ad hoc conventions arise precisely to fill gaps in existing population-level conventions: to handle new situations where existing conventions are not sufficient to accurately and efficiently make the distinctions required in the current context. A corollary of this prediction is that ad hoc conventions may only shift to expectations at the population level (and ultimately to population-level convergence) when those distinctions are consistently relevant across interactions with different partners. (This follows by induction from the hierarchical generalization mechanisms evaluated for P2, which provide the pathway by which ad hoc conventions become adopted by a larger community over longer time scales. Many ad hoc conventions never generalize to the full language community simply because the contexts where they are needed are rare; they must be re-negotiated with subsequent partners on an ad hoc basis.) For example, while most English speakers have the basic-level word “tree” in their lexicon, along with a handful of subordinate-level words like “maple” or “fir,” we typically do not have conventionalized labels exclusively referring to each individual tree in our yards – we are rarely required to refer to individual trees. Meanwhile, we do often have shared conventions (i.e. proper nouns) for individual people and places that a community regularly encounters and needs to distinguish among. Indeed, this logic may explain why a handful of particularly notable trees do have conventionalized names, such as the Fortingall Yew, the Cedars of God, and General Sherman, the giant sequoia.

As a first step toward explaining these diachronic patterns in which conventions form, we aim to establish in this section that our model allows a single dyad’s ad hoc conventions to be shaped by communicative context over short timescales. Specifically, our model predicts that people will form conventions at the highest level of abstraction that is able to satisfy their communicative needs. That is, when the local environment imposes a communicative need to refer to particular ad hoc concepts (e.g. describing a particular tree that needs to be cut down), communicative partners are able to coordinate on efficient lexical conventions for successfully doing so (e.g. “the mossy one”). We begin by showing that context-sensitivity naturally emerges from our model as a downstream consequence of recursive pragmatic reasoning. We then empirically evaluate this account by manipulating which distinctions are relevant in an artificial-language repeated reference game building on WintersKirbySmith14_LanguagesAdapt, winters2018contextual, allowing us to observe the emergence of ad hoc conventions from scratch. In both the empirical data and our model simulations, we find that conventions come to reflect the distinctions that are functionally relevant for communicative success, and that pragmatic reasoning is needed for these effects to arise.

4.1 Model predictions

To evaluate the impact of context on convention formation, we require a different task than the ones used in the previous sections. Those tasks, like most reference games in the literature on convention formation, used a discrete set of unrelated objects in a fixed context. In real referential contexts, however, targets are embedded in larger conceptual taxonomies, where some objects are more similar than others bruner1956study; collins1969retrieval; XuTenenbaum07_WordLearningBayesian. Here, we therefore consider a space of objects embedded in a three-level stimulus hierarchy, with shape at the top-most level, color/texture at the intermediate levels, and frequency/intensity at the finest levels (see Fig. 9A). While we will use the full stimulus set in our empirical study, it is sufficient for our simulations to consider just one of the branches (i.e. just the squares). We populate the space of possible utterance meanings with 4 meanings at the subordinate level (one for each individual object, e.g. “light blue square”), 2 meanings at the center level (e.g. “blue square”), and 1 meaning at the super-ordinate level (e.g. “square”). We then populate the utterance space with 8 single-word labels (Fig. 9B) and also allow for a “null” meaning with an empty extension to account for the possibility that some utterances are not needed, allowing the agent to effectively remove utterances from their vocabulary.
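One way to encode the simulated branch of this hierarchy is as a set of extensions (sets of objects) at the three levels of generality, plus the null meaning; the object and meaning names below are invented for illustration:

```python
OBJECTS = ["light-blue", "dark-blue", "light-red", "dark-red"]  # all squares

MEANINGS = {
    # four subordinate-level meanings, one per object
    **{o: {o} for o in OBJECTS},
    # two center-level meanings
    "blue": {"light-blue", "dark-blue"},
    "red": {"light-red", "dark-red"},
    # one super-ordinate meaning covering the whole branch
    "square": set(OBJECTS),
    # null meaning with an empty extension: the word is effectively unused
    "null": set(),
}

def literal_semantics(meaning, obj):
    """1 if obj falls in the meaning's extension, else 0."""
    return int(obj in MEANINGS[meaning])
```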

Additionally, in real environments, speakers do not have the advantage of a fixed context; the relevant distinctions change from moment to moment as different subsets of objects are in context at different times. This property poses a challenge for models of convention formation because the relevant distinctions cannot be determined from a single context; they must be abstracted over time. We therefore only displayed four of the eight objects in the context on a given trial. Distractors could differ from the target at various levels of the hierarchy, creating different types of contexts defined by the finest distinction that had to be drawn (e.g. Fig. 9C). Critically, we manipulated the prevalence of different kinds of contexts, controlling how often participants were required to make certain distinctions to succeed at the task. In the fine condition, every context contained a subordinate distractor, requiring fine low-level distinctions to be drawn. In the coarse condition, contexts never contained subordinate distractors, only distractors that differed at the central level of the hierarchy (e.g. a blue square when the target is a red square). For comparison, we also included a mixed condition, where targets sometimes appeared in fine contexts with subordinate distractors and other times appeared in coarse contexts without them; the context type was randomized between these two possibilities on each trial.

We constructed the trial sequence identically for the three conditions. On each trial, we randomly sampled one of the four objects to be the target, ensuring that no target appeared more than once in a row, and then sampled a distractor according to the constraints of the context type. As before, the agents swapped roles on each trial, and we ran 50 distinct trajectories with fixed settings of the optimality and memory discounting parameters.
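The trial generator can be sketched as follows, reusing the toy object names above. This is an illustrative simplification (sampling a single distractor rather than a full four-object context), not the generator used in the experiment:

```python
import random

def same_center(a, b):
    """True if two toy objects share their center-level category (their color)."""
    return a.split("-")[1] == b.split("-")[1]

def sample_trial(objects, condition, prev_target=None, rng=random):
    target = rng.choice([o for o in objects if o != prev_target])  # no repeats
    others = [o for o in objects if o != target]
    if condition == "mixed":
        condition = rng.choice(["fine", "coarse"])  # randomized per trial
    if condition == "fine":
        # context must contain a subordinate distractor (same center category)
        distractors = [o for o in others if same_center(target, o)]
    else:
        # coarse: only center-level distinctions are required
        distractors = [o for o in others if not same_center(target, o)]
    return target, rng.choice(distractors)

random.seed(0)
print(sample_trial(["light-blue", "dark-blue", "light-red", "dark-red"], "fine"))
```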

Figure 10: Comparison of simulation results to empirical data. (A) Agents in our simulation learn to coordinate on a successful communication system, but converge faster in the coarse condition than the fine condition. (B) The number of unique words used by agents in each repetition block increased in the fine condition but stayed roughly constant in the coarse condition. (C-D) The same metrics computed on our empirical data, qualitatively matching the patterns observed in the simulations. Each point is the mean proportion of correct responses by listeners; curves are nonparametric fits.
Figure 11: Dynamics of lexical beliefs over time in model simulations. Regions represent the average proportion of words at each level of generality in an agent’s beliefs about the lexicon. In the coarse condition, agents initially assume subordinate terms but gradually abstract away to a smaller number of more general terms; in the fine and mixed conditions, however, agents become more confident of subordinate terms.

4.1.1 Partners successfully learn to communicate

First, we compare the model’s learning curves across context conditions (Fig. 10A). We focus on the coarse and fine conditions for simplicity, since this single comparison captures the core phenomena of interest. In a mixed-effects logistic regression, we find that communicative accuracy steadily improves over time across all conditions. However, accuracy also differed across conditions: adding a main effect of condition significantly improves model fit. Accuracy is significantly higher in the coarse condition than in the fine condition and marginally higher than in the mixed condition.

4.1.2 Lexical conventions are shaped by context

As an initial marker of context sensitivity, we examine the effective vocabulary sizes used by speakers in each condition. We operationalized this measure by counting the total number of unique words produced within each repetition block. This measure takes a value of 8 when a different word is consistently used for every object, and a value of 1 when exactly the same word is used for every object. In a mixed-effects regression model including intercepts and random effects of trial number for each simulated trajectory, we find an overall main effect of condition, with agents in the coarse condition using significantly fewer words across all repetition blocks than agents in the fine condition. However, we also found a significant interaction: the effective vocabulary size gradually increased over time in the fine condition, while it stayed roughly constant in the coarse condition (see Fig. 10B).
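This measure is straightforward to compute; the sketch below assumes a hypothetical layout mapping repetition blocks to lists of produced utterances:

```python
def effective_vocabulary(block_utterances):
    """Number of unique words produced within each repetition block."""
    return {block: len({w for utt in utts for w in utt.split()})
            for block, utts in block_utterances.items()}

print(effective_vocabulary({0: ["mawa", "mawa", "kofi", "kofi"],
                            1: ["mawa", "gebu", "kofi", "nipa"]}))
# -> {0: 2, 1: 4}: the vocabulary differentiates across blocks,
#    as in the fine condition
```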

Next, we examine more closely the emergence of terms at different levels of generality. We have access not only to the signaling behavior of our simulated agents, but also to their internal beliefs about their partner’s lexicon, which allows us to directly examine the evolution of these beliefs from the beginning of the interaction. At each time point in each game, we take the single meaning with the highest probability for each word. In Fig. 11, we show the proportion of words with meanings at each level of generality, collapsing across all games in each condition.
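A sketch of this belief summary, assuming a hypothetical posterior format mapping each word to a distribution over meanings (meaning names follow the toy encoding used earlier):

```python
from collections import Counter

LEVEL = {"blue": "center", "red": "center", "square": "super", "null": "null"}

def level_of(meaning):
    return LEVEL.get(meaning, "subordinate")  # individual objects are subordinate

def meaning_levels(posterior):
    """Tally the highest-probability meaning of each word by level of generality."""
    map_meanings = {w: max(dist, key=dist.get) for w, dist in posterior.items()}
    return Counter(level_of(m) for m in map_meanings.values())

posterior = {"mawa": {"blue": 0.7, "light-blue": 0.3},
             "kofi": {"light-red": 0.8, "red": 0.2}}
print(meaning_levels(posterior))  # Counter({'center': 1, 'subordinate': 1})
```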

Qualitatively, we observe that agents begin by assuming null meanings (i.e. with an effectively empty vocabulary) but quickly begin assigning meanings to words based on their partner’s usage. In both conditions, basic-level meanings and subordinate-level meanings are equally consistent with the initial data, but the simplicity prior prefers smaller effective vocabulary sizes where each word has a smaller extension. After the first repetition block, however, agents in the coarse condition begin pruning out some of the subordinate-level terms and become increasingly confident of basic-level meanings, while agents in the fine condition become even more confident of subordinate-level meanings.

By the final trial, the proportion of basic-level vs. subordinate-level terms differed significantly across the coarse and fine conditions. Only 9% of words had subordinate-level meanings (green) in the coarse condition, compared with 79% in the fine condition. At the same time, 45% of words had basic-level meanings (blue) in the coarse condition, compared with only 8% in the fine condition. The remaining words in each condition were assigned the ‘null’ meaning (red), consistent with an overall smaller effective vocabulary size in the coarse condition. These diverging conventions across contexts are driven by Gricean expectations: because the speaker is assumed to be informative, only lexicons distinguishing between subordinate-level objects can explain the speaker’s behavior in the fine condition.

4.2 Experimental methods

In this section, we evaluate our model’s qualitative predictions about the effect of context on convention formation using an interactive behavioral experiment closely matched to our simulations. We use a between-subjects design where pairs of participants are assigned to different communicative contexts and test the extent to which they converge on meaningfully different conventions.

4.2.1 Participants

We recruited 278 participants from Amazon Mechanical Turk to play an interactive, multi-player game. This experiment was pre-registered at https://osf.io/2hkjc/. All statistical tests in mixed-effects models reported in this section use degrees of freedom based on the Satterthwaite approximation luke2017evaluating.

4.2.2 Procedure & Stimuli

Participants were paired over the web and placed in a shared environment containing an array of objects (Fig. 9A) and a ‘chatbox’ to choose utterances from a fixed vocabulary by clicking-and-dragging (Fig. 9B). On each trial, one player (the ‘speaker’) was privately shown a highlighted target object and allowed to send a single word to communicate the identity of this object to their partner (the ‘listener’), who subsequently made a selection from the array. Players were given full feedback, swapped roles each trial, and both received bonus payment for each correct response.

We randomly generated distinct arrays of 16 utterances for each pair of participants (more than in our model, which was restricted by computational complexity). These utterances were created by stringing together consonant-vowel pairs into pronounceable 2-syllable words, to reduce the cognitive load of remembering previous labels (see Fig. 9B). These arrays were held constant across trials.

To match our model as closely as possible, pairs were assigned one of the same sequences of trials that we constructed for our simulations. In addition to behavioral responses collected over the course of the game, we designed a post-test to explicitly probe players’ final lexica. For all sixteen words, we asked players to select all objects that a word could refer to (if any), and for each object, we asked players to select all words that could refer to it (if any). This bidirectional measure allowed us to check the internal validity of the reported lexica. Pairs were randomly assigned to one of the three conditions (coarse, fine, or mixed) after excluding participants who disconnected before completion.

Figure 12: Different lexicons emerge in different contexts. Mean number of words, out of a word bank of 16 words, that human participants reported giving more specific meanings (black; applying to 1 object) or less specific meanings (dark grey; applying to 2 objects) in the post-test.

4.3 Behavioral results

4.3.1 Partners successfully learn to communicate

Although participants in all conditions began with no common basis for label meanings, performing near chance on the first trial, most pairs were nonetheless able to coordinate on a successful communication system over repeated interaction (see Fig. 10C). A mixed-effects logistic regression on listener responses with trial number as a fixed effect, and including by-pair random slopes and intercepts, showed a significant improvement in accuracy overall. Accuracy also differed significantly across conditions: adding a main effect of condition to our logistic model provided a significantly better fit. Qualitatively, the coarse condition was easiest for participants, the fine condition was hardest, and the mixed condition was in between. These effects track the most important qualitative feature of our simulations – our artificial agents were also able to successfully coordinate in both conditions, and did so more easily in the coarse condition than the fine condition. However, the gap in coordination speed between the coarse condition and the mixed and fine conditions was larger than our simulations predicted. The additional difficulty participants experienced in the fine condition may be due to motivational constraints, memory constraints, or other factors not captured in our model.

4.3.2 Validating post-test responses

Before examining post-test responses, we validated their internal consistency. For each participant, we counted the number of mismatches between the two directions of the lexicon question (e.g. if they clicked the word ‘mawa’ when we showed them one of the blue squares, but failed to click that same blue square when we showed ‘mawa’). In general, participants were highly consistent: out of the 128 cells in the lexicon matrix (16 words × 8 objects), the median number of mismatches was 2 (98% agreement), though the distribution has a long tail. We therefore conservatively took a participant’s final lexicon to be the intersection of their word-to-object and object-to-word responses for the subsequent analyses.
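The consistency check amounts to comparing one response matrix against the transpose of the other; a sketch with placeholder responses:

```python
import numpy as np

rng = np.random.default_rng(0)
# placeholder post-test responses: 16 words x 8 objects, and its mirror question
word_to_obj = rng.integers(0, 2, size=(16, 8)).astype(bool)
obj_to_word = word_to_obj.T.copy()
obj_to_word[0, 0] ^= True                              # inject one mismatch

mismatches = int((word_to_obj != obj_to_word.T).sum())  # out of 128 cells
final_lexicon = word_to_obj & obj_to_word.T             # conservative intersection
print(mismatches)  # -> 1
```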

4.3.3 Contextual pressures shape the lexicon

We predicted that in contexts regularly requiring speakers to make fine distinctions among objects at subordinate levels of the hierarchy, we would find lexicalization of specific terms for each object (indeed, a one-to-one mapping may be the most obvious solution in a task with only 8 objects). Conversely, when no such distinctions were required, we expected participants to adaptively conventionalize more general terms that could be reused across different contexts. One coarse signature of this prediction lies in the compression of the resulting lexicon: less specific conventions should allow participants to achieve the same communicative accuracy with a smaller vocabulary. To test this prediction, we operationalized vocabulary size as the number of words in each participant’s reported lexicon in the post-test (i.e. the words for which they marked at least one object in an internally consistent way). We then conducted a mixed-effects regression predicting each individual’s vocabulary size as a function of dummy-coded condition factors, with random intercepts for each game. We found that participants in the coarse condition reported significantly smaller, simpler lexica than participants in the mixed and fine conditions (see Fig. 12).

What allowed participants in the coarse condition to get away with fewer words in their lexicon while maintaining high accuracy? We hypothesized that each word had a larger extension size. To test this hypothesis, we counted the numbers of ‘specific’ terms (i.e. words that refer to only one object) and more ‘general’ terms (i.e. words that refer to two objects) in the post-test. We found that the likelihood of lexicalizing more general terms differed systematically across conditions: participants in the coarse condition reported significantly more general terms than those in the fine or mixed conditions, where lexicons contained almost exclusively specific terms. Using the raw extension size of each word as the dependent variable instead of counts yielded similar results. Indeed, the modal system in the fine condition was exactly eight specific terms with no more general terms, and the modal system in the coarse condition was exactly four general terms (red, blue, striped, spotted) with no specific terms. However, many individual participants reported a mixture of terms at different levels of generality (see Appendix Fig. A5).

Finally, how did these lexica emerge over the course of interaction? We used the same measure of unique words produced in each repetition block that we used in our simulations (Fig. 10D). We constructed a mixed-effects regression model predicting the effective vocabulary size, including fixed effects of condition and repetition block, and random intercepts and effects of repetition block for each dyad. As in the post-test reports, we found an overall main effect of condition, with participants in the coarse condition using significantly fewer words across all repetition blocks than participants in the mixed and fine conditions. Critically, however, we also found a significant interaction between block and condition: the effective vocabulary size gradually increased over time in the fine condition but remained roughly constant in the coarse condition (see Fig. 10D). This interaction, in which participants in the fine condition initially attempted to reuse the same terms across targets before gradually differentiating them, is consistent with differentiation driven by communicative need.

4.4 Discussion

There is abundant evidence that languages adapt to the needs of their users. Our model provides a cognitive account of how people coordinate on ad hoc linguistic conventions that suit their immediate needs. In this section, we evaluated predictions about context-sensitivity using new data from a real-time communication task. When combined with the generalization mechanisms explored in the previous section, such rapid learning within dyadic interactions may be a powerful contributor allowing languages to adapt at the population-level over longer time scales.

Previous studies of convention formation have addressed context-sensitivity in different ways. In some common settings, there is no explicit representation of context at all, as in the task known as the “Naming Game” where agents coordinate on names for objects in isolation steels2012experiments; baronchelli2008depth. In other settings, communication is situated in a referential context, but this context is held constant, as in Lewis signaling games lewis_convention:_1969 where agents must distinguish between a fixed set of world states skyrms2010signals; BrunerEtAl14_LewisConventions. Finally, in the Discrimination Game steels2005coordinating; baronchelli2010modeling, contexts are randomly generated on each trial, but have not been manipulated to assess context-sensitivity of the convention formation process.

In other words, context-sensitivity has typically been implicit in existing models. Models using simple update rules have accounted for referential context with a lateral inhibition heuristic used by both the speaker and listener agents franke2012bidirectional; steels2005coordinating. If communication is successful, the connection strength between the label and object is not only increased; the connections between the label and competing objects (and, similarly, between the object and competing labels) are explicitly decreased by a corresponding amount. This lateral inhibition heuristic is functionally similar to our pragmatic reasoning mechanism in that it allows the agent to learn from negative evidence (i.e. the speaker’s choice not to use a word, or the listener’s choice not to pick an object). Under our inferential framework, however, this property emerges as a natural consequence of well-established Gricean principles of pragmatic reasoning rather than being stipulated as a heuristic.
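A schematic of the lateral inhibition update described above; the learning rate, matrix layout, and clipping are illustrative choices rather than the settings of the cited models:

```python
import numpy as np

def lateral_inhibition_update(weights, label, obj, rate=0.1):
    """weights[l, o] = connection strength between label l and object o."""
    w = weights.copy()
    other_objs = np.arange(w.shape[1]) != obj
    other_labels = np.arange(w.shape[0]) != label
    w[label, obj] += rate         # strengthen the successful pairing
    w[label, other_objs] -= rate  # inhibit competing objects for this label
    w[other_labels, obj] -= rate  # inhibit competing labels for this object
    return np.clip(w, 0.0, 1.0)

weights = np.full((4, 2), 0.5)    # 4 labels x 2 objects, neutral start
print(lateral_inhibition_update(weights, label=0, obj=1).round(2))
```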

5 General Discussion

Communication in a variable and non-stationary landscape of meaning creates unique computational challenges. To address these challenges, we advanced a hierarchical Bayesian approach in which agents continually adapt their beliefs about the form-meaning mapping used by each partner in turn. We formalized this approach by integrating three core cognitive capacities in a probabilistic framework: representing initial uncertainty about what a partner thinks words mean (C1), partner-specific adaptation based on observations of language use in context (C2), and hierarchical structure for graded generalization to new partners (C3). This unified model resolves several puzzles that have posed challenges for prior models of coordination and convention formation: why referring expressions shorten over repeated interactions with the same partner (P1), how partner-specific common ground may coexist with the emergence of conventions at the population level (P2), and how context shapes which conventions emerge (P3).

We conclude by raising five broader questions that follow from the theoretical perspective we have advanced, each suggesting pathways for future work: (1) are the mechanisms underlying ad hoc convention formation in adults the same as those underlying language acquisition in children? (2) to what extent do these mechanisms depend on the communication modality? (3) what kinds of feedback are used to coordinate on conventions? (4) what kinds of social structure are reflected in the hierarchical prior? and (5) which representations are involved in adaptation at the process level?

5.1 Continuity of language learning across development

There is a close mathematical relationship between our model of convention formation and recent probabilistic models of word learning in developmental science <e.g.¿XuTenenbaum07_WordLearningBayesian,FrankGoodmanTenenbaum09_Wurwur. This similarity suggests an intriguing conjecture that the continual-learning mechanisms adults use to rapidly coordinate on partner-specific conventions may be the same as those supporting lexical acquisition in children. In other words, people may never stop learning language; they may simply develop stronger and better-calibrated beliefs about their community’s conventions, which apply to a broader swath of communicative scenarios. In this section, we discuss three possible implications of viewing language development in terms of social coordination and convention.

First, common paradigms for cross-situational learning typically assume a fixed speaker and focus on variability across referential contexts siskind1996computational; regier2005emergence; smith2014unrealized; yurovsky2015integrative. Yet, as we have argued, variability in how words are used by different partners may pose an equally challenging problem. While there has been limited work on the social axis of generalization, it is increasingly apparent that children are able to track who produced the words they are learning and use this information to determine whether word meanings should generalize not just to other contexts but also to other speakers. For example, young children may limit the generalizability of observations from speakers who use language in idiosyncratic ways, such as a speaker who calls a ball a “dog” koenig2010sensitivity; luchkina2018eighteen, and may even retrospectively update beliefs about earlier words after observing such idiosyncrasies dautriche2021.

This discounting of idiosyncratic speakers may be understood as an instance of the same inductive problem that convention formation poses for adults in P2. Unlike complete-pooling models, which predict that all observations should be taken as equally informative about the community’s conventions, our hierarchical model predicts that children should be able to explain away “outliers” without their community-level expectations being disrupted. Indeed, a novel prediction generated by our account is that children should be able to accommodate idiosyncratic language within extended interaction with the same speaker (e.g. continue to pretend the ball is called “dog,” given partner-specific common ground) while also limiting generalization of that convention across other speakers.

Second, we have emphasized the importance of representing lexical uncertainty, capturing expected variability in the population beyond the point estimates assumed by traditional lexical representations. But how do children calibrate their lexical uncertainty? One possibility assigns a key role to the number of distinct speakers in a child’s environment, by analogy to the literature on talker variability creel2011talker; clopper2004effects. Exposure to fewer partners may result in weaker (e.g. lev2017talking) or mis-calibrated priors for meanings. If an idiosyncratic construction is over-represented in the child’s environment, they may later be surprised to find that it was specific to their household’s lexicon and not shared by the broader community (see Clark09_FirstLanguageAcquisition, Chap. 6). Conversely, however, hierarchical inference predicts a blessing of abstraction GoodmanUllmanTenenbaum11_TheoryOfCausality: under certain conditions, community-level conventions may be inferred even from relatively short, sparse observations of each partner. To resolve these questions, future work will need to develop new methods for eliciting children’s expectations about the partner-specificity and variability of meanings.

Third, our work suggests a new explanation for why young children may struggle to coordinate on ad hoc conventions with one another in repeated reference games GlucksbergKraussWeisberg66_DevoRefGames; KraussGlucksberg69_DevoReferenceGames; KraussGlucksberg77_SocialNonsocialSpeech. Children as old as fifth grade only improve with assistance from the experimenter matthews2007toddlers: instead of beginning with the long indefinite descriptions that adults produce, children begin with shorter descriptions like “Mother’s dress” (see also kempe2019adults). While these failures were initially attributed to limits on theory of mind use, this explanation has been complicated by findings that children cannot even interpret their own utterances after a delay asher1976children. In other words, errors are not well explained by egocentric adherence to a strongly preferred label, or by downstream failures to acknowledge that this preferred label may not be understood by their partner; there appears to be no such preferred label.

Our model raises the possibility that the problem may instead stem from production with an impoverished lexical prior: there are no existing conventions for such a novel object in their vocabulary goldberg2019explain, leaving their preferences dispersed widely over “good-enough” constructions. Indeed, when children are paired with their caregivers rather than peers, they are able to successfully form conventions LeungEtAl20_Pacts. Parents helped to interactively scaffold these conventions, both by proactively seeking clarification in the listener role (e.g. anderson1994interactive) and by providing more descriptive labels in the speaker role, which children adopted themselves on later trials. These results highlight one way in which our model of convention formation may coincide with models of language acquisition in development: in both cases, listeners are trying to infer the meanings of words in the speaker’s lexicon bohn2019pervasive.

While we have highlighted three particular developmental phenomena where our approach may generate novel predictions, our computational approach raises many finer-grained questions. For example, are the child’s lexical expectations in communication best explained as a (resource-limited) representation of others’ lexicons, or as an egocentric (asocial) epistemic state? To what extent is the ability to retrieve or use this lexical prior constrained by theory of mind development? Even infants are sensitive to coarse social distinctions based on foreign vs. native language KinzlerDupouxSpelke07_LanguageGroups or accent KinzlerEtAl09_AccentRace, but when do fully partner-specific representations develop?

5.2 The role of communication modality

While we have focused primarily on verbal and textual communication channels, there has been significant progress in understanding the dynamics of adaptation in other communication modalities, including graphical GarrodFayLeeOberlanderMacLeod07_GraphicalSymbolSystems; TheisenEtAl10_SystematicityArbitrariness; hawkins2019disentangling, gestural FayListerEllisonGoldinMeadow13_GestureBeatsVocalization; motamedi2019evolving; bohn2019young, and other de novo modalities Galantucci05_EmergenceOfCommunication; RobertsGalantucci12_DualityOfPatterning; RobertsEtAl15_IconocityOnCombinatoriality; VerhoefRobertsDingemanse15_Iconicity; VerhoefEtAl16_TemporalLanguage; kempe2019adults. These modalities are important for our account in several ways.

Most importantly, it is a core claim of our hierarchical account that the basic learning mechanisms underlying adaptation and convention formation are domain-general. In other words, we predict that there is nothing inherently special about spoken or written language: any system that humans use to communicate should display similar ad hoc learning and convention formation dynamics because in every case people are simply trying to infer the system of meaning being used by their partner. Directly comparing behavior in repeated reference games across different modalities is therefore necessary to determine which adaptation effects, if any, are robust and attributable to modality-general mechanisms.

At the same time, our hierarchical learning model predicts a critical role for the priors we build up across interactions with many individuals. We therefore predict that different communication modalities should display certain systematic differences due to the representational structure of the communication channel. For example, in the verbal modality, the tangram shapes from ClarkWilkesGibbs86_ReferringCollaborative are highly “innominate” HupetEtAl91_CodabilityReference – most people do not have much experience naming or describing them with words, so their priors are weak and local adaptation plays a greater role. In the graphical modality, where communication takes place by drawing on a shared sketchpad, people can be expected to have a stronger prior rooted in assumptions about shared perceptual systems and visual similarity fan2018common – drawing a quick sketch of the tangram’s outline may suffice for understanding.

Other referents have precisely the opposite property: to distinguish between natural images of dogs, people may have strong existing conventions in the linguistic modality (e.g. ‘husky’, ‘poodle’, ‘pug’), but making the necessary fine-grained visual distinctions in the graphical modality may be initially very costly for novices fan2020pragmatic, requiring the formation of local conventions to achieve understanding hawkins2019disentangling. The gestural modality likewise has its own distinctive prior, which allows communicators to use time and the space around them to convey mimetic or depictive meanings that may be difficult to encode verbally or graphically goldin-meadow_role_1999; clark2016depicting; mcneill1992hand. We suggest that differences in production and comprehension across modalities may therefore be understood by coupling modality-specific priors with modality-general learning mechanisms.

5.3 The role of feedback and backchannels

If convention formation is grounded in inference, then an important corollary of our model is that the extent to which partners are able to coordinate should depend critically on the observations they condition on in Eq. 4. In the complete absence of feedback — when the speaker is instructed to repeatedly refer to a set of objects for a listener who is not present and will do their half of the task offline — there is no reduction in message length HupetChantraine92_CollaborationOrRepitition. Our simulations examined the most minimal sources of feedback: the speaker’s utterance and the listener’s response in a reference game. A key direction for future work is to account for richer forms of feedback that arise in natural communication. For example, a key feature of dialogue is the capacity for a real-time back-channel. The listener may say anything at any point in time, thus allowing for interjections (uh-huh, hmmm, huh?), clarification questions, and other listener-initiated forms of feedback. Elaborating the generative model of the listener to include these verbal behaviors will be critical to explaining other key results from ClarkWilkesGibbs86_ReferringCollaborative, including changes in the frequency of listener-initiated feedback over the course of interaction.

Additionally, such an elaboration would allow our model to address early empirical results from KraussWeinheimer66_Tangrams manipulating the feedback channel: participants were able to talk bidirectionally in one condition, while in another the channel was unidirectional and the speaker was prevented from hearing the listener’s responses. This feedback manipulation was crossed with a behavioral feedback manipulation in which the experimenters intercepted the listener’s responses: one group of speakers was told that their partner made the correct response on 100% of trials (regardless of their real responses), while another was told on half of the trials that their partner had made an incorrect response. Our account of increasing efficiency (P1) predicts that speakers may not have sufficient evidence to justify shorter utterances in the absence of evidence about how their longer descriptions are being interpreted. Indeed, KraussWeinheimer66_Tangrams found that both channels contributed independently to gains in efficiency. Speakers kept using longer utterances when they observed that their partner was making errors, and blocking the real-time verbal back-channel significantly limited reduction, even when speakers were told that their partner was achieving perfect accuracy. In the extreme case of trying to communicate with a listener who cannot respond and also appears not to understand, speaker utterance length actually increased with repetition after an early dip.

More graded disruptions of feedback seem to force the speaker to use more words overall but do not significantly change the rate of reduction. For example, KraussBricker67_Delay tested a transmission delay that temporally shifted feedback and an access delay that blocked the onset of listener feedback until the speaker was finished. Later, KraussEtAl77_AudioVisualBackChannel replicated the adverse effect of delay but showed that undelayed visual access to one’s partner cancelled out the effect and returned the number of words used to baseline. On the listener’s part, too, the ability to actively give feedback appears critical for coordination: SchoberClark89_Overhearers showed that even listeners who overheard the entire game were significantly less accurate than listeners who could directly interact with the speaker, even though they heard the exact same utterances, presumably because the speaker was not able to take their particular sources of confusion into account. Our model provides a formal framework to begin probing how these different forms of communicative feedback may license different inferences about one’s partner in interaction.

5.4 The role of social knowledge

Real-world communities are much more complex than the simple networks we considered: each speaker takes part in a number of overlapping subcommunities. For example, we use partially distinct conventions depending on whether we are communicating with psychologists, friends from high school, bilinguals, or children auer_code-switching_2013. When a scientist is talking to other scientists about their work, they know they can use efficient technical shorthand that they would avoid when talking to their non-expert friends and family. Previous work has probed representations of community membership by manipulating the extent to which cultural background is shared between speaker and listener. For example, IsaacsClark87_ReferencesExpertsNovices paired participants who had either lived in NYC or had never been there for a task referring to landmarks in the city (e.g. “Rockefeller Center”). Within just a few utterances from a novel partner, people could infer whether they were playing with an expert or novice and immediately adjust their language use to be appropriate for this inferred identity. Social information about a partner’s group can be so important that even players in artificial-language games react to the restrictions of social anonymity by learning to identify members of their community using distinctive signals roberts_experimental_2010.

For future work using hierarchical Bayesian models to address the full scale of an individual’s network of communities, additional social knowledge about these communities must be learned and represented in the generative model. Larger-scale networked experiments can be used to evaluate the hypothesis that a hierarchical representation of conventions includes not just a partner-specific level and population-wide level but also intermediate community levels. This hypothesis can be formalized by including additional latent representations of community membership into our hierarchical model. That is, in addition to updating our model of a particular partner based on immediate feedback, even sparse observations of a partner’s language use may license much broader inferences about their lexicon via diagnostic information about their social group or background. If someone’s favorite song is an obscure B-side from an 80s hardcore band, you can make fairly strong inferences about what else they like to listen to and how similar they might be to you VelezEtAl16_Overlaps; GershmanEtAl17_StructureSocialInfluence. Similarly, if someone casually refers to an obscure New York landmark, you may be able to update your beliefs not just about that lexical item but about a number of other lexical conventions shared among New Yorkers. Lexica cluster within social groups, so inverting this relationship can yield rapid lexical learning from inferences about social group membership.

This explanation is also consistent with broader linguistic phenomena outside the realm of repeated reference games. For example, PottsLevy15_Or showed that lexical uncertainty is critical for capturing constructions like “oenophile or wine lover”, where a disjunction of synonymous terms is taken to convey a definition – information about the speaker’s lexicon – rather than a disjoint set of alternatives. While the reasons that speakers produce such constructions are complex, we would expect speakers to be more likely to produce the definitional or when the component word is expected to be rarer or more obscure for a particular partner: that is, when there is additional uncertainty over its likely meaning in the listener’s lexicon.

5.5 Process-level mechanisms for adaptation

Finally, while we have provided a computational-level account of coordination and convention formation in terms of hierarchical inference, many possible process-level mechanisms may perform this computation. In this section, we discuss two interlocking process-level questions: (1) exactly which representations are being adapted, and (2) how does our model scale to larger spaces of utterances and referents?

5.5.1 Which representations are adapted?

While our model formulation focused on adaptation at the level of the lexicon (i.e. inferences about the lexical variables $\phi$ representing different possible meanings), this is only one of many internal representations that may need to be adapted to achieve successful coordination. Three other possible representational bases have been explored in the literature.

First, it is possible that adaptation takes place upstream of the lexicon, directly implicating perceptual or conceptual representations GarrodAnderson87_SayingWhatYouMean; HealeySwobodaUmataKing07_GraphicalLanguageGames. That is, there may also be uncertainty about how a particular partner construes the referent itself, and communication may require constructing a shared, low-dimensional conceptual space where the relevant referents can be embedded stolk2016conceptual. This is particularly clear in the classic maze task GarrodAnderson87_SayingWhatYouMean, where giving effective spatial directions requires speakers to coordinate on which spatial representations to use (e.g. paths, coordinates, lines, or landmarks).

Second, it is possible that adaptation takes place even further upstream, at the level of social representations jaech2018low. Rather than directly updating beliefs about lexical or conceptual representations, we may update a holistic representation of the partner themselves (e.g. a “partner embedding” in a low-dimensional vector space) that is used to retrieve downstream conceptual and lexical representations (see the sketch below). Under this representational scheme, the mapping from the social representation to particular conventions is static, and ad hoc adaptation is limited to learning where a particular partner belongs in the overall social space.
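To make this scheme concrete, here is a minimal sketch (our own illustration, not an implementation from the literature cited above) in which a frozen decoder maps a partner embedding to lexicon weights, and adaptation only moves the embedding:

```python
import torch

torch.manual_seed(0)
EMB_DIM, LEX_DIM = 8, 50  # illustrative sizes

# Static mapping from social space to lexicon weights: frozen after training
decoder = torch.nn.Linear(EMB_DIM, LEX_DIM)
for p in decoder.parameters():
    p.requires_grad_(False)

# Ad hoc adaptation: only the partner embedding is updated
partner_emb = torch.zeros(EMB_DIM, requires_grad=True)
opt = torch.optim.SGD([partner_emb], lr=0.1)

def nll(lexicon_weights, observed_idx):
    # Toy likelihood: an observed utterance-referent pairing under a softmax
    return -torch.log_softmax(lexicon_weights, dim=0)[observed_idx]

for observed_idx in [3, 3, 17]:  # a few observations of this partner
    opt.zero_grad()
    loss = nll(decoder(partner_emb), observed_idx)
    loss.backward()
    opt.step()  # moves the partner in social space; conventions stay fixed
```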

Third, it is possible that expectations about other lower-level features of a partner are also adapted through interaction. For example, interactive alignment accounts pickering2004toward have argued that activating phonetic or syntactic features that are associated with specific lemmas in the lexicon can percolate up to strengthen higher levels of representation roelofs1992spreading; pickering1998representation. Thus, learning about a partner’s word frequency louwerse2012behavior, syntax gruberg2019syntactic; levelt1982surface, body postures lakin2003using, speech rate giles1991contexts, or even informational complexity abney2014complexity could be functionally useful for communication if they covary with, or are informative about, higher-level representations.

5.5.2 Scalability to larger utterance and referent spaces

While a fully Bayesian formulation elegantly captures the inference problem at the core of our theory, the posterior update step in Eq. 4 grows increasingly intractable as the space of utterances and referents grows. This computational limitation is especially evident when considering how to quantitatively fit models to the natural-language data produced in unconstrained reference games, or in applications in modern machine learning where we would like to build artificial agents that can adapt to human partners. Generalizing our framework to arbitrary natural language (e.g. referring expressions using the full vocabulary of an adult language user) and arbitrary visual input (e.g. sensory impressions of novel objects such as tangrams) not only requires a different representational space for language and visual input; it may also require different inference algorithms.

The first computational obstacle is the lexical representation. A discrete matrix containing an entry for each utterance-referent pair, as typically used in convention formation simulations, has two primary limitations as a proxy for human representations: (1) it grows quadratically as additional utterances and referents are added, while becoming increasingly sparse (most words apply to only a relatively small set of referents), and (2) it does not straightforwardly represent similarity relationships between utterances and referents (a new referent must be added as a distinct new column, even if it is visually similar to a known referent). This limitation is especially clear when referents are presented as raw images; a continuous embedding space, sketched below, is one natural alternative.
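One hedged illustration of such an alternative, with names and dimensions of our own invention: score each utterance-referent pair by the similarity of learned embeddings, so that novel referents inherit sensible meanings from visually similar known ones rather than requiring a new matrix column.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16  # shared embedding dimension (illustrative)

# Learned utterance embeddings stand in for rows of the discrete lexicon
utt_emb = {"blue one": rng.normal(size=D), "the tangram": rng.normal(size=D)}

def image_encoder(image):
    # Placeholder for any visual encoder mapping raw input to R^D;
    # here we hash the input to a deterministic pseudo-embedding.
    local = np.random.default_rng(abs(hash(image)) % (2**32))
    return local.normal(size=D)

def lexical_score(utterance, image):
    # Similarity replaces a discrete matrix lookup: O(D) parameters per item
    # instead of one independent entry per utterance-referent pair.
    u, r = utt_emb[utterance], image_encoder(image)
    return float(u @ r / (np.linalg.norm(u) * np.linalg.norm(r)))

# A novel referent needs no new column: its score falls out of the geometry.
print(lexical_score("the tangram", "novel_image_042.png"))
```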

The second computational obstacle is the inference algorithm. Even if a more scalable family of lexical representations is chosen, any parameterization of this representation will be much higher-dimensional than we have considered. For example, if the lexical variables $\phi$ are defined to be the weights of a neural network, then maintaining uncertainty would require placing a prior over the weights, and the posterior update would require a difficult integral over a very high-dimensional support. The resulting model is a (hierarchical) Bayesian neural network mackay1992practical; neal2012bayesian. While recent algorithmic breakthroughs have made (approximate) inference for such networks increasingly tractable <e.g.¿hernandez2015probabilistic,lacoste2018uncertainty,dusenberry2020efficient,izmailov2020subspace, they remain untested as cognitive models.
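To see the scale of the problem, here is a toy sketch (ours, with illustrative names and a stand-in likelihood): even a tiny one-layer lexicon model turns the posterior update into an integral that must be approximated, e.g. by crude Monte Carlo over weight samples, almost all of which receive negligible weight in high dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_WEIGHTS = 16 * 50   # weights of even a tiny one-layer lexicon model
N_SAMPLES = 10_000    # crude Monte Carlo in place of the exact integral

def likelihood(weights, data):
    # Placeholder for p(data | phi); a real model would put a full
    # RSA listener/speaker likelihood here.
    return np.exp(-0.5 * np.sum((weights[:4] - data) ** 2))

data = np.array([0.1, -0.3, 0.2, 0.0])
prior_samples = rng.normal(size=(N_SAMPLES, N_WEIGHTS))  # samples from p(phi)
liks = np.array([likelihood(w, data) for w in prior_samples])

# Posterior mean of phi via self-normalized importance sampling; in high
# dimensions the estimate degenerates onto a handful of lucky samples.
posterior_mean = (liks[:, None] * prior_samples).sum(0) / liks.sum()
```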

While these obstacles are significant, we argue that our hierarchical Bayesian framework is nonetheless well-poised to explain coordination and convention formation at scale. In particular, recent formal connections between hierarchical Bayes and gradient-based meta-learning approaches in machine learning grant_recasting_2018 suggest an alternative algorithm that relaxes the full prior over $\phi$ to a point estimate and replaces the difficult integral in the posterior update with a handful of (regularized) gradient steps (see the sketch below). This more tractable algorithm approximates the same computational-level problem we have formulated but provides a different perspective: conventions are learned initializations, and coordination is partner-specific fine-tuning or domain adaptation of vector representations. The fine-tuning approach, initialized with a state-of-the-art neural language model, has recently accounted for psycholinguistic data on reading times van2018neural as well as ad hoc convention formation in reference games using (unseen) natural images as referents hawkins2019continual.
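A minimal sketch of this reduction, with all names and hyperparameters our own illustrative assumptions: the shared initialization plays the role of the convention-level prior, and a few gradient steps regularized back toward it play the role of the partner-specific posterior update.

```python
import copy
import torch

torch.manual_seed(0)

# Point estimate of the convention-level prior: a learned initialization
prior_model = torch.nn.Linear(16, 50)  # stands in for a full lexicon model

def adapt_to_partner(prior, observations, steps=3, lr=0.05, reg=1.0):
    """Approximate the posterior update with regularized gradient steps."""
    model = copy.deepcopy(prior)
    prior_params = [p.detach().clone() for p in prior.parameters()]
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        for x, y in observations:
            opt.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(x), y)
            # L2 pull toward the initialization ~ a Gaussian prior on phi
            for p, p0 in zip(model.parameters(), prior_params):
                loss = loss + reg * (p - p0).pow(2).sum()
            loss.backward()
            opt.step()
    return model

# Toy partner-specific data: (referent features, observed utterance index)
obs = [(torch.randn(1, 16), torch.tensor([3]))]
partner_model = adapt_to_partner(prior_model, obs)
```

The regularizer is the design choice that preserves the hierarchical structure: without the pull back toward the initialization, partner-specific fine-tuning would overwrite the community-level conventions rather than deviating from them only as far as the data warrant.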

Conclusion

We have argued that successful communication depends on continual learning across multiple timescales. We must coordinate on ad hoc meaning through common ground with individual partners but also abstract away from these experiences to represent stable, generalizable conventions and norms. Like other socially grounded knowledge, language is not a rigid dictionary that we acquire at an early age and deploy mechanically for the rest of our lives. Nor does language change only over the slow timescales of inter-generational drift. Language is a means for communication – a shared interface between minds – and must therefore adapt over the rapid contextual timescales that communication requires. As new ad hoc concepts arise, new ad hoc conventions must be formed to solve the new coordination problems they pose. In other words, we are constantly learning language. Not just one language, but a family of related languages, across interactions with each partner.

Let us conclude not that ‘there is no such thing as a language’ that we bring to interaction with others. Say rather that there is no such thing as the one total language that we bring. We bring numerous only loosely connected languages from the loosely connected communities that we inhabit. hacking1986nice

References

Appendix A: Details of RSA model

Our setting poses several technical challenges for the Rational Speech Act (RSA) framework. In this Appendix, we describe these challenges in more detail and justify our choices.

A.1 Handling degenerate lexicons

First, when we allow the full space of possible lexicons $\phi$, we must confront degenerate lexicons in which an utterance $u$ is literally false of every object in context, i.e. where $\mathcal{L}_\phi(u, o) = \textit{false}$ for all $o \in \mathcal{C}$. In this case, the normalizing constant in Eq. 2 is zero, and the literal listener distribution $L_0$ is not well-defined. A similar problem arises when no utterance in the speaker’s repertoire is true of the target, in which case the speaker distribution $S_1$ is not well-defined.

Several solutions to this problem were outlined by bergen_pragmatic_2016. One of these is to use a ‘softer’ semantics in the literal listener, where a Boolean value of false does not strictly rule out an object but instead assigns a very low numerical score $\epsilon$, e.g.

$$L_0(o \mid u, \phi) \propto \begin{cases} 1 & \text{if } \mathcal{L}_\phi(u, o) = \textit{true} \\ \epsilon & \text{otherwise.} \end{cases}$$

Whenever there is at least one object $o$ for which $u$ is true, this formulation assigns negligible listener probability to objects where $u$ is false, but it ensures that the normalization constant of $L_0$ is non-zero even when $u$ is false of all objects.
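For concreteness, here is a small sketch (our own, with an illustrative value of $\epsilon$) of this soft literal listener, which also exhibits the failure mode described next: when an utterance is false of every object, renormalization yields a uniform distribution.

```python
import numpy as np

EPS = 1e-6  # soft truth value for 'false' (illustrative)

def literal_listener(truth_row):
    # truth_row: Boolean truth values of one utterance over objects in context
    scores = np.where(truth_row, 1.0, EPS)
    return scores / scores.sum()  # normalizer is non-zero even if all-false

print(literal_listener(np.array([True, False, False])))   # ~[1, 0, 0]
print(literal_listener(np.array([False, False, False])))  # uniform [1/3]*3
```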

While this solution suffices for one-shot pragmatics under lexical uncertainty, where $\epsilon$ may be calibrated to be appropriately small, it runs into several technical complications in an iterated setting. First, due to numerical underflow at sufficiently high values of $\alpha$ and at later iterations, elements may drop entirely out of the support at higher levels of recursion (e.g. in the pragmatic speaker distribution $S_1$), leading the normalization constant to return to zero. Second, this ‘soft’ semantics creates unexpected and unintuitive consequences at the level of the pragmatic speaker. After renormalization in $L_0$, an utterance that fails to refer to any object in context is by definition equally successful for all objects (i.e. evaluating to $1/|\mathcal{C}|$ for every object), leading to a uniform selection distribution. Consequently, $S_1$ may in some cases prefer utterances that are literally false of the target just as much as utterances that are true.

Instead of injecting $\epsilon$ into the lexical meaning, we ensure that the normalization constant is well-defined by adapting another method suggested by bergen_pragmatic_2016. First, we add a ‘null’ object $o_\varnothing$ to every context so that, even if a particular utterance is false of every real object in context, it still applies to the null object, leaving the true target a negligible probability of being chosen. Intuitively, the null object can be interpreted as recognizing that the referring expression has a referent, but that referent is not in context. Second, we add an explicit noise model at every level of recursion. That is, we assume every agent has some small probability $\epsilon$ of choosing an element of their support uniformly at random, ensuring a fixed non-zero floor on the likelihood of each element that is constant across levels of recursion. Formally, this corresponds to a mixture distribution, e.g.

$$L_0^\epsilon(o \mid u, \phi) = (1 - \epsilon) \cdot L_0(o \mid u, \phi) + \epsilon \cdot \mathrm{Uniform}(\mathcal{C} \cup \{o_\varnothing\}).$$
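As a sanity check on this construction, here is a brief sketch (our own illustration; the noise rate and layout are assumptions) showing that the null object and noise mixture keep every distribution well-defined without rewarding false utterances:

```python
import numpy as np

EPS = 0.01  # agent-level noise rate (illustrative)

def literal_listener(truth_row):
    # Append the null object, which is always 'true', so the
    # normalizer can never be zero.
    row = np.append(truth_row, True).astype(float)
    base = row / row.sum()
    noise = np.ones_like(base) / base.size
    return (1 - EPS) * base + EPS * noise  # mixture with uniform noise

# Utterance true of object 0 only: the target gets nearly all the mass.
print(literal_listener(np.array([True, False, False])))
# Utterance false of every real object: mass piles on the null object,
# so a pragmatic speaker gains nothing by producing it.
print(literal_listener(np.array([False, False, False])))
```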