Objects are best understood in terms of their structure and function, both of which are built on a foundation of object parts and their relations [10, 9, 55, 8]. Natural languages have been optimized across human history to solve the problem of efficiently communicating the aspects of the world most relevant to one’s current goals [22, 12]. As such, languages can provide an effective medium to describe the shapes and the parts of different objects, and to express object differences. For example, when we see a chair we can decompose it into semantically meaningful parts, like a back and a seat, and can combine words to create utterances that reflect their geometric and topological shape properties, e.g. ‘wide seat with a solid back’. Moreover, given a specific communication context, we can craft references that are not merely true, but which are also relevant: i.e. we can refer to the lines found in a chair’s back to distinguish it among other similar objects (see Fig. 1).
In this paper we explore this interplay between natural, referential language and the shape of common objects. While a great deal of recent work has explored visually-grounded language understanding [20, 32, 52, 27, 26, 51], the resulting models have limited capacity to reflect the geometry and topology (i.e. the shape) of the underlying objects. This is because reference in previous studies was possible using properties like color, or properties regarding the object and its hosting environment (e.g. its location, absolute or relative to other objects). Indeed, eliciting natural language that refers only to shape properties requires carefully controlling the objects, their presentation, and the linguistic task. To address such challenges, we use pure 3D representations of objects (CAD models), which allow for flexible and controlled presentation (i.e. textureless, uniform-color objects, viewed without obstruction in a fixed pose). We further make use of the 3D form to construct a reference game task in which the referred object is similar shape-wise to the contrasting objects. The result of this effort is a new multimodal dataset, termed CiC (Chairs in Context), comprising 4,511 unique chairs from ShapeNet and 78,789 referential utterances. In CiC chairs are organized into 4,054 sets of size 3 (representing contrastive communication contexts) and each utterance is intended to distinguish a chair in context. The visual differences among the grouped objects require a deep understanding of very fine-grained shape properties (especially for Hard contexts, see Section 2); the language that people use to do so is correspondingly complex, exhibiting rich compositionality.
We use CiC to train and analyze a variety of modern neural language understanding (listening) and production (speaking) models. These models vary in their grounding (pure 3D forms via point-clouds vs. rendered 2D images), the degree of pragmatic reasoning captured (e.g. speakers that reason about a listener or not) and the neural architecture (e.g. with or without word attention, and with context-free or context-aware object encoders). We evaluate these models on the original reference game task with both synthetic and human partners, and with held out utterances and objects, finding strong performance. Since language conveys abstractions, such as object parts, that are shared between object categories, we hypothesized that our models learn robust representations that are transferable to objects of unseen classes (e.g. training on chairs while testing on lamps). Indeed, we show that these models have strong generalization capacity to novel object categories, as well as to real-world colored images drawn from furniture catalogs.
Finally, we explore how our models are succeeding on these communication tasks. We demonstrate that the neural listener learns to prioritize the same abstractions in objects (i.e. properties of chair parts) that humans do in solving the communication task, despite never being provided with an explicit decomposition of these objects into parts. Similarly, we show that transfer learning to novel object classes is most successful when known part-related words are available. Last, we show that a neural speaker that is pragmatic (planning utterances in order to convey the right target object to an imagined listener) produces significantly more informative utterances than a literal (listener-unaware) speaker, as measured by human performance in identifying the correct object.
2 Dataset and task
CiC (Chairs in Context) consists of triplets of chairs coupled with referential utterances that aim to distinguish one chair (the “target”) from the remaining two (the “distractors”). To obtain such utterances, we paired participants from Amazon’s Mechanical Turk (AMT) to play an online reference game. On each round of the game, the two players were shown the same triplet of chairs. The designated target chair was privately highlighted for one player (the “speaker”), who was asked to send a message through a chat box such that their partner (the “listener”) could successfully select it from the context. To ensure speakers used only shape-related information, we scrambled the positions of the chairs for each participant independently and used textureless, uniform-color renderings of pre-aligned 3D CAD models, taken from the same viewpoint. To ensure communicative interaction was natural, no constraints were placed on the chat box: referring expressions from the speaker were occasionally followed by clarification questions from the listener or other discourse.
A key decision in building our dataset concerned the construction of contexts that would reliably elicit diverse and potentially very fine-grained contrastive language. To achieve diversity we considered all 7,000 chairs of ShapeNet. This object class is geometrically complex, highly diverse, and abundant in the real world. To control the granularity of the fine-grained distinctions necessary to solve the communication task, we constructed two types of contexts: Hard contexts consisted of chairs that are very similar shape-wise, and Easy contexts consisted of less similar chairs. To measure shape similarity in an unsupervised manner, we used the latent space of a Point Cloud AutoEncoder (PC-AE). We note that point clouds are an intrinsic representation of a 3D object, oblivious to color or texture. After extracting a 3D point cloud from the surface of each ShapeNet model, we computed the underlying K-nearest-neighbor graph among all models according to their PC-AE embedding distances. For each chair with sufficiently high in-degree on this graph (corresponding, intuitively, to a canonical chair), we contrasted it with four distractors: the two closest to it in latent space, and two that were sufficiently far (see inset for a demonstration and the Appendix for additional details). Last, to reduce potential data biases, we counterbalanced each communication context by considering every chair of a given context as the target in at least four games.
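The Hard/Easy context construction above can be sketched as follows. This is a minimal illustration assuming precomputed PC-AE embeddings; the function name, neighbor counts, percentile threshold, and in-degree cutoff are illustrative choices, not the paper's exact values:

```python
import numpy as np

def build_contexts(embeddings, n_neighbors=2, far_percentile=75, min_in_degree=3):
    """Sketch of Hard/Easy context construction from PC-AE embeddings.

    embeddings: (N, D) array of per-chair latent codes.
    Returns a list of (target, hard_distractors, easy_distractors) index tuples.
    """
    # Pairwise Euclidean distances in latent space.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude self-distances

    # K-nearest-neighbor graph: edge i -> j if j is among i's k closest.
    nn = np.argsort(dists, axis=1)[:, :n_neighbors]
    in_degree = np.bincount(nn.ravel(), minlength=len(embeddings))

    far_cut = np.percentile(dists[np.isfinite(dists)], far_percentile)
    contexts = []
    for t in range(len(embeddings)):
        if in_degree[t] < min_in_degree:  # keep only "canonical" chairs
            continue
        hard = nn[t].tolist()  # the closest neighbors in latent space
        far = np.where((dists[t] > far_cut) & np.isfinite(dists[t]))[0]
        easy = far[:2].tolist()  # sufficiently-far chairs
        if len(easy) == 2:
            contexts.append((t, hard, easy))
    return contexts
```

Each selected chair thus anchors both a Hard context (itself plus its nearest neighbors) and an Easy context (itself plus far-away chairs).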
Before presenting our neural agents, we identify some distinctive properties of our corpus. Human performance on the reference game was high, but listeners made significantly more errors on the Hard triplets. Also, in Hard triplets longer utterances were used to describe the targets (on average 8.4 words vs. 6.1). A wide spectrum of descriptions was elicited, ranging from the more holistic/categorical (e.g. “the rocking chair”), common for Easy triplets, to more complex and fine-grained language (e.g. “thinner legs but without armrests”), common for Hard triplets. Interestingly, 78% of the utterances used at least one part-related word: “back”, “legs”, “seat”, “arms”, or closely related synonyms, e.g. “armrests”.
3 Neural listeners
Constructing neural listeners that reason effectively about shape properties is a key contribution of our work. Below we conduct a detailed comparison between three distinct architectures, highlight the effect of different regularization techniques, and investigate the merits of different representations of 3D objects for the listening task, namely 2D rendered images and 3D surface point clouds. In what follows, a communication context consists of three objects, a word-tokenized utterance, and a designated target.
Our proposed listener builds on prior work on grounded reference. It takes as input a (latent code) vector that captures shape information for each object in the context, and a (latent code) vector for each token of the utterance, and outputs an object–utterance compatibility score for each input object. At its core lies a multi-modal LSTM that receives as input (“is grounded” with) the vector of a single object, processes the word sequence, and is read out by a final MLP to yield a single number (the compatibility score). This is repeated for each object, sharing all network parameters across the objects. The resulting three scores are softmax-normalized and compared to the ground-truth indicator vector of the target under the cross-entropy loss. (Architecture details, hyper-parameter search strategy, and optimal hyper-parameters for all experiments are described in the Appendix.)
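The per-object scoring and softmax normalization can be illustrated with a toy sketch, in which the grounded LSTM+MLP is stood in for by an arbitrary score function shared across objects; all names are illustrative:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of scores."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def listener_loss(object_codes, utterance, target_idx, score_fn):
    """Score each object against the utterance with a shared-parameter
    scorer, softmax-normalize over the context, and return the
    cross-entropy loss against the one-hot target indicator."""
    scores = np.array([score_fn(obj, utterance) for obj in object_codes])
    probs = softmax(scores)            # distribution over the three objects
    loss = -np.log(probs[target_idx])  # cross-entropy with one-hot target
    return loss, probs
```

Because the same `score_fn` is applied to every object, the listener can in principle be evaluated on contexts of any size, a flexibility noted later for the Baseline architecture.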
Object encoders We experimented with three object representations to capture the underlying shapes: (a) the bottleneck vector of a pretrained Point Cloud AutoEncoder (PC-AE), (b) the embedding provided by a convolutional network operating on single-view images of non-textured 3D objects, or (c) a combination of (a) and (b). Specifically, for (a) we use a PC-AE trained with single-class point clouds extracted from the surfaces of 3D CAD models, while for (b) we use the activations of the penultimate layer of a VGG-16, pre-trained on ImageNet and fine-tuned on an 8-way classification task with images of objects from ShapeNet. For each representation we project the corresponding latent code vector to the input space of the LSTM using a fully connected (FC) layer with L2-norm weight regularization. The addition of these projection layers improves the training and convergence of our system.
While there are many ways to simultaneously incorporate the two modalities in the LSTM, we found that the best performance resulted when we ground the LSTM with the image code, concatenate the LSTM’s final output (after processing the utterance) with the point cloud code, and feed the concatenated result to a shallow MLP to produce the compatibility score. We note that grounding the LSTM with point clouds and using images towards the end of the pipeline resulted in a significant performance drop. Also, proper regularization was critical: adding dropout at the input layer of the LSTM, as well as weight regularization and dropout at and before the FC projection layers, improved performance. The token codes of each sentence were initialized with GloVe embeddings and fine-tuned for the listening task.
Incorporating context information Our proposed baseline listener architecture (Baseline, just described) first scores each object separately, then applies softmax normalization to yield a score distribution over the three objects. We also consider two alternative architectures that explicitly encode information about the entire context while scoring an object. The first alternative (Early-Context) is identical to the proposed architecture, except for the codes used to ground the LSTM. Specifically, if v_i is the image code vector of the i-th object produced by VGG, then instead of using v_i directly as the grounding vector, a shallow convolutional network is introduced. This network, whose output is the grounding code, receives as input the feature-wise concatenation of v_i with symmetric (and norm-normalized) max-pool and mean-pool aggregations of the codes of the two alternative, contrastive objects; we use symmetric functions to encode the fact that object order is irrelevant for our task. The second alternative architecture (Combined-Interpretation) first feeds the image vectors of all three objects sequentially to the LSTM as inputs and then processes the tokens of the utterance once, to yield the three scores. As in the Baseline architecture, point clouds are incorporated in both alternatives via a separate MLP after the LSTM.
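A sketch of the Early-Context grounding signal, assuming per-object image codes are already available; the shallow convolutional network that consumes this signal is omitted, and all names are illustrative:

```python
import numpy as np

def grounding_signal(image_codes, i):
    """Context-aware grounding input for object i (Early-Context variant):
    the object's own code concatenated with order-invariant (symmetric)
    max- and mean-pools over the remaining (contrastive) objects' codes."""
    v_i = image_codes[i]
    others = np.stack([v for j, v in enumerate(image_codes) if j != i])
    mx = others.max(axis=0)
    mn = others.mean(axis=0)
    mx = mx / (np.linalg.norm(mx) + 1e-8)  # norm-normalize the pooled codes
    mn = mn / (np.linalg.norm(mn) + 1e-8)
    return np.concatenate([v_i, mx, mn])   # feature-wise concatenation
```

Because max- and mean-pooling are symmetric in their inputs, swapping the two contrastive objects leaves the signal unchanged, which is exactly the order-invariance property the architecture is meant to encode.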
Word attention We hypothesized that a listener forced to prioritize a few words in each utterance would learn to prioritize words that express properties distinguishing the target from the distractors (and, thus, perform better). To test this hypothesis, we augment the listener models with a standard bilinear attention mechanism. Specifically, to estimate the “importance” of each text token, we compare the output of the LSTM at step t (denoted h_t) with the hidden state after the entire utterance has been processed (denoted h_T). The relative importance of each word is a_t ∝ exp(h_t · (D h_T)), where D is a trainable diagonal matrix (so that D h_T reduces to a point-wise product with the diagonal of D). The final output of the LSTM uses this attention to combine all latent states: h = Σ_t a_t h_t, where the a_t are softmax-normalized.
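In numpy, the bilinear word-attention step might look like this; a sketch in which the diagonal matrix is represented by its diagonal vector, and all names are illustrative:

```python
import numpy as np

def bilinear_attention(H, d):
    """Bilinear word attention over LSTM outputs.

    H: (T, D) per-token LSTM outputs h_1..h_T.
    d: (D,) diagonal of the trainable matrix D.
    Returns softmax-normalized weights a (T,) and the combined output.
    """
    h_T = H[-1]                         # hidden state after the full utterance
    scores = H @ (d * h_T)              # h_t . (D h_T) with diagonal D
    a = np.exp(scores - scores.max())
    a = a / a.sum()                     # softmax-normalized importances
    out = (a[:, None] * H).sum(axis=0)  # weighted combination of states
    return a, out
```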
4 Listener experiments
We begin our evaluation of the proposed listeners using two reference tasks based on different data splits. In the language generalization task, we test on target objects that were seen as targets in at least one context during training, but ensure that all utterances in the test split are from unseen speakers. In the more challenging object generalization task, we restrict the set of objects that appeared as targets in the test set to be disjoint from those in training, such that all speakers and objects in the test split are new. For each of these tasks, we evaluate choices of input modality and word attention, splitting the data into training, validation, and test portions.
Baseline listener accuracies are shown in Table 2. Overall the model achieves good performance. As expected, all listeners have higher accuracy on the language generalization task. The attention mechanism on words yields a mild performance boost, as long as images are part of the input. Interestingly, images provide a significantly better input than point clouds when only one modality is used. This may be due to the higher-frequency content of images (we use point clouds with only 2,048 points), or the fact that VGG was pre-trained while the PC-AE was not. However, we find significant gains in accuracy from exploiting the two object representations simultaneously, implying a complementarity between them.
Next, we evaluate how the different approaches for incorporating context information described in Section 3 affect listener performance. We focus on the more challenging object generalization task, using listeners that include attention and both object modalities. We report the findings in Table 1. We find that the Baseline and Early-Context models perform best overall, outperforming the Combined-Interpretation model, which does not share weights across objects. This pattern held for both Hard and Easy trial types in our dataset. We further explored the small portion of our test set that uses explicitly contrastive language: superlatives (“skinniest”) and comparatives (“skinnier”). Somewhat surprisingly, we find that the Baseline architecture remains competitive against the architectures with more explicit context information. The Baseline model thus achieves high performance and is the most flexible (at test time it can be applied to arbitrary-sized contexts); we focus on this architecture in the explorations below.
4.1 Exploring learned representations
Which aspects of a sentence are most critical for our listener’s performance? To inspect the properties of words receiving the most attention, we ran a part-of-speech tagger on our corpus. We found that the highest attention weight is placed on nouns, controlling for the length of the utterance. However, adjectives that modify nouns received more attention in hard contexts (controlling for the average occurrence in each context), where nouns are often not sufficient to disambiguate (see Fig. 2A). To more systematically evaluate the role of higher-attention tokens in listener performance, we conducted an utterance lesioning experiment. For each utterance in our dataset, we successively replaced words with the <UNK> token according to three schemes: (1) from highest attention to lowest, (2) from lowest attention to highest, and (3) in random order. We then fed these through an equivalent listener trained without attention. We found that up to 50% of words can be removed without much performance degradation, but only if these are low attention words (see Fig. 2B). Our word-attentive listener thus appears to rely on context-appropriate content words to successfully disambiguate the referent.
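The utterance-lesioning schemes above can be sketched as follows; this is a minimal re-implementation for illustration, and the token and function names are assumptions:

```python
import random

def lesion_utterance(tokens, attention, frac, scheme, rng=None):
    """Replace a fraction of tokens with <UNK>, ordered by attention weight.

    scheme: 'high' (highest-attention words removed first),
            'low'  (lowest-attention words removed first),
            'random' (random order).
    """
    k = int(round(frac * len(tokens)))
    order = sorted(range(len(tokens)), key=lambda t: attention[t])
    if scheme == 'high':
        order = order[::-1]
    elif scheme == 'random':
        (rng or random).shuffle(order)
    drop = set(order[:k])
    return [('<UNK>' if t in drop else w) for t, w in enumerate(tokens)]
```

Running the listener on utterances lesioned under each scheme, at increasing fractions, reproduces the kind of degradation curves summarized in Fig. 2B.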
To test the extent to which our listener relies on the same semantic parts of the object as humans, we next conducted a lesion experiment on the visual input. We took the subset of our test set where (1) all chairs had complete part annotations available and (2) the corresponding utterance mentioned a single part (17% of our test set). We then created lesioned versions of all three objects on each trial by removing the image pixels (and/or points, when point clouds are used) corresponding to parts, according to two schemes: removing a single part or keeping a single part. We did this either for the mentioned part or for another part chosen at random. We report listener accuracies on these lesioned objects in Table 3. We found that removing random parts hurts accuracy by 10.4% on average, but removing the mentioned part dropped accuracy more than three times as much, nearly to chance. Conversely, keeping only the mentioned part while lesioning the rest of the image merely drops accuracy by 10.6%, while keeping a non-mentioned (random) part alone brings accuracy down close to chance. In other words, on trials where participants depended on information about a part to communicate the object to their partner, we found that visual information about that part was both necessary and sufficient for the performance of our listener model.
5 Neural speakers
We next explore models that learn to generate an utterance that refers to the target and distinguishes it from the distractors. As with the neural listeners, the heart of these models is an LSTM, which encodes the objects of a communication context and then decodes an utterance. Specifically, for an image-based model, on the first three time steps the LSTM input is the VGG code of each object. Correspondingly, for a point-cloud-based model, the LSTM input is the object codes extracted from a PC-AE. During training, after the objects are encoded, the remaining input to the LSTM is the ‘current’ utterance token, while the output of the LSTM is compared with the ‘next’ utterance token under the cross-entropy loss. The target object is always presented last, eliminating the need to represent the index of the target separately. To find the best model hyper-parameters, we used a pretrained listener to select the result with the highest listener accuracy. We found this approach to produce models and parameters that yield better quality utterances than evaluating with listening-unaware metrics like BLEU.
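The teacher-forced arrangement of a single speaker training example might be sketched like this; the `<SOS>`/`<EOS>` tokens and the `None` masking of the object-encoding steps are illustrative assumptions:

```python
def speaker_training_sequence(object_codes, tokens):
    """Arrange a training example for the speaker LSTM: the three object
    codes are fed first (target last), then the utterance is consumed with
    teacher forcing -- the input is the 'current' token and the prediction
    target is the 'next' token. No loss is applied on the object steps."""
    padded = ['<SOS>'] + tokens + ['<EOS>']
    inputs = list(object_codes) + padded[:-1]
    targets = [None] * len(object_codes) + padded[1:]
    return inputs, targets
```

At generation time the same network is run autoregressively: after the three object codes, each sampled token is fed back in as the next input.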
The above (literal) speakers can learn to generate language that discriminates targets from distractors. To test the degree to which distractor objects are used for generation, we experiment with context-unaware speakers that are provided the encoding of the target object only, and are otherwise identical to the above models. Motivated by the recursive social reasoning characteristic of human pragmatic language use (as formalized in the Rational Speech Act framework), we create pragmatic speakers that choose utterances according to their capacity to be discriminative, as judged by a pretrained “internal” listener. In this case, we sample utterances from the (literal) speakers, but score (i.e. re-rank) them with:
score(u) = β log p_L(t | u) + (1 − β) log p_S(u) / |u|^α,     (1)

where p_L(t | u) is the listener’s probability of predicting the target t given utterance u, and p_S(u) is the likelihood of the literal speaker generating u. The parameter α controls a length-penalty term that discourages short sentences, while β controls the relative importance of the listener’s vs. the speaker’s opinions.
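A hedged sketch of the pragmatic re-ranking step, with the internal listener's target probability and the literal speaker's log-likelihood supplied as callables. The exact combination in Eq. 1 is reconstructed here as a weighted, length-normalized sum of log-scores; the parameter and function names are illustrative:

```python
import math

def pragmatic_rerank(utterances, listener_prob, speaker_logprob,
                     alpha=0.6, beta=0.5):
    """Re-rank sampled literal-speaker utterances by a pragmatic score
    combining the internal listener's probability of the target with the
    length-normalized speaker likelihood (alpha: length penalty,
    beta: listener-vs-speaker trade-off)."""
    def score(u):
        p_l = listener_prob(u)    # internal listener: P(target | u)
        lp_s = speaker_logprob(u) # literal speaker: log P(u)
        return beta * math.log(p_l) + (1 - beta) * lp_s / (len(u) ** alpha)
    return max(utterances, key=score)
```

With beta near 1 the re-ranker follows the listener almost exclusively; with beta near 0 it reduces to length-normalized literal decoding.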
6 Speaker experiments
Qualitatively, our speakers produce good object descriptions (see Fig. 3 for examples), with the pragmatic speakers yielding more discriminating utterances; the project’s webpage contains additional qualitative results. To quantitatively evaluate the speakers, we measure their success in reference games with two different kinds of partners: an independently-trained listener model and human listeners. To conduct a fair study when using a neural listener, we split the training data in half: the evaluating listener was trained on one half, while the scoring (or “internal”) listener used by the pragmatic speaker was trained on the other. For our human evaluation, we used the literal and pragmatic variants to generate referring expressions on the test set (here we use all training data to train the internal listeners). We then showed these referring expressions to participants recruited on AMT and asked them to select the object from the context that the speaker was referring to. We collected multiple responses for each triplet, using unique triplets from the object-generalization test split, annotated separately by each speaker model. The synthetic utterances used were the highest scoring ones (Eq. 1) for each model, with the length-penalty and trade-off parameters chosen on the validation split. We note that while the point-based speakers operate solely on point-cloud representations, we presented their produced utterances to AMT participants alongside CAD-rendered images, to keep the human-side presentation identical across experiments.
We found (see Table 4) that our pragmatic speakers perform best with both synthetic and human partners. While their success with the synthetic listener model may be unsurprising, given the architectural similarity of the internal listener and the evaluating listener, human listeners were also markedly better at picking out the target on utterances produced by the pragmatic vs. the literal speaker for the best-performing (image-based) variant. We also found an asymmetry between the listening and speaking tasks: while context-unaware listeners achieved high performance, context-unaware speakers fare significantly worse than context-aware ones. Last, we note that both literal and pragmatic speakers produce succinct descriptions, but the pragmatic speakers use a much richer vocabulary (more unique nouns and more unique adjectives, after controlling for the discrepancy in average sentence length).
7 Out-of-distribution transfer learning
Language is abstract and compositional. These properties make language use generalizable to new situations (e.g. using concrete language in novel scientific domains) and robust to low-level perceptual variation (e.g. lighting). In our final set of experiments we examine the degree to which our neural listeners and speakers learn representations that are correspondingly robust: representations that capture associations between the visual and the linguistic domains and permit generalization out of the training domain.
Understanding out-of-class reference
To test the generalization of listeners to novel stimuli, we collected referring expressions in communication contexts made of ShapeNet objects drawn from new classes: beds, lamps, sofas and tables. These classes are distinct from chairs, but share some parts and properties, making transfer possible for a sufficiently compositional model. For each of these classes we created 200 contexts made of random triplets of objects; we collected 2 referring expressions for each target in each context (from participants on AMT). Examples of visual stimuli and collected utterances are shown in Fig. 4 (bottom row). To this data, we applied an (image-only, with/without-attention) listener trained on the CiC (i.e. chairs) data. We avoid using point clouds since, unlike VGG, which was fine-tuned on multiple ShapeNet classes, the PC-AE was pre-trained on a single class.
As shown in Table 5, the average accuracy is well above chance in all transfer categories (56% on average). Moreover, constraining the evaluation to utterances that contain only words in the CiC training vocabulary (75% of all utterances, column: known) only slightly improves the results. This is likely because utterances with unknown words still contain enough known vocabulary for the model to determine meaning. We further dissect the known population into utterances that contain part-related words (with-part) and their complement (without-part). For the training domain of chairs, without-part utterances yield slightly higher accuracy. However, the useful subcategories that support this performance (e.g. “recliner”) do not support transfer to new categories. Indeed, we observe that for the transfer classes (except sofa) the listener performs better when part-related words are present. Furthermore, the performance gap between the two populations appears to grow as the perceptual distance between the transfer and training domains increases (compare sofas to lamps).
Describing real images
Transfer from synthetic data to real data is often difficult for modern machine learning models, which are attuned to subtle statistics of the data. We explored the ability of our models to transfer to real chair images (rather than the training images, which were rendered without color or texture from CAD models) by curating a collection of 300 chair images from online furniture catalogs. These images were taken from a similar viewpoint to that of the training renderings and have rich color and texture content. We applied the (image-only) pragmatic speaker to these images, after subtracting the average ImageNet RGB values (i.e. before passing the images to VGG). Examples of the speaker’s productions are shown in Figure 4. For each chair, we randomly selected two distractors and asked 2 AMT participants to guess the target given the utterance produced by our speaker. Human listeners correctly guessed the target chair most of the time. Our speaker appears to transfer successfully to real images, which contain color, texture, pose variation, and likely other differences from our training data.
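The preprocessing applied to the catalog images amounts to per-channel mean subtraction; a minimal sketch, where the mean values are the standard ImageNet means commonly used with VGG (the function name is illustrative):

```python
import numpy as np

# Per-channel ImageNet mean RGB values commonly used for VGG preprocessing.
IMAGENET_MEAN_RGB = np.array([123.68, 116.779, 103.939])

def preprocess_for_vgg(image):
    """Subtract the per-channel ImageNet mean from an (H, W, 3) RGB image,
    as done before feeding catalog photos to the VGG encoder."""
    return image.astype(np.float32) - IMAGENET_MEAN_RGB
```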
8 Related work
Image labeling and captioning Our work builds on recent progress in the development of vision models that involve some amount of language data, including object categorization [38, 54] and image captioning [19, 45, 49]. Unlike object categorization, which pre-specifies a fixed set of class labels onto which all images must project, our systems use open-ended, referential language. And unlike systems that caption a single image (or entity therein) in isolation, our systems, similarly to other recent work in image captioning [29, 32, 52, 43, 27, 26, 51], learn how to communicate across diverse communication contexts.
Reference games In our work we use reference games in order to operationalize the demand to be relevant in context. The basic arrangement of such games can be traced back to the language games explored by Wittgenstein and Lewis. For decades, such games have been a valuable tool in cognitive science to quantitatively measure inferences about language use and the behavioral consequences of those inferences [36, 23, 5, 42]. Recently, these approaches have also been adopted as a benchmark for discriminative or context-aware NLP [33, 2, 40, 44, 31, 6, 24].
Rational speech acts framework Our models draw on a recent formalization of human language use, the Rational Speech Acts (RSA) framework. At the core of RSA is the Gricean proposal that speakers are agents who select utterances that are parsimonious yet informative about the state of the world. RSA formalizes this notion of informativity as the expected reduction in the uncertainty of an (internally simulated) listener, as our pragmatic speaker does. The literal listener in RSA uses semantics that measure compatibility between an utterance and a situation, as our baseline listener does. Previous work has shown that RSA models account for context sensitivity in speakers and listeners [14, 31, 53, 11]. Our results add evidence for the effectiveness of this approach in complex domains.
In this paper, we explored models of natural language grounded in the shape of common objects. The geometry and topology of objects can be complex and the language we have for referring to them is correspondingly abstract and compositional. This makes the shape of objects an ideal domain for exploring grounded language learning, while making language an especially intriguing source of evidence for shape variations. We introduced the Chairs-in-Context corpus of highly descriptive referring expressions for shapes in context. Using this data we explored a variety of neural listener and speaker models, finding that the best variants exhibited strong performance. These models draw on both 2D and 3D object representations and appear to reflect human-like part decomposition, though they were never explicitly trained with object parts. Finally, we found that the learned models are surprisingly robust, transferring to real images and to new classes of objects. Future work will be required to understand the transfer abilities of these models and how this depends on the compositional structure they have learned.
The authors wish to acknowledge the support of a Sony Stanford Graduate Fellowship, an NSF grant CHS-1528025, a Vannevar Bush Faculty Fellowship and gifts from Autodesk and Amazon Web Services for Machine Learning Research.
-  P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. J. Guibas. Learning representations and generative models for 3d point clouds. Proceedings of the 35th International Conference on Machine Learning, 2018.
-  J. Andreas and D. Klein. Reasoning about pragmatics with neural listeners and speakers. CoRR, 2016.
-  J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
-  A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An information-rich 3d model repository. CoRR, abs/1512.03012, 2015.
-  H. H. Clark and D. Wilkes-Gibbs. Referring as a collaborative process. Cognition, 22(1):1–39, 1986.
-  R. Cohn-Gordon, N. Goodman, and C. Potts. Pragmatically informative image captioning with character-level reference. CoRR, abs/1804.05417, 2018.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
-  A. Dubrovina, F. Xia, P. Achlioptas, M. Shalah, and L. J. Guibas. Composite shape modeling via latent space factorization. CoRR, abs/1901.02968, 2019.
-  P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010.
-  M. A. Fischler and R. A. Elschlager. The representation and matching of pictorial structures. IEEE Trans. on Computers, 1973.
-  D. Fried, J. Andreas, and D. Klein. Unified pragmatic models for generating and following instructions. CoRR, abs/1711.04987, 2017.
-  E. Gibson, R. Futrell, J. Jara-Ettinger, K. Mahowald, L. Bergen, S. Ratnasingam, M. Gibson, S. T. Piantadosi, and B. R. Conway. Color naming across languages reflects color use. Proceedings of the National Academy of Sciences, 114(40):10785–10790, 2017.
-  N. D. Goodman and M. C. Frank. Pragmatic language interpretation as probabilistic inference. Trends in Cognitive Sciences, 20(11):818 – 829, 2016.
-  C. Graf, J. Degen, R. X. D. Hawkins, and N. D. Goodman. Animal, dog, or dalmatian? level of abstraction in nominal referring expressions. In Proceedings of the 38th Annual Conference of the Cognitive Science Society, 2016.
-  H. P. Grice. Logic and conversation. In P. Cole and J. Morgan, editors, Syntax and Semantics, pages 43–58. Academic Press, New York, 1975.
-  R. X. D. Hawkins. Conducting real-time multiplayer experiments on the web. Behavior Research Methods, 47(4):966–976, 2015.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
-  A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, pages 3128–3137, 2015.
-  S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
-  S. Kirby, M. Tamariz, H. Cornish, and K. Smith. Compression and communication in the cultural evolution of linguistic structure. Cognition, 141:87–102, 2015.
-  R. M. Krauss and S. Weinheimer. Changes in reference phrases as a function of frequency of usage in social interaction: A preliminary study. Psychonomic Science, 1964.
-  A. Lazaridou, K. M. Hermann, K. Tuyls, and S. Clark. Emergence of linguistic communication from referential games with symbolic and pixel input. CoRR, abs/1804.03984, 2018.
-  D. Lewis. Convention: A philosophical study. Harvard University Press, 1969.
-  J. Lu, J. Yang, D. Batra, and D. Parikh. Neural baby talk. CVPR, 2018.
-  R. Luo and G. Shakhnarovich. Comprehension-guided referring expressions. In Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.
-  A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013.
-  J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. CoRR, abs/1511.02283, 2016.
-  T. Miyato, A. M. Dai, and I. Goodfellow. Adversarial training methods for semi-supervised text classification. In International Conference on Learning Representations (ICLR), 2017.
-  W. Monroe, R. X. Hawkins, N. D. Goodman, and C. Potts. Colors in context: A pragmatic neural model for grounded language understanding. CoRR, abs/1703.10186, 2017.
-  V. K. Nagaraja, V. I. Morariu, and L. S. Davis. Modeling context between objects for referring expression understanding. In ECCV, 2016.
-  M. Paetzel, D. N. Racca, and D. DeVault. A multimodal corpus of rapid dialogue games. In Language Resources and Evaluation Conference (LREC), 2014.
-  K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.
-  J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
-  S. Rosenberg and B. D. Cohen. Speakers’ and listeners’ processes in a word-communication task. Science, 1964.
-  S. Shen and H. Lee. Neural attention models for sequence classification: Analysis and application to key term extraction and dialogue act detection. CoRR, abs/1604.00077, 2016.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1), 2014.
-  J.-C. Su, C. Wu, H. Jiang, and S. Maji. Reasoning about fine-grained attribute phrases using reference games. CoRR, abs/1708.08874, 2017.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
-  K. van Deemter. Computational models of referring: a study in cognitive science. MIT Press, 2016.
-  R. Vedantam, S. Bengio, K. Murphy, D. Parikh, and G. Chechik. Context-aware captions from context-agnostic supervision. CoRR, abs/1701.02870, 2017.
-  R. Vedantam, S. Bengio, K. Murphy, D. Parikh, and G. Chechik. Context-aware captions from context-agnostic supervision. In Computer Vision and Pattern Recognition (CVPR), volume 3, 2017.
-  O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2015.
-  R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Comput., 1989.
-  L. Wittgenstein. Philosophical investigations. Macmillan, 1953.
-  Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. S. Corrado, M. Hughes, and J. Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044, 2016.
-  L. Yi, H. Su, X. Guo, and L. J. Guibas. Syncspeccnn: Synchronized spectral CNN for 3d shape segmentation. CoRR, abs/1612.00606, 2016.
-  L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. Mattnet: Modular attention network for referring expression comprehension. In CVPR, 2018.
-  L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In ECCV, 2016.
-  L. Yu, H. Tan, M. Bansal, and T. L. Berg. A joint speaker-listener-reinforcer model for referring expressions. CoRR, abs/1612.09542, 2017.
-  N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based r-cnns for fine-grained category detection. In European conference on computer vision, pages 834–849. Springer, 2014.
-  L. Zhu, Y. Chen, A. Yuille, and W. Freeman. Latent hierarchical structural learning for object detection. CVPR, 2010.
Appendix A
A.1 CiC details
To build the triplets comprising the communication contexts of CiC, we exploited the latent (bottleneck-derived) vector space of a point-cloud based AutoEncoder (PC-AE), trained with chair-only objects of ShapeNet. Concretely, we used a PC-AE with a small bottleneck (64D) to promote meaningful Euclidean distances and, after embedding all ShapeNet chairs in the resulting space, we computed their underlying 2-(Euclidean)-nearest-neighbor graph. On this graph, we selected the chairs with the highest in-degree to ‘seed’ the triplet generation. Each of the 1K seed chairs was grouped with its two nearest neighbors from the entire shape collection to form a Hard triplet, and with the two chairs that were closest to it but also more distant than the median of all pairwise distances to form an Easy triplet. This procedure gives rise to 2,000 communication contexts when target vs. distractor information is ignored. To counterbalance the dataset while annotating these contexts on AMT, we ensured that each chair of a context appeared both as a distractor and as a target, and that each resulting combination was annotated by at least 4 humans. Last, we note that when building the Hard triplets, we applied a manually tuned distance threshold to reject triplets containing objects that were ‘too’ close: we found that about of chairs had a geometric duplicate differing only in its texture.
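The triplet construction above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: function and variable names are ours, and we assume a matrix `Z` whose rows are the PC-AE latent codes of the chairs.

```python
import numpy as np

def build_triplets(Z, n_seeds=2):
    """Form Hard/Easy triplets from latent codes Z (n_chairs x dim).

    Hard: a seed chair with its two nearest neighbors.
    Easy: a seed chair with the two chairs closest to it among those
    farther away than the median of all pairwise distances.
    Seeds are the chairs with highest in-degree in the 2-NN graph.
    Assumes at least two candidates lie beyond the median distance.
    """
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                 # exclude self-distances
    nn2 = np.argsort(D, axis=1)[:, :2]          # 2 nearest neighbors per chair
    in_degree = np.bincount(nn2.ravel(), minlength=len(Z))
    seeds = np.argsort(-in_degree)[:n_seeds]    # highest in-degree chairs
    median = np.median(D[np.isfinite(D)])
    hard, easy = [], []
    for s in seeds:
        hard.append((s, *nn2[s]))
        far = np.where(D[s] > median)[0]        # candidates beyond the median
        closest_far = far[np.argsort(D[s, far])[:2]]
        easy.append((s, *closest_far))
    return hard, easy
```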
A.2 Image and point-cloud pre-training
For the listeners and speakers, we trained a PC-AE under the Chamfer loss with a 128D bottleneck and point clouds of 2048 points, extracted uniformly area-wise from the 3D CAD models. We also fine-tuned a VGG-16, pre-trained on ImageNet, on an 8-way classification task with 36,632 rendered images of textureless 3D CAD models, taken from a single viewpoint. Concretely, we used images of the 8 largest object classes of ShapeNet (car, airplane, vessel, sofa, chair, table, lamp, rifle) and a uniformly random i.i.d. split of [90%, 5%, 5%] for train/test/val purposes. We fine-tuned the network for 30 epochs: during the first 15 epochs we optimized only the weights of the last (fc8) layer, and during the last 15 epochs the weights of all layers. The attained test classification accuracy was . Last, to embed an image for the downstream listening/speaking tasks, we used the 4096D output activations of the penultimate (fc7) fully-connected layer.
A.3 Pre-processing utterances
We preprocessed the collected human utterances by i) lowercasing, ii) tokenizing by splitting off punctuation, iii) splitting superlative or comparative adjectives ending in -er or -est into their stem word and suffix, e.g. ‘thinner’: [‘thin’, ‘er’], and iv) replacing tokens that appear once or not at all in a training split with a special symbol marking an unknown token (<UNK>). Furthermore, we ignored utterances comprising more than 33 tokens (99th percentile) and those for which the human listener in the underlying trial did not correctly guess the target. Last, we concatenated listener and speaker utterances from the same trial (in their order of formulation) by appending to each utterance but the last a special symbol marking a dialogue (<DIA>), e.g. [‘the’, ‘thin’, ‘chair’, <DIA>, ‘yes’].
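A minimal sketch of this preprocessing pipeline. The tokenizer and the -er/-est stemming heuristic below are our illustrative approximations, not the exact scripts used:

```python
import re

UNK, DIA = '<UNK>', '<DIA>'

def preprocess(utterance, vocab, max_len=33):
    """Lowercase, split off punctuation, split -er/-est comparatives into
    stem + suffix when the stem is a known word, and map out-of-vocabulary
    tokens to <UNK>. Returns None for over-long utterances."""
    tokens = re.findall(r"[a-z0-9]+|[^\w\s]", utterance.lower())
    out = []
    for t in tokens:
        m = re.fullmatch(r"([a-z]+?)(er|est)", t)
        if m:
            stem, suffix = m.groups()
            # undo consonant doubling, e.g. 'thinn' -> 'thin'
            if stem not in vocab and len(stem) > 1 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            if stem in vocab:
                out.extend([stem, suffix])
                continue
        out.append(t if t in vocab else UNK)
    return out if len(out) <= max_len else None

def join_dialogue(utterances):
    """Concatenate same-trial utterances, inserting <DIA> between them."""
    out = []
    for i, u in enumerate(utterances):
        out.extend(u)
        if i < len(utterances) - 1:
            out.append(DIA)
    return out
```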
A.4 Listener details
For the listeners we used a uni-directional LSTM cell with hidden units, the output of which was passed into a 3-layer MLP with [100, 50, 3] neurons that predicted the triplet’s classification logits. Batch normalization and a ReLU non-linearity were applied to the output of each hidden layer of the MLP. The listeners’ word embedding was initialized with a GloVe embedding pre-trained on the 6B-token Wikipedia 2014 corpus, and was further fine-tuned during training. When only one geometric modality was used, the PC-AE (128D) or VGG (4096D) latent vectors encoding each object were passed as input to the LSTM. When the two modalities were used together, the PC-AE codes were concatenated with the output of the LSTM, and the concatenated result was processed by the final MLP. In either case, we first re-embedded these geometric codes (100D) with two separate single FC-ReLU layers (referred to as ‘projection’ layers in Main Paper Section 3). An overview of the proposed listener reflecting the overall design choices is given in Fig. 5. We used dropout with 0.5 keep probability before the ‘projection’ layers, with a dropout mask shared across the objects of a given triplet. Separate dropout with 0.5 keep probability was applied to all input vectors of the LSTM (i.e. the language tokens and the grounding geometric codes). Last, the ground-truth indicator vectors of each triplet were label-smoothed by assigning 0.933 probability mass to the target and 0.0333 to each distractor (i.e. smoothing of 0.9).
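The label-smoothing scheme here is the standard one (Szegedy et al.); a sketch consistent with the 0.933/0.0333 figures above (function name is ours):

```python
def smooth_labels(target_idx, n=3, smoothing=0.9):
    """Label-smoothed indicator over a triplet: the leftover (1 - smoothing)
    mass is spread uniformly over all n objects, on top of `smoothing`
    for the target (~0.933 / ~0.0333 each, for smoothing=0.9 and n=3)."""
    eps = (1.0 - smoothing) / n
    dist = [eps] * n
    dist[target_idx] = smoothing + eps
    return dist
```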
Label smoothing yielded a mild performance boost of across all ablated listener architectures, in accordance with previous work. We note that we did not manage to improve the best attained accuracies by applying layer normalization in the LSTM, or adversarial regularization on the word embedding. Dropout was by far the most effective form of regularization for our listeners ([8-9]%), followed by weight regularization of the projection layers ([2-3]%). Finally, using a separate MLP to process the PC-AE codes was slightly better than feeding them directly into the LSTM (after the tokens of each utterance were processed). However, grounding the LSTM with the PC-AE codes and using the VGG codes at the end of the pipeline (either via pre-MLP concatenation or by feeding the latter into the LSTM) significantly deteriorated all attained results.
We ablated three architectures that simultaneously used images and point clouds, word attention, and different degrees of context (see Main Paper Section 3). The optimal hyper-parameters (HP) for each architecture are shown in Table 6. We did a grid search over the space of HP associated with each architecture separately. To circumvent the exponential growth of this space, we searched it in two phases. First, we optimized the learning rate (in the regime of [0.0001, 0.0005, 0.001, 0.002, 0.004, 0.005]) in conjunction with the dropout (keep probability) applied at the LSTM’s input, in the range [0.4-0.7] with increments of 0.05. Given the acquired optimal values, we then searched for the optimal weight regularization (in the range of [0.005, 0.01, 0.05, 0.1, 0.3, 0.9]) applied at the two projection layers, and label smoothing ([0.8, 0.9, 1.0]). For these experiments we used a single random seed to control for the data splits of the object-generalization task. We note that for the Early-Context listener, using a single 1D convolutional layer to extract the grounding vector of each object produced better results than using a single FC layer (or deeper alternatives). This single convolutional layer converted the input signal into an LSTM-grounding vector for each object, with an kernel and stride.
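The two-phase search can be sketched as follows, assuming a user-supplied `evaluate` function that trains a model with the given hyper-parameters (applying its own defaults for keys absent in phase one) and returns validation accuracy. An illustrative sketch, not the authors' tooling:

```python
from itertools import product

def two_stage_search(evaluate):
    """Two-phase grid search as described: first jointly tune the learning
    rate and the LSTM-input dropout keep probability; then, with those
    fixed, tune weight regularization and label smoothing.
    `evaluate` maps a dict of hyper-parameters to validation accuracy."""
    # Phase 1: learning rate x input-dropout keep probability.
    lrs = [0.0001, 0.0005, 0.001, 0.002, 0.004, 0.005]
    keeps = [0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70]
    best = max(({'lr': lr, 'keep': k} for lr, k in product(lrs, keeps)),
               key=evaluate)
    # Phase 2: weight regularization x label smoothing, given phase-1 optima.
    wregs = [0.005, 0.01, 0.05, 0.1, 0.3, 0.9]
    smooths = [0.8, 0.9, 1.0]
    return max(({**best, 'wreg': w, 'smooth': s}
                for w, s in product(wregs, smooths)), key=evaluate)
```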
We trained the Baseline and the Combined-Interpretation listeners for epochs and the Early-Context for . This was sufficient, as more training increased overfitting without improving the attained test/val accuracies. We halved the learning rate every 50 epochs if the validation error had not improved in any of them; specifically, every 5 epochs we evaluated the model on the validation split in order to select the epoch/weights with the best accuracy. Because the Combined-Interpretation listener is sensitive to the input order of the object codes, we randomly permuted them during training. We used the ADAM optimizer () for all experiments.
A.5 Speaker details
To find good model parameters for an image-based speaker, we conducted a hyper-parameter search on a literal variant. Similarly to the ablations of the listener variants, we conducted a two-stage grid search with a single random seed and the object-generalization task. At the first stage, we searched models varying: a) the number of hidden neurons of the LSTM ( or ), b) the initial learning rate ([0.0005, 0.001, 0.003]), c) the dropout keep probability applied to the word embeddings ([0.8, 0.9, 1.0]), and d) the dropout keep probability applied at the LSTM’s output ([0.8, 0.9, 1.0]). The two best-performing models were further optimized by considering L2 weight regularization applied at the FC-projection layer (with values in [0, 0.005, 0.01]) and the dropout keep probability applied before the FC-projection layer ([0.5, 0.7, 0.9, 1.0]). The resulting optimal parameters are reported in Table 7.
| LSTM Size | Learning rate | L2-reg. | Word-Dropout | Image-Dropout | LSTM-out Dropout |
For the point-based speaker, we performed a similar but more constrained hyper-parameter search, again on its literal variant. Here, we fixed the dropout applied to the word embeddings and to the LSTM’s output (0.8 and 0.9 keep probability, respectively) and ablated the remaining hyper-parameters as we did for the image-based speaker. We found the same configuration of parameters (Table 7) to be optimal for point-based models as well, with the exception of the dropout applied to the PC-AE codes before the FC-projection (no dropout at all was best in this case). Also, the point-based speakers needed more training to converge than the image-based ones (maximally 400 epochs vs. 300).
To do model selection for a speaker during training, we used a pre-trained listener (with the same train/test/val splits) to evaluate the synthetic utterances produced by the speaker. To this end, every 10 epochs of training the speaker generated 1 utterance for each unique triplet in the validation set via greedy (arg-max) sampling, and the listener reported the accuracy of predicting the target given the synthetic utterance. At the end of training (300 epochs for image-based speakers vs. 400 for point-based ones), the epoch/model with the highest accuracy was selected.
We initially used GloVe to provide our speakers with pre-trained word embeddings, as in the listener, but found that it was sufficient to train the word embedding from uniformly random initialized weights (in the range [-0.1, 0.1]). We also initialized the bias terms of the linear word-encoding layer with the log probability of the frequency of each word in the training data, which provided faster convergence. We trained with SGD and ADAM () and applied norm-wise gradient clipping with a cut-off threshold of 5.0. The training utterances have a maximal length of 33 tokens (99th percentile of the dataset), and for any speaker we sampled utterances of at most the maximum training length. For the pragmatic speaker we sampled and scored utterances per triplet at test time (following Eq. 1 of the Main Paper).
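A hedged sketch of the sample-and-rerank step of the pragmatic speaker. The precise objective is Eq. 1 of the Main Paper, so the blending weight `beta`, the exponent `alpha`, and the GNMT-style length normalization below should be read as assumptions, and the names are ours:

```python
import math

def pragmatic_rerank(candidates, listener_prob, speaker_logprob,
                     beta=1.0, alpha=0.6):
    """Score each sampled utterance by a listener-awareness term
    (weight `beta`) blended with the speaker's own log-probability,
    normalized by a length penalty (`alpha`); return the best one."""
    def score(u):
        s = beta * math.log(listener_prob(u)) + (1.0 - beta) * speaker_logprob(u)
        return s / (len(u) ** alpha)            # length normalization
    return max(candidates, key=score)
```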
Point-cloud & image-based speaker
In preliminary experiments, we attempted to incorporate both geometric modalities, point clouds and images, in a speaker network, similarly to what we did for the best-performing listener. While this resulted in a (literal) speaker model that achieved a higher neural-listener evaluation accuracy than when either modality was used in isolation, we did not observe any improvement over the image-based speaker in AMT human-listener experiments.
We attempted three ways of ‘mixing’ the two modalities in a speaker. Namely, for each object of a communication context: a) providing the LSTM with the concatenation of its projected VGG code and its projected PC-AE code, b) same as a) but using the sum operator instead of concatenation, and c) first providing its projected PC-AE code, followed at the next time step by its VGG one. We compared these approaches using the optimal hyper-parameters of the image-based speaker, varying only the amount of dropout applied to the point cloud before the projection layer ([1.0, 0.8, 0.6] keep probability). In all cases, avoiding dropout was best. The final results for a single random seed and the object-generalization task are reported in Table 8. We note that while the optimal speaker using two modalities performed slightly better than the image-based speaker per neural-listener evaluation, it did not improve performance in preliminary experiments with human listeners on AMT.
A.6 Further quantitative results
A.6.1 Listeners: context incorporation
In Table 9 we complement the results presented in Table 1 of the Main Paper by including two more sub-populations (‘Negative’ and ‘Split’). In Table 10, we repeat this study for listeners trained and tested on the language-generalization task. ‘Negative’ is a sub-population of utterances that contain at least one word with negative content, e.g. ‘not’, ‘but’, etc., and comprises of all test utterances. ‘Split’ is a smaller sub-population ( of test data) that includes language explicitly contrasting the target with the distractors, e.g. ‘from the two that have thin legs, the one…’. We used an ad hoc set of search queries to find such utterances in the test set and found that the Early-Context architecture does perform noticeably better on them. However, given the low occurrence of such cases, the resulting effects were not significant, and we decided the gains of the Early-Context architecture were not worth the increase in model complexity and rigidity with respect to context size.
A.6.2 Listeners: part-lesion
| Single Part Lesioned | Single Part Present |
We complement Table 3 of the Main Paper with a similar study (Table 11), where we ablate our neural listeners with regard to their sensitivity to referential utterances based on object parts, when both geometric modalities are used. We have observed that the PC-AE attempts to reconstruct (decode) noisy but complete models even when the input is partial, which could explain the gains seen in Table 11 compared to Table 3 when lesioning parts.
| Class   | entire | with part   | without part |
|---------|--------|-------------|--------------|
| chair   | 7.1    | 8.0 (77%)   | 4.7 (21%)    |
| bed     | 6.4    | 7.0 (26%)   | 5.3 (48%)    |
| lamp    | 7.3    | 11.0 (20%)  | 5.9 (37%)    |
| sofa    | 10.1   | 11.0 (72%)  | 5.9 (15%)    |
| table   | 6.6    | 8.0 (40%)   | 4.9 (42%)    |
| average | 7.6    | 9.3 (39.5%) | 5.5 (35.5%)  |
A.6.3 Speakers: length penalty and listener awareness
To find the optimal length-penalty value (, Main Paper Eq. 1) for the image-based literal and context-unaware speaker variants, we used our best-performing listener to simultaneously score and evaluate the utterances produced by the speakers for different values of (Fig. 5(a)). The best-performing length penalty for a context-unaware speaker is , and for a literal one . Given the optimal values, we show for these models the effect of using different degrees of listener-awareness () in Fig. 5(b). It is interesting to observe that even the context-unaware speaker can generate utterances that an evaluating listener finds highly discriminative, as long as the listener is allowed to rank them.
In Fig. 7 we demonstrate the effect that the relative (training) size of the evaluating listener vs. the ‘internal’ listener used by a pragmatic speaker has on the evaluation accuracy, for two values of . In either case we observe a slow decline in evaluation accuracy as the training size of the evaluating listener increases (from 0.5 to 0.9) and, consequently, the training size of the ‘internal’ listener decreases (from 0.5 to 0.1).
A.6.4 Understanding out-of-class reference
We complement Table 5 of the Main Paper with the standard deviations of the underlying accuracies in Table 13. We also report simple statistics regarding the underlying transfer classes in Table 12. We note that the transfer-learning accuracies acquired by listeners operating with both point clouds and images were significantly lower in these experiments ( on average). We hypothesize that this is because our (chair-trained) listener models that utilize point clouds rely on a pre-trained single-class PC-AE, unlike the pre-trained VGG (image encoder), which was fine-tuned with multiple ShapeNet classes. Also, for these experiments, [~1%, ~7%] (depending on the transfer class) of the tokens were not in the chair vocabulary, and we chose to ignore them, i.e. treat them as white-space. Last, per Table 12, in all transfer classes the with-part population contains considerably longer utterances than the without-part population (9.3 vs. 5.5 tokens on average), and even in the case of lamps, arguably the category most dissimilar from chairs, of the collected utterances are in the known population.
Each game consisted of 69 trials (unique triplets), and participants swapped speaker and listener roles at the conclusion of each trial. The game’s interface is depicted in Figure 12. Participants were allowed to play multiple games, but most (81%) played exactly one. The most distinctive words in each triplet type, as measured by point-wise mutual information, are shown in Table 14.
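The point-wise mutual information computation behind Table 14 can be sketched as follows. This is an illustrative reconstruction (unsmoothed counts; function names are ours), where PMI(w, t) = log p(w, t) / (p(w) p(t)):

```python
import math
from collections import Counter

def distinctive_words(utterances_by_type, k=3):
    """Top-k words per triplet type, ranked by point-wise mutual
    information between word w and type t over the token counts."""
    joint = Counter()
    for t, utts in utterances_by_type.items():
        for u in utts:
            for w in u:
                joint[(w, t)] += 1
    total = sum(joint.values())
    p_w, p_t = Counter(), Counter()                # marginal counts
    for (w, t), c in joint.items():
        p_w[w] += c
        p_t[t] += c
    out = {}
    for t in utterances_by_type:
        pmi = {w: math.log((joint[(w, t)] / total) /
                           ((p_w[w] / total) * (p_t[t] / total)))
               for w in p_w if joint[(w, t)] > 0}  # skip unseen pairs
        out[t] = sorted(pmi, key=pmi.get, reverse=True)[:k]
    return out
```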