From sense to reference
Reference is a crucial property of language that allows us to connect linguistic expressions to the world. Modeling it requires handling both continuous and discrete aspects of meaning. Data-driven models excel at the former, but struggle with the latter, and the reverse is true for symbolic models. This paper (a) introduces a concrete referential task to test both aspects, called cross-modal entity tracking; (b) proposes a neural network architecture that uses external memory to build an entity library inspired by the DRSs of DRT, with a mechanism to dynamically introduce new referents or add information to referents that are already in the library. Our model shows promise: it beats traditional neural network architectures on the task. However, it is still outperformed by Memory Networks, another model with external memory.
Language combines discrete and continuous facets, as exemplified by the phenomenon of reference (Frege, 1892; Abbott, 2010): When we refer to an object in the world with the noun phrase the mug I bought, we use content words such as mug, which are notoriously fuzzy or vague in their meaning (Van Deemter, 2012; Murphy, 2002) and are best modeled through continuous means (Boleda and Herbelot, 2016). Once the referent for the mug has been established, however, it becomes a linguistic entity that we can manipulate in a largely discrete fashion, retrieving it and updating it with new information as needed (Remember the mug I bought? My brother stole it! Kamp and Reyle, 1993). Put differently, managing reference requires two distinct abilities:
The ability to categorize, that is, to recognize that different entities are equivalent with regard to some concept of interest (e.g. two mugs, two instances of the “things to take on a camping trip” category; Barsalou, 1983). This implies being able to aggregate seemingly diverse objects.
The ability to individuate, that is, to keep entities distinct even if they are similar with regard to many attributes (e.g. two pieces of pink granite that were collected in different national parks). This implies being able to keep seemingly similar things apart.
Data-driven, continuous models are very good at categorizing, but not at individuating, and the reverse holds for symbolic models (Boleda and Herbelot, 2016). Our long-term research goal is to build a continuous computational model of reference that emulates discrete referential mechanisms such as those defined in DRT (Kamp and Reyle, 1993); here we present initial work towards that goal, with two specific contributions.
Our first contribution is an experimental task (and associated dataset), cross-modal entity tracking, that tests the ability of computational models to refer successfully in a setting where they are required to both categorize and individuate entities. The task presents different entities (represented by pictures) repeatedly, each time with a different, linguistically conveyed attribute (e.g. a given mug is presented once with the attribute bought and once with stolen). The category label (“mug”) is not given at exposure time. The task is to choose the picture of the entity that corresponds to a linguistic query that combines category information with attribute information (e.g. simulating “the mug that was bought and stolen”), among the set of all the entities presented in a given sequence. The sequences in each datapoint of our dataset contain confounders that make the task challenging: Other entities with the same category but only one matching attribute (e.g. a different mug that was bought and stored), and other entities with the same attributes but a different category (e.g. a chair that was bought and stolen). Therefore, the task requires models to 1) correctly categorize entities, recognizing which images belong to the category in the query (something that is hard for symbolic models), 2) individuate and track them, being able to distinguish among different entities based on visual and linguistic cues provided at different time steps (something that is hard for continuous models).
In DRT terms (Kamp and Reyle, 1993), each entity exposure either introduces a new discourse referent or updates the representation of an old referent with new information. To solve the task successfully, the model needs to decide, for each incoming exposure, whether to aggregate it with a previously known referent (in DRT, this means introducing an equation between two referents), or to treat it as a new referent.
Our second contribution is a neural network architecture with a module for referent representations: DIstributed model of REference, DIRE. DIRE uses the concept of external memory from deep learning (Joulin and Mikolov, 2015; Graves et al., 2016) to build an entity library for an exposure sequence that conceptually corresponds to the set of DRT discourse referents, using similarity-based reasoning on distributed representations to decide between aggregating and initializing entity representations. In contrast to symbolic implementations of DRT (Bos, 2008), which manipulate discourse referents on the basis of manually specified algorithms, DIRE learns to make these decisions directly from observing reference acts, using end-to-end training. We see our paper as a first, modest step in the direction of data-driven learning of DRT-like behavior, and we are of course still far from learning anything resembling a fully fledged DRT system.

Imagine an office, with a desk where there are three mugs and other objects. Adam tells Barbara that he just bought two of the mugs and that he particularly likes the one on the right. Later they are in the kitchen, and Adam, busy preparing coffee, asks Barbara: "Remember the mugs I bought? Could you please bring the one I like?". To pick the right mug from the office, Barbara must correctly categorize the objects on the desk (identify which ones are mugs) and individuate them via their properties (singling out the one Adam is asking for). She must also combine visual and linguistically conveyed properties of the objects: Visual properties tell her which ones are mugs, and the properties that Adam told her about help her pick the right one. Our cross-modal entity tracking task emulates this kind of situation. Our current study uses a simplified version of the task that allows us to carefully control all the variables involved.
We operationalize the task as one of pointing to real-life pictures of objects. Figure 1 shows a simplified example. We sample six entities belonging to two categories (in the example, where only three entities are shown, barkeepers and soldiers). Each entity is represented by one image (that is, barkeeper A is always represented by the same image). We also sample different attributes, which are compatible with both categories (in the example, “instructed”, “evaluated”, “amused”). In the exposure phase, we present each entity (image) twice at different time steps, each time with one of the attributes. In this phase, the category of the entity in the image is not given to the model, only the images are. At query time, we use a linguistic query with one category (e.g., “barkeeper”) and two attributes (e.g., “instructed and evaluated”). The task is to retrieve the image of the entity that corresponds to the query. To solve it, it is not enough to rely on categorization or object labeling (in the actual task, there are always three entities belonging to the category in the query), nor is it enough to rely on attribute information (there will always be three entities for each attribute, and two for the combination of attributes in the query). Note that one important simplification we make, with respect to a real-life scenario, is that an entity is always represented by the very same image. The current setup is nevertheless already very challenging for current models, as the experiments below will show. Indeed, to succeed in the task a model must correctly associate the category in the query with images of the right object, it must develop a mechanism to index entities based on the images representing them, and it must learn to correctly accumulate over time the different attributes to be stored with each entity.
The task is related to coreference resolution (see Poesio et al., 2017, for a recent survey), but focuses on identifying language-external objects from images rather than mentions of a referent in text; to Visual Question Answering (Antol et al., 2015), but it cannot be solved with visual information alone; and to Referring Expression Generation (Krahmer and Van Deemter, 2012), but involves identification rather than generation.
We have constructed a dataset for the task containing 40k sequences for training, 5k for validation, and 10k for testing (available at http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/dire).
It is assembled on the basis of 2k object categories with 50 ImageNet (http://imagenet.stanford.edu) images each, sampled from a larger dataset (Lazaridou et al., 2015). These are natural images, which makes the task challenging. The object categories given in the queries are those specified in ImageNet.

We build a set of linguistic attributes for each object by first extracting the 500 most associated, and thus plausible, syntactic neighbors for the category according to the DM resource (Baroni and Lenci, 2010). This excludes nonsensical combinations such as repair:dog. We further retain only (relatively) abstract verbs taking the target item as direct object (we use the base form of verbs rather than the past participle for simplicity). This is because (a) concrete verbs are likely to have strong visual correlates that could conflict with the image (cf. walk dog); and (b) referential expressions routinely mix concrete and abstract cues successfully (e.g., the dog I own). We remove all verbs with a score over 2.5 (on a 1–5 scale) in the concreteness norms of Brysbaert et al. (2014).
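As an illustration of this attribute-selection step, the sketch below filters candidate verbs by concreteness. The file name, column names, and the `dm_neighbors` structure are hypothetical stand-ins for the DM resource and the Brysbaert et al. (2014) norms, not the actual preprocessing code.

```python
import csv

# Hypothetical stand-in for the Brysbaert et al. (2014) norms: a TSV with a
# word column and a mean-concreteness column on a 1-5 scale.
def load_concreteness(path="concreteness_norms.tsv"):
    norms = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            norms[row["Word"]] = float(row["Conc.M"])
    return norms

def abstract_attributes(dm_neighbors, norms, threshold=2.5):
    """Keep only verbs rated at most `threshold` for concreteness.

    `dm_neighbors` maps each category noun to the verbs most associated
    with it as a direct object (as extracted from the DM resource).
    Verbs missing from the norms are treated as concrete and dropped."""
    return {noun: [v for v in verbs if norms.get(v, 5.0) <= threshold]
            for noun, verbs in dm_neighbors.items()}
```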
We then construct each sequence as follows. First, we sample two random categories, and three random entities (distinct images) for each category (total: six entities). We then sample three attributes compatible with both categories, giving us three attribute sets of size two (a1+a2, a1+a3, a2+a3) to associate with the entities. We create a completely balanced set of exposures by randomly pairing up each of the three entities of each category with each of the three attribute sets. Since this process gives us two exposures for each entity (one with the first attribute, one with the second), it yields a sequence of twelve exposures. The query is a random combination of a category and two attributes, guaranteed to match exactly one entity.
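A minimal sketch of this sampling procedure follows. The `images` and `attributes` dictionaries and the output field names are illustrative assumptions, not the released generation code.

```python
import random

def build_sequence(images, attributes):
    """Sample one datapoint: 12 exposures, a query, and the target image."""
    cat_a, cat_b = random.sample(list(images), 2)
    # Three attributes compatible with both categories -> three attribute pairs.
    a1, a2, a3 = random.sample(list(set(attributes[cat_a]) & set(attributes[cat_b])), 3)
    attr_sets = [(a1, a2), (a1, a3), (a2, a3)]

    exposures, pairing = [], {}
    for cat in (cat_a, cat_b):
        # Three distinct entities (images) per category, each paired at random
        # with one of the three attribute pairs.
        for img, attrs in zip(random.sample(images[cat], 3), attr_sets):
            pairing[(cat, attrs)] = img
            for attr in attrs:                    # two exposures per entity
                exposures.append({"image": img, "attribute": attr})
    random.shuffle(exposures)                     # 12 exposures in random order

    # Query: one category plus one attribute pair; exactly one entity matches.
    q_cat, q_attrs = random.choice((cat_a, cat_b)), random.choice(attr_sets)
    return {"exposures": exposures,
            "query": {"category": q_cat, "attributes": list(q_attrs)},
            "target": pairing[(q_cat, q_attrs)]}
```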
The core novelty of our model, DIRE (for DIstributed model of REference), is a method to dynamically construct an entity library, conceptually inspired by 1) the DRSs of DRT (DRSs represent many types of information; as explained above, here we focus on entity-related information), and 2) Joulin and Mikolov (2015) and Graves et al. (2016), who simulate discrete memory-building operations in a differentiable continuous setup. (While we developed DIRE, Henaff et al. (2016) proposed a similar architecture; we leave a comparison to future work.) The model is a feed-forward network enhanced with a dynamic memory (the entity library), as well as mechanisms to interact with it.
The entity library is updated after reading an input exposure by either creating a new entity slot for the exposure, or adding the exposure contents to an existing entity slot. This decision is based on the similarity between the current input and the entities already in the library. This generic mechanism (Section 3.1) can be applied in any setting that accumulates information about entities over time. We explain how we use it for our cross-modal tracking task in Section 3.2.
Figure 2 depicts the entity library building mechanism. The input to the model is a sequence of exposures represented by vectors $x_1, \ldots, x_n$. At the $i$-th exposure, the entity library $E_{i-1}$ is updated to state $E_i$ as follows. The first exposure vector is added to the entity library as is (Equation 1). For $i > 1$, we obtain a similarity profile $s_i$ by taking the dot product of $x_i$ with the entity vectors in the library (Equation 2; note that $s_i$ has $i-1$ dimensions):

$$E_1 = [\,x_1\,] \qquad (1)$$
$$s_i = E_{i-1}\, x_i \qquad (2)$$

The maximum similarity to an existing entity, $\max(s_i)$, cues whether $x_i$ is an instance of an entity that has already been encountered before. We transform $\max(s_i)$ into $p^{old}_i$, the probability that exposure $x_i$ corresponds to an "old" entity, as follows (with the scalar weight $a$ and bias $b$ shared across all exposures with $i > 1$):

$$p^{old}_i = \sigma\big(a \cdot \max(s_i) + b\big) \qquad (3)$$

The entity library is updated by "soft insertion" (Joulin and Mikolov, 2015) of the current exposure vector into the library. Concretely, we add the vector $x_i$ to each entity in the library, weighted by the probability that the current exposure is an instance of that entity. For the existing entities, this probability is obtained by distributing the $p^{old}_i$ mass across them, according to their probability of being the matching entity, conditional on the exposure being old. The latter probability is estimated by softmax-normalizing the similarity profile $s_i$ from above. The probability that $x_i$ is new is simply $1 - p^{old}_i$. This results in the following distribution, where $[\,\cdot\,;\,\cdot\,]$ stands for concatenation:

$$d_i = \big[\, p^{old}_i \cdot \mathrm{softmax}(s_i) \;;\; 1 - p^{old}_i \,\big] \qquad (4)$$

Note that $d_i$ has one value more than the current number of stored entities, expressing the probability that the current exposure instantiates a new entity. The entity library is then updated as:

$$E'_i = [\, E_{i-1} \,;\, \vec{0}\, ] \qquad (5)$$
$$E_i = E'_i + d_i\, x_i^{\top} \qquad (6)$$

Thus, we insert a zero vector of the same dimensionality as the exposure vectors at the end of the library, initializing a blank slot to store a new entity, and then add $x_i$ to each slot in proportion to $d_i$. As a consequence, the library in its end state will always contain as many entity vectors as exposures. However, we expect the slots inserted for exposures of old entities (that is, when $p^{old}_i$ is close to 1) to remain near zero, and to be removable from the library along the lines of Graves et al. (2016).
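A minimal numpy sketch of this update mechanism, following the notation above (the scalars `a` and `b` correspond to the parameters of Equation 3; the values in the usage comment are illustrative, not learned):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def update_library(E, x, a, b):
    """Soft-insert exposure vector `x` into entity library `E` (one row per slot)."""
    if E is None:                                    # first exposure: store as is (Eq. 1)
        return x[np.newaxis, :]
    s = E @ x                                        # similarity profile (Eq. 2)
    p_old = sigmoid(a * s.max() + b)                 # P(exposure is "old") (Eq. 3)
    d = np.concatenate([p_old * softmax(s), [1.0 - p_old]])   # Eq. 4
    E = np.vstack([E, np.zeros_like(x)])             # blank slot for a new entity (Eq. 5)
    return E + np.outer(d, x)                        # weighted insertion of x (Eq. 6)

# Example usage: process a sequence of exposure vectors.
# E = None
# for x in exposure_vectors:
#     E = update_library(E, x, a=5.0, b=-2.0)
```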
We use DIRE for the cross-modal entity tracking task as follows (see Figure 3). Given pre-trained image and verbal attribute representations, we first derive a multimodal representation for each exposure and update the entity library as explained in Section 3.1. The linguistic query is mapped to the same multimodal space where entities live, and the most relevant entity is retrieved. Finally, the images the model has to choose from (candidate set) are also mapped to multimodal space, and the correct answer is picked based on their similarity with the retrieved entity. We share the same projection across all images (in the exposures as well as in the candidate set at query time), a single projection for the verbal attributes (in the exposures and in the query), and a matrix for the category name in the query. Details on each of the steps follow.
Exposures are linearly mapped to a multimodal space combining visual and linguistic information: we build the exposure vector $x_i$ by separately embedding its image vector $v_i$ with a matrix $V$ for images (of size $k \times d_v$, where $d_v$ is the size of the image vector and $k$ the multimodal dimensionality) and its attribute vector $a_i$ with a matrix $A$ for linguistic attributes (of size $k \times d_a$, where $d_a$ is the size of the attribute vector), and adding up the result (Equation 7). Both matrices, $V$ and $A$, are learned. Storage takes place by feeding the $x_i$ vectors sequentially to the entity library (cf. Section 3.1).

$$x_i = V v_i + A a_i \qquad (7)$$
To select the best entity match for the query, we compute a "soft retrieval" operation inspired by Sukhbaatar et al. (2015). To query the entity library, we first map the query (a linguistic referring expression, consisting of one noun and two attributes) to multimodal space. We embed the attribute vectors $a_{q1}$ and $a_{q2}$ with the matrix $A$ learned during storage and the noun vector $n_q$ with a matrix $N$, and we sum the result (Eq. 8). The query vector $q$ lives in the same space as the entity vectors, which enables similarity computations.

$$q = A a_{q1} + A a_{q2} + N n_q \qquad (8)$$
We then retrieve the entity representation that matches the query by first computing the similarity of the mapped query to each entity vector in the final library $E_n$ through a normalized dot product, obtaining a weight vector $w_q$ (Eq. 9), and then using those similarities as weights to perform a "soft retrieval" of the entity that best matches the query, summing up the vectors in the entity library multiplied by $w_q$ (Eq. 10). Note that if only one entity is significantly similar to the query (so that the corresponding entry in the similarity profile tends to 1, while all other entries tend to 0), this is equivalent to retrieving that entity.

$$w_q = \mathrm{softmax}(E_n\, q) \qquad (9)$$
$$r = w_q^{\top} E_n \qquad (10)$$
Finally, we use the retrieved entity representation, $r$, to pick among the images that represent the entities. We map the candidate image vectors to multimodal space using the same visual matrix $V$ as above. We compare $r$ with each of the mapped images using a dot product, again obtaining a similarity profile, which we softmax-normalize to obtain the final probability distribution over candidates; the answer is the image corresponding to the argmax of this distribution. Note that we need a probability distribution because we use a cross-entropy cost function when training the model.
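Continuing the numpy sketch above, a minimal illustration of query mapping, soft retrieval, and candidate scoring (the matrix names $V$, $A$, $N$ follow the notation in the text; this is a sketch of the forward pass under those assumptions, not the trained model):

```python
import numpy as np

# Reuses `softmax` from the storage sketch above.
def answer_query(E, noun_vec, attr_vecs, candidate_imgs, V, A, N):
    """Pick the candidate image that matches the query, given entity library `E`."""
    # Map the query (one noun, two attributes) to multimodal space (Eq. 8).
    q = N @ noun_vec + sum(A @ av for av in attr_vecs)
    # Soft retrieval: similarity-weighted sum of the entity vectors (Eqs. 9-10).
    w_q = softmax(E @ q)
    r = w_q @ E
    # Map the candidate images with the same visual matrix V and score them
    # against the retrieved entity; training applies cross-entropy to `probs`.
    cands = np.stack([V @ img for img in candidate_imgs])
    probs = softmax(cands @ r)
    return int(np.argmax(probs))
```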
The whole architecture is differentiable, allowing end-to-end training by gradient descent; in particular, the cross-modal mapping is learned as the model learns to refer. (Note that the input vectors for images are only visual, and those for nouns and attributes are only textual.) At the same time, the model emulates discrete-like operations such as insertion and retrieval of entity representations, which, in frameworks such as DRT, are performed entirely in symbolic terms and are manually coded in the DRT system Boxer (Bos, 2008). This has the advantage that the entity representations can be continuous, enabling their matching with continuous representations of language as well as cross-modal reasoning (for instance, using cup for something that a different speaker calls mug, or mixing visual and linguistically conveyed information). The model is rather parsimonious, with parameters limited to three mapping matrices ($V$, $A$, $N$) and the weight and bias terms for $p^{old}_i$ (Equation 3).
Images are represented by 4096-dimensional vectors produced by passing images through the pre-trained VGG 19-layer CNN of Simonyan and Zisserman (2015) (trained on the ILSVRC-2012 data) and extracting the corresponding activations on the topmost fully connected layer (we use the MatConvNet toolkit, http://www.vlfeat.org/matconvnet). Linguistic representations are given by 400-dimensional cbow embeddings from Baroni et al. (2014), trained on about 2.8 billion tokens of raw text. We map to a 1K-dimensional multimodal space. The parameters of DIRE are estimated by stochastic gradient descent with 0.09 learning rate, 10 minibatch size, 0.5 dropout probability, and maximally 150 epochs (here and below, hyperparameter values as in Baroni et al., 2017).

As competitors, we train standard feed-forward (FF) and recurrent (RNN) networks which have no external memory, using two 300-dimensional hidden layers and sigmoid nonlinearities. We also implement the related Memory Network model (MemN; Sukhbaatar et al., 2015). Like DIRE, MemN controls a memory structure, but it stores each input exposure separately in the memory. At the same time, MemN can perform multiple "hops" at query time. Each hop consists in soft-retrieving a vector from the memory, where the probing vector is the sum of the input query vector and the vector retrieved in the previous hop (null for the first hop). Conceptually, DIRE attempts to merge different instances of the same entity at input processing time, whereas MemN stores each piece of input separately and aggregates relevant information at query time. MemN can thus use the query to guide the search for relevant information; on the other hand, it does not optimize the way in which it stores information in memory. Another difference with DIRE is that MemN uses two sets of mapping matrices: one to derive the vectors used at query time, the other for the vectors used for retrieval. We employ the same hyperparameters for MemN (including the multimodal vector size) as for our model.

Table 1 shows that DIRE outperforms the standard networks (FF and RNN) by a large margin, confirming the importance of a discrete memory structure for reference tracking. If we make the MemN architecture completely comparable to our model (with one matrix and one hop, MemN-1m-1h), our model achieves higher results (0.64 for DIRE-1m vs. 0.59 for MemN-1m-1h), which indicates that the basic architecture of the model holds promise. However, MemN outperforms DIRE when using two matrices or two hops (0.67 for MemN-2m-1h and MemN-1m-2h vs. 0.65 for DIRE-2m), or both (0.69 for MemN-2m-2h). For MemN, this seems to be the upper bound, as increasing to three hops greatly harms results (see the last rows of Table 1).
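For reference, a minimal sketch of the multi-hop soft retrieval used by the MemN competitor described above (`M_in` and `M_out` stand for the two differently mapped copies of the memory; illustrative only, and reusing `softmax` from the earlier sketch):

```python
def memn_retrieve(M_in, M_out, q, hops=2):
    """Sukhbaatar et al. (2015)-style soft retrieval with multiple hops.

    `M_in` and `M_out` hold one separately stored exposure per row;
    `q` is the mapped query vector."""
    u = q
    for _ in range(hops):
        w = softmax(M_in @ u)        # attention over the separately stored exposures
        o = w @ M_out                # soft-retrieved vector for this hop
        u = u + o                    # probe for the next hop: query + retrieved vector
    return u
```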
Further analysis suggests that DIRE successfully addresses the two challenges set out in the introduction: (i) It learns to categorize: Only for 8% of the datapoints does the model pick an image of the wrong category, and these are cases where confounders belong to visually similar or related categories to the target (cottage-chalet, youngster-enthusiast, witch-potion). It is worth noting that the model learns to categorize directly from reference acts: At exposure time, the image is not provided with a category label, so the model needs to induce the category as part of solving the reference task. (ii) DIRE also learns to individuate by combining visual and linguistically-conveyed information: The similarity of the exposure to the query goes to near-zero when the attribute is wrong, even when the category is the same. Together, these two properties make it able to ground linguistic expressions to entities represented in images. However, the entity creation mechanism still needs to be fine-tuned, as currently DIRE creates a new entity vector for nearly every exposure. More work is needed for this crucial part of the model.
Table 1: Results (accuracy) on the cross-modal entity tracking task.

| Model | Accuracy |
|---|---|
| Random baseline | 0.17 |
| FF | 0.27 |
| RNN | 0.28 |
| DIRE-1m | 0.64 |
| DIRE-2m | 0.65 |
| MemN-1m-1h | 0.59 |
| MemN-2m-1h | 0.67 |
| MemN-1m-2h | 0.67 |
| MemN-2m-2h | 0.69 |
| MemN-1m-3h | 0.30 |
| MemN-2m-3h | 0.30 |
Providing a continuous model of reference that can emulate discrete reasoning about entities is an ambitious research programme. We have reported on work in progress on such a model, DIRE. Unlike Memory Networks, and emulating formal approaches such as DRT within an end-to-end neural architecture, DIRE is designed to decide how to store information at input processing time, organizing it by entity in a way that aids further reasoning. Results suggest that merging complementary aspects of DIRE and MemN could be fruitful. We have also presented a new task, cross-modal entity tracking, that tests the categorization and individuation capabilities of computational models, and a challenging dataset for the task.
Our project is related to several areas of active research. Reference is a classic topic in philosophy of language and linguistics (Frege, 1892; Abbott, 2010; Kamp and Reyle, 1993; Kamp, 2015); emulating discrete aspects of language and reasoning through continuous means is a long-standing goal in artificial intelligence (Smolensky, 1990; Joulin and Mikolov, 2015), and recent work focuses on reference (Baroni et al., 2017; Herbelot, 2015; Herbelot and Vecchi, 2015); grounding language in perception (Chen and Mooney, 2011; Bruni et al., 2012; Silberer et al., 2013), as well as reference and co-reference (Krahmer and Van Deemter, 2012; Poesio et al., 2017) are important subjects in Computational Linguistics. Our programme puts these different strands together.
We thank Angeliki Lazaridou for help producing the visual vectors used in the paper. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 715154; AMORE); EU Horizon 2020 programme under the Marie Skłodowska-Curie grant agreement No 655577 (LOVe); ERC 2011 Starting Independent Research Grant n. 283554 (COMPOSES); DFG (SFB 732, Project D10). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of GPUs to U. Trento and U. Pompeu Fabra. This paper reflects the authors’ view only, and the EU is not responsible for any use that may be made of the information it contains.