Contextual Media Retrieval Using Natural Language Queries

02/16/2016 ∙ by Sreyasi Nag Chowdhury, et al. ∙ Max Planck Society

The widespread integration of cameras in hand-held and head-worn devices as well as the ability to share content online enables a large and diverse visual capture of the world that millions of users build up collectively every day. We envision these images as well as associated meta information, such as GPS coordinates and timestamps, to form a collective visual memory that can be queried while automatically taking the ever-changing context of mobile users into account. As a first step towards this vision, in this work we present Xplore-M-Ego: a novel media retrieval system that allows users to query a dynamic database of images and videos using spatio-temporal natural language queries. We evaluate our system using a new dataset of real user queries as well as through a usability study. One key finding is that there is a considerable amount of inter-user variability, for example in the resolution of spatial relations in natural language utterances. We show that our retrieval system can cope with this variability using personalisation through an online learning-based retrieval formulation.




1 Introduction

Due to the widespread deployment of visual sensors in consumer products and Internet sharing platforms, we have collectively achieved a quite detailed visual capture of the world in space and time over the last years. In particular, mobile devices have changed the way we take pictures and new technology like life-logging devices will continue to do so in the future. With efficient search engines at our aid, viewing images and videos of unknown and distant places is just a few clicks away. However, these search engines do not support complex natural language queries that include spatio-temporal references, and they largely ignore the user’s local context.

“What building is to the left of MPI-SWS?”     
“What is in front of bus terminal?”     
“What is near MPI-INF?”     
“What did this place look like in December?”     
Figure 1: Sample queries and retrieved images of our contextual media retrieval system Xplore-M-Ego.

Similar to how mobile devices have changed the way we take pictures, we ask how media search should be transformed to make use of the rich context available at query time. What if we quickly want to know what is behind the building in front of us? What if we want to know what a particular cafe looks like to quickly locate it in a busy market area? What if we want to see what our new neighborhood looks like in winter? Our approach makes use of the user’s ever-changing context to retrieve results of a spatio-temporal query on a mobile device. Because queries are posed in natural language, users can refer directly to their changing environment. We have named our system Xplore-M-Ego (read “Explore Amigo”), which stands for Exploration (Xplore) of Media (M) Egocentrically (Ego).

2 Related Work

Previous research addressing the problems of media retrieval and machine understanding of natural language queries can be broadly classified into three groups.

Spatio-temporal Media Retrieval: Spatio-temporal media retrieval is the browsing of media content captured in different geographical locations at various times in the past. Snavely et al. [17] proposed Photo tourism that constructs a sparse 3D geometric representation of the underlying scene from images. Using this system users could move in 3D space from one picture to another. To address challenges in the construction management industry, Wu and Tory [22] designed PhotoScope. It is an interactive tool to visualize spatio-temporal coverage of photos in a photo collection, which users can browse with space, time and standardized content specifications. Tompkin et al. [20] developed Vidicontexts that embeds videos in a panoramic frame of reference (context) and enables simultaneous visualization of videos in different foci. A similar system, VideoScapes [19] was implemented as a graph with videos as edges and portals (automatically identified transition opportunities) as vertices. When temporal context was relevant, videos were temporally aligned to offer correctly ordered transitions.

In contrast to these methods, our approach implements egocentrism by taking users’ context (geographical location and viewing direction) into account, and interfaces with the users through natural language queries.

Natural Language Query Processing: Successful answering of a natural language question by machines requires understanding its meaning, which is often realizable by a semantic parser that transforms the question into its formal representation. Traditional approaches to semantic parsing used supervised learning by training on questions with costly manually annotated logical forms [23, 21]. Modern approaches use more scalable techniques to train a semantic parser with more accessible textual question-answer pairs [2, 10, 1]. Malinowski and Fritz [14] proposed an architecture for question-answering based on real-world indoor images. They extended the work of Liang et al. [10] to include subjective interpretations of scenes. They also identified challenges that holistic architectures have to face, such as different frames of reference in spatial relations or ambiguities in the answers [14, 15].

Our work differs in that we target a dynamic and egocentric environment in contrast to static geographical/job/image data.

Media Retrieval Using Natural Language Queries: Previous research on media retrieval using natural language queries varies considerably in the methods used to process the natural language utterances. Lum and Kim [11] presented a method that matches semantic network representations of queries with those of natural language descriptions of media data (manually annotated). Kucuktunc et al. [7] proposed a pattern matching approach based on Part-of-Speech (POS) tags. Other approaches are based on RDF-triples [6] and SPARQL queries [4]. Contrary to these research threads, our work does not involve any human annotations or additional processing steps for extracting descriptions of entities from images and videos. Instead, we extract media content simply based on its metadata such as geographical location (GPS coordinates) and textual questions.

Prior research also looked into media retrieval with natural language questions containing spatial relations. Tellex and Roy [18] explored spatial relations in surveillance videos through a classification task that handles two prepositions, “across” and “along”. Lan et al. [8] used structured queries that consist of two objects linked by a spatial relation chosen from a restricted set of spatial prepositions. In contrast, our media retrieval architecture aims to operate on rich natural language questions that are free from artificially imposed restrictions, such as a fixed question structure or a restricted vocabulary.

To the best of our knowledge, none of the previous works ventured into contextual media retrieval by taking into account the user’s current location and viewing direction. The introduction of egocentrism and natural language queries in architectures developed for browsing large media collections has many practical applications. Not only does it open another unexplored dimension for media retrieval (namely, “egocentrism”), but it also aids human interaction with the computer.

3 Contextual Media Retrieval

Our contextual media retrieval architecture allows users to explore a collective media collection in a spatio-temporal context through natural language questions such as “What is there in front of the university bus terminal?”, “What is there to the left of the campus center?”, “What happened here five days ago?”, “What did this place look like in December?” etc. In the following, we show how we formulate our architecture. Particular attention is paid to how to cope with the user’s dynamic context and spatial references in natural language questions. We further describe how we collect a data set which relates to our initial motivation of building a Collective Visual Memory.

[Figure 2 diagram: a natural language question (e.g. “what is there on the right of the campus center?”) passes through Semantic Parsing to a logical form, which is interpreted against a static world (e.g. cafe(name,GPS_lat,GPS_lon), atm(name,GPS_lat,GPS_lon)) and a dynamic world fed by the Collective Visual Memory (image(name,GPS_lat,GPS_lon,month), video(name,GPS_lat,GPS_lon,month)).]

Figure 2: Our probabilistic graphical model: a question $x$ in natural language (a query from a user) is automatically mapped into a logical form $z$ by the semantic parser. It is next interpreted with respect to a world $w$ to give retrievals $y$. The world consists of a static part $w_s$ concretized as a database of geographical facts, and a dynamic part $w_d$ storing media content from Collective Visual Memory and the user’s spatio-temporal context (extracted from his/her metadata).

3.1 Learning-based, Contextual Media Retrieval by Semantic Parsing

In this section, we describe how we approach learning-based, contextual media retrieval from natural language queries by a semantic parsing approach. First we describe the employed semantic parser architecture (inspired by Liang et al. [10]) and show how to extend it towards a contextual media retrieval task. The probabilistic model of our architecture is shown in Figure 2. A question $x$ (uttered by a user) is mapped to a latent logical form $z$, which is then evaluated with respect to a world $w$ (database of facts), producing an answer $y$. The world $w$ consists of $w_s$ (a static database of geographical information) and $w_d$ (a dynamic database which stores user metadata and information about media files in the Collective Visual Memory). The logical forms are represented as labeled trees and are induced automatically from question-answer pairs.

3.1.1 Question-Answering with Semantic Parsing

We build our approach on a recently proposed framework for semantic parsing [10] that has been shown to be able to answer questions about facts like geographical data and is trained solely on textual question-answer pairs. For example, the approach is capable of answering questions like “What are the major cities in California?” with {San Francisco, Los Angeles, San Diego, San Jose} as an answer. In the semantic parsing framework (left part of Figure 2, labeled Semantic Parsing and Interpretation), ‘parsing’ translates a question $x$ into its logical form $z$, and ‘interpretation’ executes $z$ on the dataset of facts (world $w$), producing its denotation $[\![z]\!]_w$, an answer. Parameters $\theta$ are estimated solely on the training question-answer pairs $(x, y)$ with an EM algorithm maximizing the following posterior distribution:

$$\theta^* = \arg\max_{\theta} \prod_{(x,y) \in \mathcal{D}} \sum_{z} \mathbf{1}\{ [\![z]\!]_w = y \} \, p(z \mid x, \theta)$$

where $\mathcal{D}$ denotes a training set, and $\mathbf{1}\{\cdot\}$ is $1$ if a condition holds and $0$ otherwise. The posterior distribution marginalizes over a latent set of valid logical forms $z$. At test time, the answer is computed from the denotation that maximizes the following posterior:

$$z^* = \arg\max_{z} p(z \mid x, \theta^*), \qquad y^* = [\![z^*]\!]_w$$
The logical forms follow a dependency-based compositional semantics (DCS) formalism [10] that consists of trees with nodes labeled with predicates and edges labeled with relations between the predicates. DCS is mainly introduced to efficiently encode feasible solutions.

The underlying principle of the parsing is built on two components – lexical semantics and compositional semantics. Lexical semantics learns a mapping from textual words into pre-defined predicates, and uses hand-designed lexical triggers that map specific parts-of-speech into a set of candidate predicates. Compositional semantics establishes relations between the predicates to generate the logical forms (DCS trees). The distribution over logical forms is modeled by a log-linear distribution $p(z \mid x, \theta) \propto \exp(\phi(x, z)^{\top} \theta)$, where the feature vector $\phi(x, z)$ measures compatibility between the question $x$ and a logical form $z$. We perform a gradient descent scheme in order to optimize for parameters $\theta$. For a more detailed exposition of the semantic parser and parameter optimization in these models, we refer the reader to Liang et al. [10]. In the following, we discuss our decomposition of the world $w$ into two parts: a static part $w_s$ and a dynamic part $w_d$ (Figure 2).
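The log-linear model and the marginalization over latent logical forms can be illustrated with a minimal sketch. It is not the parser of [10]: the candidate logical forms, their two-dimensional feature vectors and their denotations are made-up toy values, and the function names are hypothetical.

```python
import math

def log_linear(features, theta):
    """Return p(z | x, theta) for every candidate logical form z.

    features: dict mapping a candidate logical form (any hashable label)
              to its feature vector phi(x, z), given as a list of floats.
    theta:    parameter vector with the same length as each feature vector.
    """
    scores = {z: sum(f * t for f, t in zip(phi, theta))
              for z, phi in features.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp_scores = {z: math.exp(s - top) for z, s in scores.items()}
    total = sum(exp_scores.values())
    return {z: e / total for z, e in exp_scores.items()}

def answer_likelihood(features, theta, denotation, y):
    """Sum p(z | x, theta) over logical forms whose denotation equals y."""
    probs = log_linear(features, theta)
    return sum(p for z, p in probs.items() if denotation[z] == y)

# Toy candidates for one question (assumed values, for illustration only).
features = {"rightOf": [1.0, 0.0], "leftOf": [0.0, 1.0]}
denotation = {"rightOf": frozenset({"image12"}),
              "leftOf": frozenset({"image7"})}
target = frozenset({"image12"})

before = answer_likelihood(features, [0.0, 0.0], denotation, target)
after = answer_likelihood(features, [2.0, 0.0], denotation, target)
print(before, after)
```

Raising the weight of the feature that fires for the correct logical form increases the likelihood of the observed answer, which is the quantity the training objective pushes up.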

3.1.2 Static and Dynamic Worlds

The existing works that use such a semantic parser are based on a static environment [10, 1, 14]. In contrast to these, in our scenario a human user (the source of the query in Figure 2) relocates herself in space and time in a continuously changing environment. The pool of media content – Collective Visual Memory – also grows as new media is added (by multiple users, shown as the crowd icon in Figure 2). Such an environment leads us to our decomposition of the world into a static part $w_s$, which consists of geographical facts such as names of buildings or their GPS coordinates and inherits all the properties of the aforementioned previous works, and a dynamic and egocentric part $w_d$ (Figure 2).

The dynamic world $w_d$ decomposes even further into $w_{dm}$, which stores media metadata (timestamp, GPS coordinates) and is updated with the continuously growing Collective Visual Memory, and $w_{du}$, which models the user’s context by storing her metadata (GPS coordinates, viewing direction). The latter is set anew for each query before it is fed into the semantic parser. Such a representation renders the world static to the semantic parser although it is constantly changing.
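One way to picture this decomposition is a small container whose dynamic parts keep changing while each query sees a frozen snapshot. The sketch below is only illustrative: the class name, field names and tuple layouts are assumptions, not the system's actual storage format.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class World:
    # Static part: geographical facts, e.g. ("cafe", name, lat, lon).
    static_facts: list
    # Dynamic part 1: media metadata (name, lat, lon, timestamp); grows over time.
    media: list = field(default_factory=list)
    # Dynamic part 2: user context (lat, lon, view_dir); reset for every query.
    user: Optional[tuple] = None

    def add_media(self, name, lat, lon, timestamp):
        """Register a newly captured media file in the dynamic world."""
        self.media.append((name, lat, lon, timestamp))

    def snapshot(self, lat, lon, view_dir):
        """Set the user context anew and freeze the world for one query."""
        self.user = (lat, lon, view_dir)
        return {
            "static": tuple(self.static_facts),
            "media": tuple(self.media),
            "person": self.user,
        }

world = World(static_facts=[("cafe", "campus_cafe", 49.0, 7.0)])
world.add_media("image12", 49.1, 7.1, "20150511")
query_world = world.snapshot(49.0, 7.0, "east")
world.add_media("image13", 49.2, 7.2, "20150512")  # does not affect query_world
print(len(query_world["media"]), len(world.media))
```

Because the snapshot copies the media list into a tuple, the parser works on an immutable view even while new media keeps arriving.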

3.1.3 Modelling User’s Context

The user’s context is modelled through predicates person(LAT,LON,VIEW_DIR) where LAT, LON and VIEW_DIR represent the user’s current latitude, longitude and viewing direction respectively. These predicates are stored in the dynamic database of user metadata, which is updated at query time for each query.

Understanding egocentric spatial relations in natural language questions has long intrigued the research community and forms a separate research area by itself [16, 8, 3, 13]. In our work, we approach ambiguity in the frame of reference [15] by defining predicates that resolve the spatial relations “front of”, “behind”, “left of” and “right of” based on the geomagnetic reference frame as well as the user-centric reference frame. Each of these spatial relations is modeled as a two-argument predicate such as frontOf(A,B), behind(A,B), leftOf(A,B) and rightOf(A,B), where A denotes the GPS coordinates of the entity in question (extracted from the static world $w_s$) and B denotes the GPS coordinates of the media files in the dynamic media database $w_{dm}$.
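Under the geomagnetic frame, where “front of” is read as “north of”, “right of” as “east of” and so on, the four predicates could be sketched as follows. The dominant-axis rule, the absence of a distance cut-off and the use of raw degree differences (no correction for longitude shrinking with latitude) are simplifying assumptions of this sketch, and the snake_case names stand in for frontOf, behind, leftOf and rightOf.

```python
def _offset(a, b):
    """Return (dlat, dlon) from entity a to media location b, in degrees."""
    return b[0] - a[0], b[1] - a[1]

def front_of(a, b):
    """b lies mainly to the north of a."""
    dlat, dlon = _offset(a, b)
    return dlat > 0 and abs(dlat) >= abs(dlon)

def behind(a, b):
    """b lies mainly to the south of a."""
    dlat, dlon = _offset(a, b)
    return dlat < 0 and abs(dlat) >= abs(dlon)

def right_of(a, b):
    """b lies mainly to the east of a."""
    dlat, dlon = _offset(a, b)
    return dlon > 0 and abs(dlon) > abs(dlat)

def left_of(a, b):
    """b lies mainly to the west of a."""
    dlat, dlon = _offset(a, b)
    return dlon < 0 and abs(dlon) > abs(dlat)

# Hypothetical coordinates: an entity and a photo taken slightly north of it.
entity = (49.2570, 7.0450)
photo = (49.2580, 7.0451)
print(front_of(entity, photo), right_of(entity, photo))
```

With this rule every media location other than the entity's own position satisfies exactly one of the four predicates.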

Similarly, the temporal references in questions (e.g. “what happened here five days ago?”, “what did this place look like in December?”) are modelled through the predicate day(X), where X is the referenced time-stamp (for example 20150511). These are resolved by triggering a predicate view(A), where A is the list of media files having the same time-stamp as that in the predicate day().
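The temporal resolution step can be sketched in a few lines. The helper names mirror the predicates day() and view(), but the signatures, the query date and the media entries are assumptions made for illustration.

```python
from datetime import date, timedelta

def day(reference, days_ago):
    """Resolve 'N days ago' relative to the query date into a YYYYMMDD stamp."""
    return (reference - timedelta(days=days_ago)).strftime("%Y%m%d")

def view(media, stamp):
    """Return the names of media files whose time-stamp equals the stamp."""
    return [name for name, media_stamp in media if media_stamp == stamp]

# Hypothetical media metadata: (name, time-stamp).
media = [("image3", "20150511"), ("video2", "20150511"), ("image9", "20150514")]

query_date = date(2015, 5, 16)          # assumed date of the query
stamp = day(query_date, 5)              # "five days ago"
print(stamp, view(media, stamp))
```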

However, it is difficult to understand the hidden intent of contextual questions that include an egocentric reference frame, because humans do not adhere to any consistent reference frame. They may consider their own physical “left hand” for “left of” or the physical “left side” of the geographical entity. The first possibility is tackled by programming the semantic parser to follow the geomagnetic reference frame. Then, with the assumption that the direction in which the human user faces is the local north, the spatial reference in the query is modified in a pre-processing step. This is explained in Figure 3: if the user faces east and queries for “What is there in front of postbank?”, the question is changed during pre-processing to “What is there on the right of postbank?”. The semantic parser then predicts an answer for this changed question. For simplicity we have narrowed the heading down to the four basic directions: north, south, east and west.

Figure 3: Modification of spatial reference in query for integrating egocentrism to media retrieval
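The pre-processing step amounts to rotating the spatial phrase by the user's heading. The following sketch assumes the four phrasings listed below and plain substring replacement; the actual system may normalize queries differently.

```python
# Canonical phrases in clockwise compass order: north, east, south, west.
CANONICAL = ["in front of", "on the right of", "behind", "on the left of"]
# Number of clockwise quarter turns from north for each supported heading.
HEADINGS = {"north": 0, "east": 1, "south": 2, "west": 3}

def rewrite(question, heading):
    """Rewrite an egocentric spatial phrase into the geomagnetic frame."""
    shift = HEADINGS[heading]
    for index, phrase in enumerate(CANONICAL):
        if phrase in question:
            rotated = CANONICAL[(index + shift) % 4]
            return question.replace(phrase, rotated, 1)
    return question  # no supported spatial phrase found

print(rewrite("What is there in front of postbank?", "east"))
```

For a user facing east, this turns “What is there in front of postbank?” into “What is there on the right of postbank?”, matching the example above; a user facing north leaves the question unchanged.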

3.1.4 Media Retrieval as Answers

In contrast to previous work on question answering [10, 14], we desire to retrieve media as answers to natural language questions instead of textual information. This can be modeled by generating references to media files as denotations of logical forms. For example, the question “What is there on the right of the campus center?” would be transformed into the following symbolic representation (after training the system):
answer(A, (rightOf(A,B), const(B, ‘campus_center’))) with its denotation {‘image12’, ‘image58’, ‘image234’, …}, where ‘image12’, ‘image58’ and so on are references to images with visual contents depicting geographical entities on the right of the university campus center. The answers are predicted with respect to a world which consists of the name, timestamp, GPS coordinates and the month of the media file acquisition. Once the denotations of a logical form are predicted, the actual media files are extracted from the Collective Visual Memory (physically, a file system storing all captured media). These extracted media files are then returned to the user. Figure 1 shows examples of retrieved results.
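To make the notion of a denotation concrete, the sketch below evaluates this one logical form by hand against a tiny world and then looks up file paths. It is not a DCS interpreter: the coordinates, the file paths and the “east of, along the dominant axis” reading of rightOf are all assumptions.

```python
def answer_right_of(static_world, media_world, entity_name):
    """Evaluate answer(A, (rightOf(A,B), const(B, entity_name))) by hand."""
    b_lat, b_lon = static_world[entity_name]          # const(B, entity_name)
    names = []
    for name, (lat, lon) in media_world.items():      # candidate bindings for A
        dlat, dlon = lat - b_lat, lon - b_lon
        if dlon > 0 and abs(dlon) > abs(dlat):        # A lies mainly east of B
            names.append(name)
    return sorted(names)

# Hypothetical static world, media metadata and file-system index.
static_world = {"campus_center": (49.2560, 7.0420)}
media_world = {
    "image12": (49.2561, 7.0440),
    "image58": (49.2559, 7.0435),
    "image7": (49.2560, 7.0400),
    "image9": (49.2590, 7.0421),
}
file_index = {name: f"media/{name}.jpg" for name in media_world}

denotation = answer_right_of(static_world, media_world, "campus_center")
retrieved = [file_index[name] for name in denotation]
print(denotation, retrieved)
```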

3.2 Data Collection

To enable the spatio-temporal exploration of a certain geographic area we inherently require a database which records physical features on the ground along with their types (e.g. building, cafe, highway, etc.), names and GPS locations – this constitutes our static world $w_s$. To support media retrieval we need a database of images and videos rich with metadata. We also require natural language queries paired with corresponding media content as retrievals for training and testing our query-retrieval model. In the absence of a suitable benchmark, we needed to record our own data set, where the geographical facts ($w_s$) are obtained from OpenStreetMap, and the Collective Visual Memory and query-retrieval pairs are collected from regular users.

3.2.1 Geographical Facts

OpenStreetMap [5] is a freely-available and well-documented collection of geographical data. The topological data structure used has four basic elements or data primitives: nodes, ways, relations and tags. Physical entities on the ground such as buildings, highways, ATMs, banks, restaurants etc. are registered in the map database in terms of these data primitives.

(a) A section of the OpenStreetMap view of the university campus

<node id="344240596" visible="true" version="6" changeset="9208001" timestamp="2011-09-04T11:43:28Z" user="arnhar" uid="495739" lat="81.24279" lon="35.18783">
  <tag k="amenity" v="bus_stop"/>
  <tag k="name" v="Universität Mensa"/>
</node>

(b) XML rendition of the physical entity shown in Figure 4(a)

(c) Entry in $w_s$ corresponding to the entity in Figure 4(b)
Figure 4: Example of OpenStreetMap data

In our study, we restrict the spatial scope of our system to a university campus. Figure 4(a) shows the map view of a part of the university campus depicting a physical entity on the ground – a bus-stop named “Universität Mensa”. The XML rendition of this part of the map (available for download) is shown in Figure 4(b). We used information such as the type of the physical entity (e.g. building, cafe, highway etc.), their names, and their GPS coordinates as our static database of facts ($w_s$) (Figure 4(c)).
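Turning an OpenStreetMap node into a static-world fact can be sketched with Python's standard XML parser. The output layout (type, name, latitude, longitude) follows the predicate shapes shown in Figure 2, such as cafe(name,GPS_lat,GPS_lon), but the exact entry format used by the system is an assumption here, as is taking the type from the amenity tag.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the node from the OpenStreetMap example (fewer attributes).
OSM_NODE = """<node id="344240596" lat="81.24279" lon="35.18783">
  <tag k="amenity" v="bus_stop"/>
  <tag k="name" v="Universität Mensa"/>
</node>"""

def node_to_fact(xml_text):
    """Convert one OSM <node> element into (type, name, lat, lon)."""
    node = ET.fromstring(xml_text)
    tags = {tag.get("k"): tag.get("v") for tag in node.findall("tag")}
    return (
        tags.get("amenity"),        # entity type, taken from the amenity tag
        tags.get("name"),           # human-readable name
        float(node.get("lat")),
        float(node.get("lon")),
    )

fact = node_to_fact(OSM_NODE)
print(fact)
```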

3.2.2 Collective Visual Memory

Participants were asked to capture media (images and videos) at various locations of the university campus for a month using their mobile devices. In total our instance of the Collective Visual Memory consists of 1025 images and 175 videos. Metadata such as GPS coordinates and time-stamp registered with each media file constitute our media database $w_{dm}$.

The process of media acquisition was coupled with the collection of natural language questions. Participants were instructed to formulate a question and capture the photo(s)/video(s) that they would expect as the corresponding answer. 1000 question-answer pairs with spatial references were collected (one question could have multiple answers). Question-answer pairs with temporal references could not be collected because events from the past cannot be captured after the fact. The data set was randomized and divided into two parts – 500 training questions and 500 test questions. To introduce a sufficient amount of variation in natural language we chose participants from different cultural and linguistic backgrounds. We will make our data set (the query-retrieval pairs, Collective Visual Memory and the geographical facts) publicly available at time of publication.

4 Experiments

For our experiments we use the geographical facts and a Collective Visual Memory as described in the previous section. We use a dataset consisting of query-retrieval pairs formulated by real-life users. It consists of user queries which follow no particular template and contain spatial relations in addition to those pre-defined as predicates, such as “near”, “beside”, “ahead of”, “opposite to” etc.

In this section we describe the experiments conducted, state their results and discuss our observations. We further propose the concept of personalization of a media retrieval system to adapt to specific user perceptions. Finally, we provide a qualitative assessment of the usefulness of our contextual media retrieval system.

4.1 Evaluation of Learning Procedure

To study the effect of learning on prediction accuracy we first trained a model with synthetically generated query-retrieval pairs (SynthModel). The queries are generated by templates – “what is there <spatial relation> of X?”, “what happened here Y days/weeks/months/years ago?”, “what did this place look like in Z?”, where <spatial relation> ∈ {“in front”, “behind”, “on the right”, “on the left”}, X ∈ {names of buildings, cafes, restaurants etc.}, Y ∈ {natural numbers} and Z ∈ {names of months}. The contextual cues ‘here’ and ‘this place’ are fixed to a particular location. The retrievals follow pre-defined rules to resolve spatial (according to the geomagnetic reference frame) and temporal relations. The untrained model was found to have a prediction accuracy of 11.23%. We observe a strong improvement of performance to 46% from as few as 200 training examples (Figure 5).
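Template-based query generation of this kind can be sketched as below. The entity names, counts and months passed in are placeholders, and the function names are hypothetical; only the three templates come from the text.

```python
from itertools import product

SPATIAL = ["in front", "behind", "on the right", "on the left"]
UNITS = ["days", "weeks", "months", "years"]

def spatial_queries(entities):
    """Instantiate 'what is there <spatial relation> of X?'."""
    return [f"what is there {rel} of {x}?" for rel, x in product(SPATIAL, entities)]

def temporal_queries(counts):
    """Instantiate 'what happened here Y days/weeks/months/years ago?'."""
    return [f"what happened here {y} {unit} ago?" for y, unit in product(counts, UNITS)]

def month_queries(months):
    """Instantiate 'what did this place look like in Z?'."""
    return [f"what did this place look like in {z}?" for z in months]

queries = (
    spatial_queries(["postbank", "campus center"])   # placeholder entity names
    + temporal_queries([2, 5])
    + month_queries(["December"])
)
print(len(queries), queries[0])
```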

Figure 5: Effect of increasing number of training examples on prediction accuracy

Regular users use a variety of grammatical constructs, as is common in spoken language. Therefore the queries collected from them were rich with a number of spatial relations not restricted to the ones we represent as predicates (section 3.1.3). Also, the answers to similar queries were subjective. To account for the variability and subjectivity in this type of data, and to study the effect of learning on prediction accuracy, we trained a query-retrieval model on human queries and retrievals (HumanModel). As before, we used a weakly supervised learning approach that only requires query-retrieval pairs without any supervision on the logical forms. The model was trained through a human-in-the-loop training procedure using a relevance feedback mechanism. Since the human trainer was familiar with the geographic scope of our work, it was also possible to provide feedback on the retrievals of the temporal queries. We found that during training the query-retrieval model learned to associate different spatial relations with pre-defined predicates. For example, the parser learned to map the spatial relation “ahead of” to the predicate frontOf().

A comparison to the previous model shows that the HumanModel (26.67%) yields greater recall than the SynthModel (15.88%) on queries collected from humans (Figure 6), where recall is defined as the percentage of relevant retrievals among all test queries. This shows that our HumanModel is able to learn and adapt to the variations in natural language utterances and also interpret a variety of spatial relations in spoken queries. We use this model for our evaluations described in the following sections.

Figure 6: Recall of HumanModel and SynthModel

4.2 Model Evaluation

Since a contextual application calls for the involvement of prospective users and their satisfaction in using it, we decide on a qualitative assessment of the system.

Humans are inherently inconsistent in their perception of directions and idea of reference frames [9, 15]. The nature of understanding/speaking English questions also has variations based on a person’s socio-cultural background. Hence, a system relying on fixed question templates and a particular set of rules to resolve spatial references does not guarantee high accuracy. A satisfactory result for one person may prove to be irrelevant for another. To better understand these perceptual biases and yet efficiently analyze the system, a series of user studies were conducted.

4.2.1 Evaluation of the Retrieved Results and Human Disagreements

The goal of this user study is to observe how accurate regular users found our system. Five users were asked to evaluate the retrieved results for 500 test questions as “relevant” or “irrelevant”. The study was conducted in a lab set-up. Users looked at retrieved results for each question on a computer screen and stated whether they found the retrievals relevant or irrelevant to the question. A canonical reference frame was used in this experiment to resolve spatial relationships in queries. According to this convention, “front of” meant “north of”, “behind” meant “south of”, “right of” meant “east of” and “left of” meant “west of”.

We observed that for each question the opinions varied. Based on this observation we divide the test questions into six groups – (5,0), queries for which all five users agreed that the retrievals were relevant; (4,1), queries for which four users found the retrievals relevant and one user found them irrelevant; and so on down to (0,5). Figure 7 depicts the result of this analysis. For 26.67% of the queries all five users deemed the retrievals relevant. However, if we consider the cases in which most of the users found the retrievals relevant, this number rises to 40%.
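The grouping into (relevant, irrelevant) vote counts is straightforward to compute. The votes below are invented toy data, not the study's ratings, and the function names are hypothetical.

```python
from collections import Counter

def agreement_groups(votes):
    """Map each (relevant, irrelevant) group to its share of queries in percent.

    votes: one list of booleans per query, True meaning 'relevant'.
    """
    counts = Counter((sum(v), len(v) - sum(v)) for v in votes)
    return {group: 100.0 * n / len(votes) for group, n in counts.items()}

def majority_relevant(votes):
    """Percentage of queries that most users judged relevant."""
    hits = sum(1 for v in votes if sum(v) > len(v) - sum(v))
    return 100.0 * hits / len(votes)

# Toy votes from five users on four queries (made-up values).
votes = [
    [True, True, True, True, True],
    [True, True, True, True, False],
    [True, True, False, False, False],
    [False, False, False, False, False],
]
print(agreement_groups(votes), majority_relevant(votes))
```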

Figure 7: Inter-user variability in opinion

The numbers in the middle region of the graph in Figure 7 point out the prominent difference in opinions among participants. This accounts for about 25% of all queries. We observed that the inter-user variability stems from the inherent inconsistencies with regard to reference frame resolution. This result also hints at the difficulty of the problem at hand, since satisfactory answers for one user may be unsatisfactory for others. The high agreement in the last column is due to some unavoidable factors: scanty media content (our geographic scope could not be well covered in images and videos due to lack of infrastructure), incorrect POS tagging (which resulted in an incorrect retrieval type, e.g. text), etc.

From the observation in this user study that humans disagreed in their opinions of relevance and irrelevance, we conjecture that instead of using the geomagnetic reference frame, the use of user-centric reference frames for retrieving answers could improve the performance of the system. By deploying the user-centric reference frame we mean following the user’s physical egocentric directions – for example, her ‘right hand side’ for “right of” etc. (explained in greater detail in section 3.1.3).

4.2.2 Canonical and User-centric Reference Frame

In order to study the impact of using two different conventions of spatial relations resolution, we conducted this user study. Users were given two sets of retrieved results for each question – one set of media files retrieved according to the geomagnetic reference frame and the second set retrieved according to the user-centric reference frame. The experimental settings are similar to the previous user study.

Figure 8 shows the result of this user study. user1 and user3 remained neutral to the use of separate reference frames while the other users slightly preferred the canonical reference frame over the user-centric reference frame. This observation further highlights the subjectivity of the task.

Figure 8: Difference in reference frame resolution among humans

4.2.3 Personalization of Xplore-M-Ego

        U1      U2      U3      U4      U5
M1    27.7     9.6   16.23   13.35   17.27
M2   21.46    37.8    35.6   25.36   26.58
M3   18.48    15.7   43.85   17.87    33.9
M4   15.42   25.87    35.5   41.25   29.85
M5   14.07   18.59    38.7   28.64   62.43
(a) Precision
        U1      U2      U3      U4      U5
M1   23.39     8.2   13.68   11.25   14.56
M2   19.42   34.22    32.2    22.9   24.06
M3    13.6   11.47    32.5   13.02   24.72
M4   13.68   22.95    31.5   33.33   26.49
M5    6.18    8.16    16.9   12.58   27.15
(b) Recall
        U1      U2      U3      U4      U5
M1   25.36    8.84   14.84   12.21   15.79
M2   20.38    35.9    33.9   24.06   25.25
M3   15.66   13.25   37.33   15.06   28.59
M4   14.49   24.32   33.38   36.86   28.06
M5    8.58   11.34   23.53   17.48   37.84
(c) F1-score
Figure 9: Quantitative analysis of personalization of Xplore-M-Ego

Having observed this inter-person subjectivity, we hypothesize that personalization of our media retrieval system would increase its accuracy on a per user basis. The user study which we discuss in this section was conducted to investigate this hypothesis.

By using an online relevance feedback mechanism, five users (U1 to U5) were asked to train five different query-retrieval models (M1 to M5) with 500 questions from the data set collected from regular users. Every user was then asked to evaluate all five models, with the identity of the model each of them had trained kept hidden.

The results of the quantitative analysis of this study – precision, recall and F1-score – are shown in Figure 9. The diagonals show the user-specific evaluation results and the rows depict inter-user evaluation results. The difference in opinion among the users is very prominent, highlighting the challenge involved in the machine understanding of hidden human intent in natural language. Nonetheless, it is clear from the figure that users deemed their own models more accurate than those trained by others. This observation leads us to believe that the query-retrieval model can be trained over time through relevance feedback to adapt to user-specific preferences of spatial relation resolution – hence, it should be personalized. This supports our hypothesis that personalization of our media retrieval system increases its accuracy on a per-user basis.
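The F1-scores in Figure 9(c) appear to be the usual harmonic mean of the corresponding precision and recall entries; the short check below recomputes two diagonal cells from the values in Figures 9(a) and 9(b). The function name is ours, and the harmonic-mean definition is assumed rather than stated in the text.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) for two diagonal cells of Figure 9.
cells = {"M3/U3": (43.85, 32.5), "M5/U5": (62.43, 27.15)}
for label, (p, r) in cells.items():
    print(label, round(f1_score(p, r), 2))
```

The recomputed values, 37.33 and 37.84, agree with the corresponding entries of Figure 9(c).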

4.2.4 User Experience Evaluation

To understand the usefulness of our contextual media retrieval system, we conducted a usability/desirability study. 10 participants were given a Google Glass with our client-side application installed and asked to walk around the university campus while making voice queries that involve spatio-temporal references. Afterward they were asked to fill in the USE Questionnaire [12]. This questionnaire has four groups of questions – Usefulness, Ease of Use, Ease of Learning and Satisfaction. Each question can be rated on a scale from 1 to 7, 1 meaning ‘strongly disagree’ and 7 meaning ‘strongly agree’. The 10 questions most representative of the entire questionnaire were chosen. The mean and standard deviation of the ratings of these questions are shown in Table 1.


USE Questionnaire Mean SD
It is useful. 6.2 0.63
It saves me time when I use it. 6.1 0.73
It is easy to use. 6.3 0.48
I can use it without written instructions. 5.8 1.22
Both occasional and regular users would like it. 5.4 1.42
I learned to use it quickly. 6.5 0.52
I quickly became skillful with it. 6.1 0.99
I am satisfied with it. 5.5 0.52
It is fun to use. 6.3 0.82
I would recommend it to a friend. 5.9 0.73
Table 1: User Experience Evaluation: Mean Rating and Standard Deviation. The grades are between 1 (’strongly disagree’) and 7 (’strongly agree’)

The result of this evaluation shows that regular users strongly agree that our contextual media retrieval application is useful in daily life. Moreover, they find the application easy to use, very easy to learn and they are satisfied with the outcome.

5 Discussion

Due to the complex nature of our problem that involves natural language queries, media and map data, human concepts, in particular of spatial and temporal language, and complex contextual cues, we have faced a wide range of challenges. We highlight three of these in this section and discuss limitations and future work.

Frame of reference: Proper spatial resolution is required for successful communication with machines. Unfortunately, there is no unique frame of reference, and hence even a simple statement involving “left of” has different meanings for different users. However, our findings suggest two promising research directions for the reference frame resolution task. First, our inspection of the user study shows that users often resolve spatial relations in spoken language for the navigation task according to the frontal direction of the physical object (see the observations in Section 4.2.2). Hence, a suitable map database that stores information about the frontal direction of objects would help. We are unaware of any such database or of efforts to augment existing map data with such meta information. Second, our study on personalized Xplore-M-Ego suggests a more individual approach in which the architecture learns to understand spatial relations by interacting with the user. While our online learning approach is a promising first step in this direction, more complex models of person-specific biases and shared notions across users could further improve the learning.
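The ambiguity of “left of” can be made concrete with a small geometric sketch. It assumes a flat 2D map (x = east, y = north) and compares two hypothetical frames: one anchored to the viewing direction of the user and one anchored to the frontal direction of the object. The function name, coordinates and directions are illustrative assumptions, not part of Xplore-M-Ego.

```python
def is_left_of(point, landmark, forward):
    """Return True if `point` lies on the left-hand side of `landmark`
    with respect to the 2D direction `forward` (x = east, y = north)."""
    # Rotating `forward` by 90 degrees counter-clockwise gives the "left" axis.
    left = (-forward[1], forward[0])
    offset = (point[0] - landmark[0], point[1] - landmark[1])
    # A positive projection onto the left axis means the point is on the left.
    return offset[0] * left[0] + offset[1] * left[1] > 0


landmark = (0.0, 0.0)          # hypothetical building position
building_front = (0.0, -1.0)   # the building's front faces south
user = (0.0, -10.0)            # the user stands south of the building
user_view = (landmark[0] - user[0], landmark[1] - user[1])  # user looks north

candidate = (-5.0, 0.0)        # a spot directly west of the building

left_for_viewer = is_left_of(candidate, landmark, user_view)
left_for_building = is_left_of(candidate, landmark, building_front)
print(left_for_viewer, left_for_building)
```

In this configuration the same spot is “left of” the building in the viewer-centred frame but not in the object-centred frame, which is the kind of disagreement that a stored frontal direction or a learned per-user preference would have to resolve.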

Diversity of Named Entities: Our approach uses a static database that contains information about the geographical entities, for instance the name of the entity, extracted from OpenStreetMap. However, the participants in our study used a number of different names to refer to the same entity: the formal full name, an acronym, a popular name, or even a name in a different language. Handling such diversity is a complicated task for the semantic parser, and hence we resort to manually adding all possible common names for each entity to the database. However, such human annotation may still be incomplete. An alternative way to improve the coverage of the database is to use suitable knowledge bases containing acronyms and regional names of geographical entities, or to crawl additional web resources. To the best of our knowledge, collecting such synonyms of map entities is currently not being pursued, but it would greatly benefit applications that rely on map data, such as ours.
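The manual alias annotation described above can be pictured as a lookup table from every known surface form to one canonical entity. The sketch below is only an illustration under that assumption; the entity names, aliases and helper functions are hypothetical and do not come from the paper's database.

```python
def normalise(name):
    """Lower-case and collapse whitespace so lookups ignore surface variation."""
    return " ".join(name.lower().split())


def build_alias_index(entities):
    """Map every known alias (and the canonical name itself) to the canonical name."""
    index = {}
    for canonical, aliases in entities.items():
        for alias in (canonical, *aliases):
            index[normalise(alias)] = canonical
    return index


# Hypothetical entries: full name -> acronyms, popular names, foreign names.
entities = {
    "University Library": ["library", "Bibliothek", "UL"],
    "Computer Science Building": ["CS building", "Informatik"],
}
index = build_alias_index(entities)


def resolve(mention):
    """Return the canonical entity for a mention, or None if it is unknown."""
    return index.get(normalise(mention))


print(resolve("  bibliothek "))  # resolves via the foreign-language alias
print(resolve("Mensa"))          # unknown alias, illustrating incomplete coverage
```

An exact-match table like this fails silently on any name that was not annotated, which is why external knowledge bases or crawled resources are an attractive complement.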

Scalability: The program induction step of the semantic parser, in which a logical form is searched for over a large space of possible predicates and their relations (Eq. 1 and 2), is computationally demanding and does not scale well to a large number of predicates representing geographical facts. We deal with this problem by restricting the spatial scope to a university campus. In deployment, we envision a system that works directly within the spatial scope of the user and updates the database with geographical facts as the user moves through space and time.
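One way to picture the envisioned scoping is to keep only those geographical facts that lie within some radius of the user's current position and to recompute this set as the user moves. The sketch below assumes facts are stored with latitude and longitude and uses the haversine great-circle distance; the record layout, coordinates and radius are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def facts_in_scope(facts, user_lat, user_lon, radius_m):
    """Keep only the facts whose location lies within `radius_m` of the user."""
    return [
        fact for fact in facts
        if haversine_m(user_lat, user_lon, fact["lat"], fact["lon"]) <= radius_m
    ]


# Hypothetical fact records; the coordinates are illustrative only.
facts = [
    {"name": "near entity", "lat": 49.2570, "lon": 7.0450},
    {"name": "far entity", "lat": 49.4000, "lon": 7.0450},
]
nearby = facts_in_scope(facts, user_lat=49.2560, user_lon=7.0440, radius_m=500)
print([fact["name"] for fact in nearby])
```

A linear scan is enough for a sketch; a deployed system would more plausibly use a spatial index so that the scope can be refreshed cheaply as the user relocates.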

6 Conclusion

In this paper we proposed Xplore-M-Ego, a novel system for media retrieval using spatio-temporal natural language queries in a dynamic setting. Our work opens a new direction in this paradigm by exploiting a user’s current context. Our approach is based on a semantic parser that infers interpretations of natural language queries. We contribute several extensions that enable users to refer dynamically to their context through spatial and temporal concepts. We further analyzed the system in several user studies that highlight the importance of our adaptive and personalized training approaches.


  • Berant and Liang [2014] J. Berant and P. Liang. Semantic parsing via paraphrasing. In Proceedings of ACL, 2014.
  • Clarke et al. [2010] J. Clarke, D. Goldwasser, M.-W. Chang, and D. Roth. Driving semantic parsing from the world’s response. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 18–27. Association for Computational Linguistics, 2010.
  • Guadarrama et al. [2013] S. Guadarrama, L. Riano, D. Golland, D. Göhring, Y. Jia, D. Klein, P. Abbeel, and T. Darrell. Grounding spatial relations for human-robot interaction. In Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, pages 1640–1647. IEEE, 2013.
  • Hakeem et al. [2009] A. Hakeem, M. W. Lee, O. Javed, and N. Haering. Semantic video search using natural language queries. pages 605–608, 2009.
  • Haklay and Weber [2008] M. Haklay and P. Weber. OpenStreetMap: User-generated street maps. Pervasive Computing, IEEE, 7(4):12–18, 2008.
  • Hwang et al. [2007] M. Hwang, H. Kong, S. Baek, and P. Kim. A method for processing the natural language query in ontology-based image retrieval system. In Adaptive Multimedia Retrieval: User, Context, and Feedback, pages 1–11. Springer, 2007.
  • Kucuktunc et al. [2007] O. Kucuktunc, U. Güdükbay, and Ö. Ulusoy. A natural language-based interface for querying a video database. IEEE MultiMedia, 14(1):83–89, 2007.
  • Lan et al. [2012] T. Lan, W. Yang, Y. Wang, and G. Mori. Image retrieval with structured object queries using latent ranking SVM. In Computer Vision–ECCV 2012, pages 129–142. Springer, 2012.
  • Levinson [2003] S. C. Levinson. Space in language and cognition: Explorations in cognitive diversity, volume 5. Cambridge University Press, 2003.
  • Liang et al. [2013] P. Liang, M. I. Jordan, and D. Klein. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446, 2013.
  • Lum and Kim [1992] V. Lum and K.-c. K. Kim. Intelligent natural language processing for media data query. In Proc. 2nd Int. Golden West Conf. on Intelligent Systems, 1992.
  • Lund [2001] A. M. Lund. Measuring usability with the USE questionnaire. Usability Interface, 8(2):3–6, 2001.
  • Malinowski and Fritz [2014a] M. Malinowski and M. Fritz. A pooling approach to modelling spatial relations for image retrieval and annotation. arXiv:1411.5190 [cs.CV], November 2014a.
  • Malinowski and Fritz [2014b] M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems (NIPS), pages 1682–1690, 2014b.
  • Malinowski and Fritz [2014c] M. Malinowski and M. Fritz. Towards a visual Turing challenge. In NIPS Workshop on Learning Semantics, 2014c.
  • Regier and Carlson [2001] T. Regier and L. A. Carlson. Grounding spatial language in perception: an empirical and computational investigation. Journal of Experimental Psychology: General, 130(2):273, 2001.
  • Snavely et al. [2006] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. ACM Transactions on Graphics (TOG), 25(3):835–846, 2006.
  • Tellex and Roy [2009] S. Tellex and D. Roy. Towards surveillance video search by natural language query. In Proceedings of the ACM International Conference on Image and Video Retrieval, page 38. ACM, 2009.
  • Tompkin et al. [2012] J. Tompkin, K. I. Kim, J. Kautz, and C. Theobalt. Videoscapes: exploring sparse, unstructured video collections. ACM Transactions on Graphics (TOG), 31(4):68, 2012.
  • Tompkin et al. [2013] J. Tompkin, F. Pece, R. Shah, S. Izadi, J. Kautz, and C. Theobalt. Video collections in panoramic contexts. In Proceedings of the 26th annual ACM symposium on User interface software and technology, pages 131–140. ACM, 2013.
  • Wong and Mooney [2007] Y. W. Wong and R. J. Mooney. Learning synchronous grammars for semantic parsing with lambda calculus. In Annual Meeting of the Association for Computational Linguistics, volume 45, page 960. Citeseer, 2007.
  • Wu and Tory [2009] F. Wu and M. Tory. Photoscope: visualizing spatiotemporal coverage of photos for construction management. pages 1103–1112, 2009.
  • Zettlemoyer and Collins [2012] L. S. Zettlemoyer and M. Collins. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. arXiv preprint arXiv:1207.1420, 2012.