Identifying, classifying, and talking about objects or events in the surrounding environment are key capabilities for intelligent, goal-driven systems that interact with other agents and the external world (e.g. robots, smart spaces, and other automated systems). To this end, there has recently been a surge of interest in, and significant progress on, a variety of related tasks, including the generation of Natural Language (NL) descriptions of images, and the identification of images from NL descriptions [karpathy2014deep, bruni2014multimodal, socher2014grounded, Farhadi09describingobjects, silberer-lapata:2014:P14-1, sun2013attribute].
Our goal is to build interactive systems that can learn grounded word meanings relating to their perceptions of real-world objects. This differs from previous work such as [roy2002describer], which learns groundings from descriptions without any interaction, and from more recent work using Deep Learning methods (e.g. [socher2014grounded]).
Most of these systems rely on large quantities of training data and offer no possibility of online error correction. Furthermore, they are unsuitable for robots and multimodal systems that need to learn continuously and incrementally from the environment, and that may encounter objects they haven't seen in training data. These limitations can be alleviated if systems learn concepts, as and when needed, from situated dialogue with humans. Interaction with a human tutor also enables a system to take the initiative and seek the particular information it needs or lacks, e.g. by asking the questions with the highest information gain (see e.g. [DBLP:conf/iros/SkocajKVMJKHHKZZ11]). For example, a robot could ask questions to learn the colour of a “square”, or request to be shown more “red” things to improve its performance on that concept (see e.g. Fig. 1). Furthermore, such systems allow for meaning negotiation in the form of clarification interactions with the tutor.
This setting means that the system must be trainable from little data, compositional, adaptive, and able to handle natural human dialogue with all its glorious context-sensitivity and messiness – for instance, so that it can learn visual concepts suited to a specific task/domain, or even to a particular user. Interactive systems that learn continuously from humans over the long run need to do so incrementally, quickly, and with minimal effort/cost to their human tutors.
In this paper, we first outline an implemented dialogue system that integrates an incremental, semantic grammar framework especially suited to dialogue processing – Dynamic Syntax and Type Theory with Records (DS-TTR, available from http://dylan.sourceforge.net) [Kempson.etal01, Eshghi.etal12] – with visual classifiers which are learned during the interaction, and which provide perceptual grounding for the basic semantic atoms in the semantic representations (Record Types in TTR) produced by the parser (see Fig. 1, Fig. 2 and section 3).
We then use this system in interaction with a simulated human tutor, to test hypotheses about how the accuracy of learned meanings, learning rates, and the overall cost/effort for the human tutor are affected by different dialogue policies and capabilities: (1) who takes initiative in the dialogues; (2) the agent’s ability to utilise their level of uncertainty about an object’s attributes; and (3) their ability to process elliptical as well as incrementally constructed dialogue turns. The results show that differences along these dimensions have significant impact both on the accuracy of the grounded word meanings that are learned, and the processing effort required by the tutors.
In section LABEL:sec:adaptive we train an adaptive dialogue strategy that finds a better trade-off between classifier accuracy and tutor cost.
2 Related work
In this section, we present an overview of vision and language processing systems, as well as multi-modal systems that learn to associate the two. We compare them along two main dimensions: whether visual classification is learned offline or online, and the kind of representations learned/used.
Online vs. Offline Learning.
A number of implemented systems have shown good performance on classification as well as NL-description of novel physical objects and their attributes, either using offline methods as in [Farhadi09describingobjects, DBLP:journals/pami/LampertNH14, DBLP:conf/nips/SocherGMN13, DBLP:journals/tkde/KongNZ13], or through an incremental learning process, where the system’s parameters are updated after each training example is presented to the system [DBLP:journals/nn/FuraoH06, DBLP:journals/nca/ZhengSFZ13, DBLP:journals/tcyb/KristanL14]. For the interactive learning task presented here, only the latter is appropriate, as the system is expected to learn from its interactions with a human tutor over a period of time. Shen & Hasegawa DBLP:journals/nn/FuraoH06 propose the SOINN-SVM model that re-trains linear SVM classifiers with data points that are clustered together with all the examples seen so far. The clustering is done incrementally, but the system needs to keep all the examples seen so far in memory. Kristan & Leonardis DBLP:journals/tcyb/KristanL14, on the other hand, propose the oKDE model that continuously learns categorical knowledge about visual attributes as probability distributions over the categories (e.g. colours). However, when learning from scratch, it is unrealistic to predefine these concept groups (e.g. that red, blue, and green are colours). Systems need to learn for themselves that, e.g., colour is grounded in a specific sub-space of an object’s features. For the visual classifiers, we therefore assume no such category groupings here, and instead learn an individual binary classifier for each visual attribute (see section 3.1 for details).
Distributional vs. Logical Representations.
Learning to ground natural language in perception is one of the fundamental problems in Artificial Intelligence. There are two main strands of work that address this problem: (1) those that learn distributional representations using Deep Learning methods: this often works by projecting vector representations from different modalities (e.g. vision and language) into the same space in order to be able to retrieve one from the other [socher2014grounded, DBLP:conf/cvpr/KarpathyL15, silberer-lapata:2014:P14-1]; (2) those that attempt to ground symbolic logical forms, obtained through semantic parsing [DBLP:journals/ml/TellexTJR14, kollar2013toward, DBLP:conf/aaai/MatuszekBZF14], in classifiers of various entity types, events, or relations in a segment of an image or a video. One advantage of the latter over the former method is that it is strictly compositional, i.e. the contribution of the meaning of an individual word, or semantic atom, to the whole representation is clear, whereas this is hard to say about the distributional models. As noted, our work also uses the latter methodology, though it is dialogue, rather than sentence semantics, that we care about. Most similar to our work is probably that of Kennington & Schlangen Kennington.Schlangen15, who learn a mapping between individual words – rather than logical atoms – and low-level visual features (e.g. colour-values) directly. Their system is compositional, yet does not use a grammar (the compositions are defined by hand). Further, the groundings are learned from pairings of object references in NL and images, rather than from dialogue.
What sets our approach apart from others is: a) we use a domain-general, incremental semantic grammar with principled mechanisms for parsing and generation; b) given the DS model of dialogue [Eshghi.etal15], representations are constructed jointly and interactively by the tutor and the system over the course of several turns (see Fig. 1); c) perception and NL semantics are modelled in a single logical formalism (TTR); d) we effectively induce an ontology of atomic types in TTR, which can be combined in arbitrarily complex ways to generate complex descriptions of arbitrarily complex visual scenes (see e.g. [Dobnik.etal12], and compare with [Kennington.Schlangen15], who do not use a grammar and therefore have no logical structure over grounded meanings).
3 System Architecture
We have developed a system to support an attribute-based object learning process through natural, incremental spoken dialogue interaction. The architecture of the system is shown in Fig. 2. The system has two main modules: a vision module for visual feature extraction, classification, and learning; and a dialogue system module using DS-TTR. Below we describe these components individually and then explain how they interact.
3.1 Attribute-based Classifiers used
Yu et al. yu-eshghi-lemon:2015:VL,yu-eshghi-lemon:2015:semdial point out that neither multi-label classification models nor ‘zero-shot’ learning models show acceptable performance on attribute-based learning tasks. Here, we instead use logistic regression classifiers, trained incrementally via Stochastic Gradient Descent (SGD) [zhang2004solving], to learn attribute predictions.
Each classifier outputs an attribute label and a corresponding probability for novel, unseen images by predicting a binary label vector. We build visual feature representations from which the classifiers for particular attributes are learned, as explained in the following subsections.
3.1.1 Visual Feature Representation
In contrast to previous work [yu-eshghi-lemon:2015:VL, yu-eshghi-lemon:2015:semdial], and to reduce feature noise during learning, we simplify feature extraction to two base feature categories: HSV colour space values for colour attributes, and a ‘bag of visual words’ for object shape/class.
Colour descriptors, consisting of HSV colour space values, are extracted for each pixel inside the bounding box, quantized into an HSV matrix, and binned into individual histograms. Meanwhile, a bag of visual words is built from PHOW descriptors using a visual dictionary (pre-defined with a handmade image set). These visual words are computed over 2x2 blocks with a 4-pixel step size, and quantized into 1024 k-means centres. The feature extractor in the vision module thus produces a 1280-dimensional feature vector for each training/test instance by stacking all quantized features, as shown in Figure 2.
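The histogram-and-stack part of this pipeline can be sketched in a few lines. This is a simplified stand-in, not the paper's extractor: the HSV bin counts (16x4x4 = 256) are an assumption chosen only so that the stacked vector reaches the stated 1280 dimensions alongside the 1024 visual-word centres, and the PHOW/bag-of-visual-words histogram is taken as given rather than computed.

```python
# Sketch: quantise per-pixel HSV values into a fixed-bin histogram, then
# stack it with a (precomputed) bag-of-visual-words histogram.
import numpy as np


def hsv_histogram(hsv_pixels, bins=(16, 4, 4)):
    """hsv_pixels: (N, 3) array of H, S, V values in [0, 1] for the pixels
    inside the object's bounding box. Returns a flattened, L1-normalised
    3-D histogram (16*4*4 = 256 dims with the default binning)."""
    hist, _ = np.histogramdd(hsv_pixels, bins=bins,
                             range=[(0, 1), (0, 1), (0, 1)])
    hist = hist.ravel()
    return hist / max(hist.sum(), 1e-9)


def stack_features(colour_hist, visual_word_hist):
    """Stack the colour histogram with the visual-word histogram into a
    single feature vector, e.g. 256 + 1024 = 1280 dims."""
    return np.concatenate([colour_hist, visual_word_hist])
```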
3.2 Dynamic Syntax and Type Theory with Records
Dynamic Syntax (DS) is a word-by-word incremental semantic parser/generator, based on the Dynamic Syntax grammar framework [Cann.etal05a], especially suited to the fragmentary and highly contextual nature of dialogue. In DS, dialogue is modelled as the interactive and incremental construction of contextual and semantic representations [Eshghi.etal15]. The contextual representations afforded by DS capture the fine-grained semantic content that is jointly negotiated/agreed upon by the interlocutors, as a result of processing questions and answers, clarification requests, corrections, acceptances, etc. For lack of space we cannot go into further detail here, but proceed to introduce Type Theory with Records: the formalism in which DS contextual/semantic representations are couched, and within which perception is also modelled here.
Type Theory with Records (TTR)
is an extension of standard type theory shown to be useful in semantics and dialogue modelling [Cooper05, Ginzburg12]. TTR is particularly well-suited to our problem here as it allows information from various modalities, including vision and language, to be represented within a single semantic framework (see e.g. Larsson Larsson13; Dobnik et al. Dobnik.etal12 who use it to model the semantics of spatial language and perceptual classification).
In TTR, logical forms are specified as record types (RTs), which are sequences of fields of the form [l : T], containing a label l and a type T. RTs can be witnessed (i.e. judged true) by records of that type, where a record is a sequence of label-value pairs [l = v]. We say that [l = v] is of type [l : T] just in case v is of type T.
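The witnessing relation just defined can be made concrete with a minimal toy encoding. This is an illustrative sketch, not part of the system: record types are encoded as mappings from labels to type-checking predicates, records as mappings from labels to values, and the example labels and types (an individual x and a proof field p for red(x)) are invented for the illustration.

```python
# Toy encoding of TTR record-type checking: a record witnesses a record
# type iff every field label in the type is present in the record and the
# labelled value satisfies the corresponding type predicate.

def witnesses(record, record_type):
    """True iff `record` is of type `record_type`."""
    return all(label in record and is_type(record[label])
               for label, is_type in record_type.items())


# Example record type, roughly: [ x : Ind, p : red(x) ]
red_thing = {
    'x': lambda v: isinstance(v, str),  # an individual, named by a string
    'p': lambda v: v is True,           # a 'proof' that red(x) holds
}

r1 = {'x': 'obj1', 'p': True}   # witnesses the type
r2 = {'x': 'obj2', 'p': False}  # does not: the proof field fails
```

In the system itself, the predicates grounding fields like p are exactly the attribute classifiers of section 3.1, which return probabilities rather than crisp truth values.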