A striking feature of the human brain is its ability to associate abstract concepts with sensory input signals, such as visual and audio signals. As a result of this multimodal association, a concept can be located in a representation of one modality (visual) and associated with a representation of a different modality (audio), and vice versa. For example, the abstract concept “ball” from the sentence “John plays with a ball” can be associated with several instances of different spherical shapes (visual input) and sound waves (audio input). Several fields, such as Neuroscience, Psychology, and Artificial Intelligence, are interested in determining all the factors involved in binding semantic concepts to the physical world. This scenario is known as the Symbol Grounding Problem [Harnad, 1990] and is still an open problem [Steels, 2008].
With this in mind, infants start learning this association while acquiring language in a multimodal scenario. Gershkoff-Stowe and Smith (2004) found that the initial set of words in infants consists mainly of nouns, such as dad, mom, cat, and dog. In contrast, language development can be limited by a lack of stimulus [Andersen et al., 1993; Spencer, 2000], e.g., deafness or blindness. Asano et al. found two different patterns in the brain activity of infants depending on the semantic correctness between a visual and an audio stimulus. In simpler terms, the brain activity shows pattern ‘A’ if the visual and audio signals represent the same semantic concept, and pattern ‘B’ otherwise.
Related work inspired by the Symbol Grounding Problem has been proposed for different multimodal scenarios. Yu and Ballard (2004) explored a framework that learns the association between objects and their spoken names in day-to-day tasks. Nakamura et al. (2011) introduced a multimodal categorization applied to robotics. Their framework exploited the relation of concepts in different modalities (visual, audio, and haptic) using a multimodal latent Dirichlet allocation. These previous approaches have focused on feature engineering, and they treat the segmentation and classification tasks as independent modules.
This paper focuses on a model for the segmentation and classification tasks in the object-word association scenario. Moreover, we are interested in multimodal sequences that represent a semantic concept sequence under the constraint that not all elements are present in both modalities. For instance, one modality sequence (a text line of digits) is represented by ‘2 4 5’, and the other modality (spoken words) is represented by ‘four six five’. In addition, the association problem differs from the traditional setup, where the association is fixed via a pre-defined coding scheme of the classes (e.g., a 1-of-K scheme) before training. We explain the difference between common approaches for multimodal machine learning and our problem setup in Section 1.1.
In this work, we investigate the benefits of exploiting the alignment between the elements that are common to both sequences while the two networks still agree on a similar representation via the coding scheme. Note that our work is an extension of Raue et al. (2015), where both modalities represent the same semantic sequence (no missing elements). As in Raue et al. (2015), the model is implemented by two Long Short-Term Memory (LSTM) networks whose output vectors are aligned in the time axis using Dynamic Time Warping (DTW) [Berndt and Clifford, 1994]. However, in this work, the modalities may have missing elements. Our contributions in this paper are the following:
We propose a novel model for the cognitive multimodal association task. Moreover, our model handles multimodal sequences where the semantic concepts can be in one or both modalities. In addition, a max pooling operation in the time-axis is added to the architecture for exploiting the cross-modality of the shared semantic concepts.
We evaluate the presented model in two scenarios. In the first scenario, the missing semantic concepts can be in any modality. In the second scenario, the semantic concepts are missing in only one modality. For example, consider the visual sequence ‘1 2 3 4 5 6’ and the audio sequence ‘two four’. The semantic concepts in the audio modality are shared with the visual sequence. In contrast, some semantic concepts in the visual sequence are missing from the audio sequence. In both cases, our model performs better than the model proposed by Raue et al. (2015).
This paper is organized as follows. Section 2 briefly describes Long Short-Term Memory networks, mainly for sequence classification of unsegmented inputs. The original end-to-end model for object-word association is explained in Section 3. The extension for handling missing elements is presented in Section 4. A generated dataset of multimodal sequences with missing elements is described in Section 5. In Section 6, we compare the performance of the proposed extension against the original model and against a single LSTM network trained on one modality in the traditional setup (pre-defined coding scheme).
1.1 Multimodal Tasks in Machine Learning
Machine Learning has been applied successfully to several scenarios where multimodal relation between input samples is exploited. In the following, we want to indicate the differences between previous multimodal tasks and our work.
Multimodal Feature Fusion:
The task is to combine features of different modalities in order to create a better feature. In this manner, the generated feature exploits the best qualities of each modality. Recently, Deep Boltzmann Machines have been used to learn how to combine different modalities in an unsupervised setting [Srivastava and Salakhutdinov, 2012; Sohn et al., 2014].
Image Captioning:
The task is to generate a textual description given images as input. This can be seen as a machine translation from images to captions. Convolutional Neural Networks (CNNs) in combination with LSTMs have already been applied in this cross-modality scenario, which translates images to textual captions [Vinyals et al., 2014; Karpathy et al., 2014].
New Task - Cognitive Object-Word Association: In this work, we are interested in the cross-modality association between objects (visual) and words (audio). Moreover, our scenario is motivated by symbol grounding. With this in mind, two definitions are introduced for explanation purposes: semantic concepts (SeC) and symbolic features (SyF). We explain these definitions through the following example. A classification problem is usually defined by an input that is associated with a class (a semantic concept), and this class is represented by a vector (a symbolic feature). That relation is fixed and is chosen externally (outside of the network). In contrast, the presented task treats the relation between the class and its corresponding vector representation as a learned parameter. In the end, the model not only learns to associate objects and words but also learns the symbolic structure between the semantic concepts and the symbolic features. Figure 1 shows examples of each component and the differences between the traditional setup and our task. It can be seen that our task involves three components, whereas the traditional setup uses only two (red box).
2 Long Short-Term Memory (LSTM)
where $x_t$ is the input vector at time $t$, and $W$ and $b$ are the weight matrices and bias, respectively.
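For orientation, one widely used formulation of an LSTM memory block (without peephole connections) is sketched here. It is an assumption given for illustration and may differ in detail from the variant used in this work. The gate names $i_t$, $f_t$, $o_t$, the cell state $c_t$, the hidden state $h_t$, and the per-gate subscripts on $W$ and $b$ are notation introduced only for this sketch.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i [x_t; h_{t-1}] + b_i\right) \\
f_t &= \sigma\left(W_f [x_t; h_{t-1}] + b_f\right) \\
o_t &= \sigma\left(W_o [x_t; h_{t-1}] + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c [x_t; h_{t-1}] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Here $\sigma$ is the logistic sigmoid, $\odot$ is the element-wise product, and $[x_t; h_{t-1}]$ denotes the concatenation of the current input with the previous hidden state.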
LSTM has been successfully applied to several scenarios, such as image captioning [Karpathy et al., 2014], texture classification [Byeon et al., 2015], and machine translation [Sutskever et al., 2014]. In this work, we are interested in one particular capability: LSTM can perform sequence classification on unsegmented input samples. This has been applied mainly to one-dimensional tasks, e.g., speech recognition [Graves et al., 2006] and OCR [Breuel et al., 2013]. In more detail, Graves et al. (2006) introduced Connectionist Temporal Classification (CTC). Their idea was to add an extra class (the blank class, $b$) to the set of classes for aligning two sequences. One sequence consists of the LSTM output vectors, and the other sequence is obtained by a forward-backward propagation of probabilities similar to Hidden Markov Models (HMMs) [Rabiner, 1989].
The CTC-forward-backward step requires two recursive variables, forward ($\alpha$) and backward ($\beta$), for generating the target vector $z_t$. Similar to HMM training, the motivation is to exploit the past context and the future context at time $t$. The LSTM is trained by Backpropagation Through Time (BPTT) [Werbos, 1990], and the loss function is defined by
The final step of the CTC training is to predict the label sequence given an unknown input sequence. This step is called decoding, and two methods have been proposed: Best Path Decoding and Prefix Search Decoding. Please refer to the original paper for more information [Graves et al., 2006].
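As an illustration of the simpler of the two decoding methods, the following is a minimal sketch of best path decoding: pick the most likely class at each timestep, merge consecutive repeats, and drop blanks. The function name, the choice of index 0 for the blank class, and the toy scores are assumptions made for this example.

```python
from itertools import groupby

BLANK = 0  # assumed index of the blank class (b)


def best_path_decode(frame_scores, blank=BLANK):
    """Greedy CTC decoding: argmax per frame, merge repeats, remove blanks."""
    # Most likely class index at every timestep.
    path = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    # groupby collapses runs of identical labels; blanks are then discarded.
    return [label for label, _ in groupby(path) if label != blank]


# Toy output sequence with three classes (0 = blank, 1 and 2 = labels).
scores = [
    [0.10, 0.80, 0.10],  # class 1
    [0.20, 0.70, 0.10],  # class 1 again (merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.60, 0.30],  # class 1 (kept, because a blank separates it)
    [0.10, 0.20, 0.70],  # class 2
]
decoded = best_path_decode(scores)
print(decoded)
```

The blank between the second and fourth frames is what allows the same label to appear twice in a row in the decoded sequence.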
3 Multimodal Symbolic Association
In Section 1, we mentioned that this work is an extension of Raue et al. (2015). They introduced a Symbolic Association scenario in which their model learns to associate multimodal sequences and learns the semantic binding between Semantic Concepts (SeC) and their vectorial representations (SyF). The initial assumption is that two different modalities (visual and audio) represent exactly the same semantic sequence (a semantic sequence is a set of semantic concepts). In other words, there is a one-to-one relation between the representations of the same semantic concepts in both modalities. For example, the semantic concept sequence ‘1 2 3’ is represented visually by a text line of digits and acoustically by the waveforms of the spoken words ‘one two three’. Furthermore, the model is implemented using two parallel LSTM networks and an EM-style training rule. The model exploits recent results of LSTM in segmentation and classification of one-dimensional sequences, and the EM-style training rule is applied for learning the agreement under the proposed constraints.
The multimodal scenario is defined by two parallel sequences (visual and audio), and each modality represents the same semantic sequence. More formally, $x_v$ and $x_a$ are the input vectors for the visual and audio modalities, respectively. In addition, the semantic sequence $s_1, \ldots, s_k$ is a sequence of semantic concepts, where $k$ is the number of semantic concepts in the sequence, and each semantic concept is selected from a vocabulary $V$ of size $C$. Moreover, one bidirectional LSTM is defined for each modality, $LSTM_v$ and $LSTM_a$, each with a fixed output vector size.
Initially, each input sequence, $x_v$ and $x_a$, is fed to its respective network, $LSTM_v$ and $LSTM_a$. Each output, $y_v$ and $y_a$, is then post-processed to find the most likely symbolic feature for each semantic concept in the sequence. For instance, the semantic concept ‘duck’ can be represented by the index ‘4’ (via a one-hot coding vector). With this in mind, a set of weighting concept vectors ($\gamma_c$ where $c \in \{1, \ldots, C\}$) is assigned to each LSTM (more details in Section 3.1). Afterwards, the LSTM output of each modality, in combination with the representation found, is fed to the CTC-forward-backward step (cf. CTC layer in Section 2). Up to this step, both LSTM networks have been forward propagated independently in each modality. To exploit the multimodal sequence, the outputs of both CTC-forward-backward steps ($z_v$ and $z_a$) are aligned with each other in the time axis by Dynamic Time Warping (DTW) [Berndt and Clifford, 1994] (more details in Section 3.2). Therefore, the latent variables of one modality can be used to train the other modality, and vice versa. Figure 2 illustrates the training algorithm with an example.
3.1 Statistical Constraint for Semantic Binding
In this symbolic association scenario, one important constraint is that the semantic concepts are not bound to vectorial representations before training. As mentioned, the vocabulary $V$ is a set of $C$ semantic concepts. With this in mind, a set of concept vectors $\gamma_1, \ldots, \gamma_C$ is defined for learning the mapping between the semantic concepts and the output vectors. Note that two or more concepts cannot have the same representation. This component is trained with an EM-style algorithm. For explanation purposes, it is described considering only one LSTM. However, it can be applied to the two LSTM networks independently.
E-step predicts the mapping between the semantic concepts in the sequence and the symbolic representation, given the LSTM output and the concept vectors. This is defined by
$$\hat{z}_c = \frac{1}{T} \sum_{t=1}^{T} \mathrm{power}(y_t, \gamma_c)$$
where $y_t$ is the LSTM output vector at time $t$, $\gamma_c$ is the concept vector, $T$ is the number of timesteps of the sequence, and $\mathrm{power}(y_t, \gamma_c)$ is the element-wise power operation between $y_t$ and $\gamma_c$. Then, a matrix $\hat{Z}$ is assembled by concatenating $\hat{z}_1, \ldots, \hat{z}_C$. This matrix can be used for determining the mapping between each semantic concept and its representation
$$e_1, \ldots, e_C = g(\hat{Z})$$
where $g$ is a row-column elimination, and $e_1, \ldots, e_C$ are column vectors of a permutation of the identity matrix. For simplicity, the column vector $e_i$ can represent the $j$-th identity vector, where $i$ and $j$ may or may not be the same. In other words, a column vector $e_i$ with $i \neq 1$ can represent the 1-st identity vector (e.g., $[1, 0, \ldots, 0]^T$). The row-column elimination procedure ranks all values in the matrix. Next, the position (col, row) of the maximum value is found, and it determines the row-th identity column vector $e_{col}$. For example, if the maximum value is found at (2, 5), its corresponding vector is $e_2 = [0, 0, 0, 0, 1, 0, \ldots, 0]^T$. Finally, all values of that column and that row are set to zero. This row-column elimination is applied $C$ times. As a result, the vectors $e_1, \ldots, e_C$ are the mapping between the semantic concepts (columns) and their representations (rows).
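The row-column elimination can be sketched as a greedy one-to-one assignment. The code below is a minimal illustration under stated assumptions: the score matrix is indexed as `matrix[row][col]`, rows stand for output representations and columns for semantic concepts, and the function name and toy values are hypothetical. Instead of overwriting entries with zero, it tracks used rows and columns, which selects the same entries when all scores are positive.

```python
def row_column_elimination(matrix):
    """Greedily map each column (concept) to a distinct row (representation).

    Returns a dict {column_index: row_index}.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    used_rows, used_cols = set(), set()
    mapping = {}
    for _ in range(min(n_rows, n_cols)):
        # All entries whose row and column have not been eliminated yet.
        candidates = [
            (matrix[r][c], r, c)
            for r in range(n_rows)
            for c in range(n_cols)
            if r not in used_rows and c not in used_cols
        ]
        # Take the largest remaining score and fix that (row, column) pair.
        _, row, col = max(candidates)
        mapping[col] = row
        # Eliminate the chosen row and column from further consideration.
        used_rows.add(row)
        used_cols.add(col)
    return mapping


# Toy 3x3 score matrix: rows are output units, columns are concepts.
scores = [
    [0.1, 0.2, 0.8],
    [0.9, 0.7, 0.1],
    [0.6, 0.3, 0.2],
]
mapping = row_column_elimination(scores)
print(mapping)
```

In this toy case concept 1 scores highest on row 1, but that row is already taken by concept 0, so concept 1 falls back to row 2. This is how the procedure enforces that no two concepts share a representation.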
M-step updates the concept vectors given the LSTM output and the target statistical distribution. Hence, the cost function is defined by
where $e_c$ is a column vector of the identity matrix that represents the semantic concept $c$, $\eta$ is the learning rate, and $\nabla_{\gamma_c}$ is the derivative w.r.t. $\gamma_c$.
3.2 Dynamic Time Warping (DTW)
This module exploits the fact that both modalities represent the same semantic concepts. In addition, this component converts from one modality to the other in the time axis, which is possible because of the monotonic behavior of LSTM networks in one dimension. To this end, both output sequences of the CTC-forward-backward training are aligned against each other based on Dynamic Time Warping (DTW) [Berndt and Clifford, 1994]. The DTW matrix is calculated with the following path
$$DTW[i, j] = dist[i, j] + \min\left(DTW[i-1, j-1],\ DTW[i-1, j],\ DTW[i, j-1]\right)$$
where $dist[i, j]$ is the Euclidean distance between the output vectors at timestep $i$ of the visual sequence and at timestep $j$ of the audio sequence. Afterwards, the loss function of one modality can use the other modality as a target, and vice versa. This is defined by
where $y_{v,t_1}$ and $y_{a,t_2}$ are the LSTM output vectors at times $t_1$ and $t_2$, $z_v$ and $z_a$ are the outputs of the CTC-forward-backward steps, and $f$ is the alignment function that maps a time of one modality to a time of the other modality via the DTW path.
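The alignment step can be illustrated with a textbook DTW implementation. This is a minimal sketch assuming the standard three-way recursion with Euclidean distance. The function names and the toy one-dimensional sequences are hypothetical and are not taken from the model's implementation.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_path(seq_a, seq_b):
    """Return (total_cost, path), where path is a list of aligned (i, j) pairs."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] is the best accumulated cost of aligning seq_a[:i] with seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Two toy sequences of 1-d vectors; the first one repeats its middle element.
a = [[0.0], [1.0], [1.0], [2.0]]
b = [[0.0], [1.0], [2.0]]
total, path = dtw_path(a, b)
print(total, path)
```

The returned path plays the role of the alignment function: each pair `(i, j)` says which timestep of one sequence is matched with which timestep of the other.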
4 Handling Missing Elements
In this paper, we are interested in the multimodal association inspired by the symbol grounding problem for the case in which the multimodal sequences share some, but not all, of their semantic concepts. Hence, we have updated the problem definition (cf. Section 3). Consider two sequences of different modalities, $x_{v,t_1}$ and $x_{a,t_2}$, where $v$ and $a$ denote the visual and audio modalities, and $t_1$ and $t_2$ denote the timesteps of each sequence. In addition, each input sequence is associated with a semantic sequence, $s_v$ and $s_a$ respectively. These semantic sequences have some semantic concepts that are shared between modalities. In other words, there is no one-to-one relation as in the original model. We want to exploit the shared semantic concepts via max pooling. Our contribution is to find the best representation for updating the LSTM weights, which mainly concerns the alignment in the time axis (Section 3.2). As a consequence, a new loss function is proposed
where $\max$ is the element-wise maximum operation. The intuition is to give each shared semantic concept the best representation in the multimodal sequence. In contrast, the non-shared elements keep the distribution obtained from the CTC-forward-backward step.
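The effect of the element-wise maximum can be illustrated with a small sketch. It assumes that each modality has a sequence of target vectors and that an alignment path of `(t_own, t_other)` pairs is already available (for instance from DTW). The function name, the variable names, and the toy numbers are hypothetical, and the sketch covers only the fusion of targets, not the full loss.

```python
def fuse_targets(target_own, target_other, path):
    """Element-wise max between own targets and the aligned targets of the other modality.

    `path` is a list of (t_own, t_other) index pairs. If several timesteps of
    the other modality align with the same own timestep, the max runs over all.
    """
    fused = [vec[:] for vec in target_own]  # copy so the inputs stay untouched
    for t_own, t_other in path:
        fused[t_own] = [max(a, b) for a, b in zip(fused[t_own], target_other[t_other])]
    return fused


# Two timesteps, three classes per timestep.
z_own = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
z_other = [[0.9, 0.05, 0.05], [0.3, 0.3, 0.4]]
alignment = [(0, 0), (1, 1)]
fused = fuse_targets(z_own, z_other, alignment)
print(fused)
```

At the first timestep the other modality is more confident about class 0, so its value replaces the own one. At the second timestep the own value for class 2 is already the largest and is kept.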
5 Experimental Design
We generated several multimodal datasets in which elements of the sequence are missing in one or both modalities, but the relative order of the elements is the same. For example, a visual semantic concept sequence can be represented by a text line of digits “2 4 7”, and an audio semantic concept sequence can be represented by “two seven”. In this case, we assumed a simplified scenario of symbol grounding, where the continuity of semantic concepts is different in each modality. The visual component is a horizontal arrangement of isolated objects, and the audio component consists of spoken semantic concepts for some elements of the visual component, and vice versa. We want to point out that the visual component is similar to a panorama view. The procedure for generating the multimodal datasets is explained below.
Generating Semantic Sequences: Two scenarios are considered for generating the semantic sequences for each modality: missing elements in both modalities and missing elements in one modality. For the first scenario, we generated a sequence of ten semantic concepts. Then, we randomly removed between zero and five elements from the sequence of each modality. As a result, two different sequences for two different modalities were obtained, with only some elements in common. For the second scenario, we followed a similar procedure. In that case, one sequence always has ten semantic concepts and the other sequence has missing elements. In addition, our vocabulary has 30 semantic concepts in Spanish: oso, bote, botella, bol, caja, carro, gato, queso, cigarrillo, gaseosa, bebida, pato, cara, comida, hamburguesa, higiene, liquido, loción, cebolla, pimentón, pera, redondo, sanduche, cuchara, té, teléfono, tomate, florero, vehículo, madera.
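The generation procedure for the semantic sequences can be sketched as follows. This is an illustrative reconstruction, not the script used for the experiments: the function names, the fixed seed, and the choice to sample concepts with replacement are assumptions.

```python
import random

VOCAB = [
    "oso", "bote", "botella", "bol", "caja", "carro", "gato", "queso",
    "cigarrillo", "gaseosa", "bebida", "pato", "cara", "comida",
    "hamburguesa", "higiene", "liquido", "loción", "cebolla", "pimentón",
    "pera", "redondo", "sanduche", "cuchara", "té", "teléfono", "tomate",
    "florero", "vehículo", "madera",
]


def drop_elements(sequence, rng, max_missing=5):
    """Remove between 0 and max_missing random elements, keeping the order."""
    n_missing = rng.randint(0, max_missing)
    removed = set(rng.sample(range(len(sequence)), n_missing))
    return [c for i, c in enumerate(sequence) if i not in removed]


def make_pair(rng, length=10, missing_in_both=True):
    """Return (visual, audio) semantic sequences derived from one base sequence."""
    # Assumption: concepts are drawn with replacement from the vocabulary.
    base = [rng.choice(VOCAB) for _ in range(length)]
    visual = drop_elements(base, rng) if missing_in_both else list(base)
    audio = drop_elements(base, rng)
    return visual, audio


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
visual_both, audio_both = make_pair(rng, missing_in_both=True)
visual_full, audio_partial = make_pair(rng, missing_in_both=False)
print(visual_both, audio_both)
print(visual_full, audio_partial)
```

Because elements are only removed and never reordered, the relative order of the shared concepts is the same in both modalities, which is the property the alignment step relies on.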
Visual Component: We used a subset of 30 objects from COIL-100 [Nene et al., 1996], which is a standard dataset of 100 isolated objects. Each isolated object has 72 views at different angles. After selecting the objects for the sequence, each object was converted to grayscale and rescaled to 32 x 32 pixels. Then, one random view of the isolated object was selected for each semantic concept in the sequence, and all objects were stacked horizontally. In addition, random noise was added to the final image. The odd angles were used for training, and the even angles were used for testing.
Audio Component: We recorded each semantic concept two times from twelve different subjects who are Spanish native speakers (five female and seven male speakers). Afterwards, the isolated semantic concepts were concatenated for creating audio sequences. The voices were divided into nine voices for training (four females and five males) and three voices for testing (one female and two males).
Training and Testing Multimodal Datasets: We generated three different datasets for evaluating our model. The first dataset has missing elements in both modalities. In the second dataset, the visual component has ten semantic concepts, and the audio modality has a fixed number of missing elements. In other words, all audio sequences have exactly one missing element compared with the visual component. The third dataset follows the same idea as the second dataset. In this case, the audio sequence has ten semantic concepts, and the visual component has a fixed number of missing elements. All three multimodal datasets have 50,000 sequences for training and 30,000 sequences for testing. One example of the dataset is shown in Figure 3.
5.2 Input Features and LSTM setup
We did not apply any pre-processing step to the visual component. In contrast, the audio component was converted to Mel-Frequency Cepstral Coefficients (MFCC) using the HTK toolkit (http://htk.eng.cam.ac.uk). The audio representation is a vector of 123 components: a Fourier filter-bank with 40 coefficients (plus energy), together with the first and second derivatives, i.e., (40 + 1) × 3 = 123. Afterwards, all audio and visual components were normalized to have zero mean and unit standard deviation.
In addition, the proposed extension was compared against the original model in [Raue et al., 2015]. We also compared the extension against an LSTM with a CTC layer and a pre-defined coding scheme. The parameters of the visual LSTM were: 40 memory cells, learning rate 1e-4, and momentum 0.9. The audio LSTM had 100 memory cells, and its learning rate and momentum were the same as in the visual LSTM. In addition, the learning rate in the statistical constraint was set to 0.001.
6 Results and Discussion
As mentioned previously, the assumption of the original model was that both modalities represent the same semantic concept sequence. In other words, a one-to-one relation exists between modalities. In contrast, in this work, our assumption is more challenging because a semantic concept in one modality may or may not be present in the other modality. We evaluate the multimodal association task using Association Accuracy (AAcc), which is defined by the following equation
where $LCS$ is the length of the longest common sequence, $\hat{s}_v$ and $\hat{s}_a$ are the output classifications of each modality, $s_v$ and $s_a$ are the ground-truth labels of each modality, and $N$ is the number of elements in the dataset. In other words, we are evaluating the association between the common elements. Our model not only learns the association but also learns to classify each modality. With this in mind, we also report the Label Error Rate (LER) as a performance metric, which is defined by
where $\hat{s}$ is the output classification, $s$ is the ground-truth, and $ED$ is the edit distance between the output classification and the ground-truth. In addition, we randomly selected 10,000 sequences from the training set and 3,000 sequences from the testing set. We repeated this selection five times and report the average results.
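The two building blocks named above, the edit distance and the length of the longest common sequence, can be sketched with standard dynamic programming. The function names are hypothetical, and the sketch deliberately stops short of the normalization into LER and AAcc, which is not assumed here.

```python
def edit_distance(output, truth):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(truth) + 1))
    for i, x in enumerate(output, 1):
        cur = [i]
        for j, y in enumerate(truth, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def lcs_length(seq_a, seq_b):
    """Length of the longest common (not necessarily contiguous) subsequence."""
    prev = [0] * (len(seq_b) + 1)
    for x in seq_a:
        cur = [0]
        for j, y in enumerate(seq_b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


# Concepts from the example in Section 5: visual "2 4 7", audio "two seven".
visual = [2, 4, 7]
audio = [2, 7]
common = lcs_length(visual, audio)
distance = edit_distance(visual, audio)
print(common, distance)
```

For this pair, the two shared concepts give a common length of 2, and one deletion turns the visual sequence into the audio one.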
| Model | Association | Label Error Rate (%) |
| LSTM + CTC (baseline) | - | |
| Original Model (Raue et al., 2015) | | |
Table 1 summarizes the performance of the LSTM trained with a pre-defined coding scheme, the original model, and the presented extension. These results can be read in two parts. First, the proposed extension handles missing elements in multimodal sequences better than the original model. It can be inferred that the max operation keeps the strongest representation of the common semantic concepts between modalities. Note that these representations are used for updating the weights in the backward step. Second, the proposed extension reaches results similar to those of the standard LSTM. In this case, the LSTM was trained on each modality independently. As a reminder, we mentioned two setups for classification tasks: the traditional setup and the setup used in this work. We want to point out that the visual LSTM boosts the performance on audio sequences compared to the standard LSTM. As a result, our model reaches a lower Label Error Rate on the audio sequences than the standard LSTM trained only on audio sequences.
Another outcome of this work is the conformity of the symbolic structure in both modalities, even with missing elements. Figure 4 shows examples of the coding scheme agreement. It can be observed that both LSTM networks learn to segment and classify the object-word relation in unsegmented multimodal sequences. Moreover, the common concepts in both modalities are represented by a similar symbolic feature and located at the right position in the sequence. For example, the semantic concept “redondo” (first element in the visual component and second element in the audio component) is represented by the index “27” in both modalities. (In some cases, one semantic concept is represented by two different coding vectors, one per network. However, both networks still correctly retrieve the same concept regardless of the different coding scheme.) Note that not only the common elements but also the missing elements are classified correctly.
In addition to the considerations made so far, we were also interested in the robustness of the presented model against the number of missing elements. Thus, we generated several datasets where one modality has ten semantic concepts and the other has a fixed number of missing elements from those ten semantic concepts. Figure 5 shows the Association Accuracy of the original model and of the presented model for handling missing elements. First, the performance of the original model (red dashed line) decreases when the number of missing elements increases, in both modalities. These results were expected because the original model relies on a one-to-one relation between modalities. Second, the presented model (blue solid line) shows a better performance than the original model in both modalities. Thus, we may conclude that the presented model does not reduce its performance even if 50% of the elements are missing in one of the modalities. In more detail, Figure 6 shows that the cross-modality learning reduces the Label Error Rate of the network applied to the audio modality.
In summary, we have presented a solution inspired by the symbol grounding problem for the object-word association problem. Additionally, the model relies on multimodal sequences (visual and audio) where the semantic elements can be present in one or both modalities. Further work is planned for more realistic scenarios where the visual component is not clearly segmentable. Moreover, we are interested in extending the word-association problem to a two-dimensional image and speech. With this in mind, we will incorporate a visual attention mechanism in synchronization with speech. Finally, human language development relies on how abstract concepts are associated with the real world through sensory input, and the scenario of the symbol grounding problem can be seen as simple. However, many questions still remain open [Needham et al., 2005; Steels, 2008].
References
- Andersen, E. S., Dunlea, A., Kekelis, L. 1993. The impact of input: language acquisition in the visually impaired. First Language, 13(37), 23–49.
- Berndt, D. J., Clifford, J. 1994. Using Dynamic Time Warping to Find Patterns in Time Series, 359–370.
- Breuel, T., Ul-Hasan, A., Al-Azawi, M., Shafait, F. 2013. High-performance OCR for printed English and Fraktur using LSTM networks. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, 683–687.
- Byeon, W., Breuel, T. M., Raue, F., Liwicki, M. 2015. Scene labeling with LSTM recurrent neural networks.
- Gershkoff-Stowe, L., Smith, L. B. 2004. Shape and the first hundred nouns. Child Development, 75(4), 1098–114.
- Graves, A., Fernández, S., Gomez, F., Schmidhuber, J. 2006. Connectionist temporal classification. In Proceedings of the 23rd International Conference on Machine Learning - ICML ’06, 369–376, New York, New York, USA. ACM Press.
- Harnad, S. 1990. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1), 335–346.
- Hochreiter, S. 1998. The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 06(02), 107–116.
- Hochreiter, S., Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
- Karpathy, A., Joulin, A., Fei-Fei, L. 2014. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems, 1889–1897.
- Nakamura, T., Araki, T., Nagai, T., Iwahashi, N. 2011. Grounding of word meanings in latent Dirichlet allocation-based multimodal concepts. Advanced Robotics, 25(17), 2189–2206.
- Needham, C. J., Santos, P. E., Magee, D. R., Devin, V., Hogg, D. C., Cohn, A. G. 2005. Protocols from perceptual observations. Artificial Intelligence, 167(1), 103–136.
- Nene, S., Nayar, S., Murase, H. 1996. Columbia Object Image Library (COIL-100).
- Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.
- Raue, F., Byeon, W., Breuel, T., Liwicki, M. 2015. Symbol Grounding in Multimodal Sequences using Recurrent Neural Network. In Workshop Cognitive Computation: Integrating Neural and Symbolic Approaches at NIPS 15.
- Sohn, K., Shang, W., Lee, H. 2014. Improved multimodal deep learning with variation of information. In Advances in Neural Information Processing Systems, 2141–2149.
- Spencer, P. E. 2000. Looking without listening: is audition a prerequisite for normal development of visual attention during infancy? Journal of Deaf Studies and Deaf Education, 5(4), 291–302.
- Srivastava, N., Salakhutdinov, R. R. 2012. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, 2222–2230.
- Steels, L. 2008. The symbol grounding problem has been solved, so what’s next? Symbols, Embodiment and Meaning. Oxford University Press, Oxford, UK, 223–244.
- Sutskever, I., Vinyals, O., Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112.
- Vinyals, O., Toshev, A., Bengio, S., Erhan, D. 2014. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555.
- Werbos, P. J. 1990. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.
- Yu, C., Ballard, D. H. 2004. A multimodal learning interface for grounding spoken language in sensory perceptions. ACM Transactions on Applied Perception (TAP), 1(1), 57–80.