Symbol Grounding Association in Multimodal Sequences with Missing Elements

In this paper, we extend a symbolic association framework so that it can handle missing elements in multimodal sequences. The general scope of the work is the symbolic association of object-word mappings as it happens in language development in infants. In other words, two different representations of the same abstract concepts can be associated in both directions. This scenario has long been of interest in Artificial Intelligence, Psychology, and Neuroscience. In this work, we extend a recent approach for multimodal sequences (visual and audio) to also cope with missing elements in one or both modalities. Our method uses two parallel Long Short-Term Memories (LSTMs) with a learning rule based on the EM algorithm. It aligns both LSTM outputs via Dynamic Time Warping (DTW). We propose to include an extra step that combines the aligned outputs with the max operation in order to exploit the common elements between both sequences. The motivation is that the combination acts as a condition selector for choosing the best representation from both LSTMs. We evaluated the proposed extension in the following scenarios: missing elements in one modality (visual or audio) and missing elements in both modalities (visual and audio). The proposed extension achieves better results than the original model and results similar to those of an individual LSTM trained on each modality.




1 Introduction

A striking feature of the human brain is its ability to associate abstract concepts with sensory input signals, such as visual and audio. As a result of this multimodal association, a concept can be located and associated from a representation of one modality (visual) to a representation of a different modality (audio), and vice versa. For example, the abstract concept “ball” from the sentence “John plays with a ball” can be associated with several instances of different spherical shapes (visual input) and sound waves (audio input). Several fields, such as Neuroscience, Psychology, and Artificial Intelligence, are interested in determining all factors that are involved in binding semantic concepts to the physical world. This scenario is known as the Symbol Grounding Problem [Harnad, 1990] and is still an open problem [Steels, 2008].

With this in mind, infants start learning the association when they are acquiring language in a multimodal scenario. Gershkoff-Stowe and Smith (2004) found that the initial set of words in infants consists mainly of nouns, such as dad, mom, cat, and dog. In contrast, language development can be limited by the lack of stimulus [Andersen et al., 1993; Spencer, 2000], as in deafness or blindness. Asano et al. found two different patterns in the brain activity of infants depending on the semantic correctness between a visual and an audio stimulus. In simpler terms, the brain activity is pattern ‘A’ if the visual and audio signals represent the same semantic concept. Otherwise, the pattern is ‘B’.

Related work has been proposed in different multimodal scenarios inspired by the Symbol Grounding Problem. Yu and Ballard (2004) explored a framework that learns the association between objects and their spoken names in day-to-day tasks. Nakamura et al. (2011) introduced a multimodal categorization applied to robotics. Their framework exploited the relation of concepts in different modalities (visual, audio, and haptic) using a multimodal latent Dirichlet allocation. Previous approaches have focused on feature engineering, and the segmentation and classification tasks are treated as independent modules.

This paper focuses on a model for the segmentation and classification tasks in the object-word association scenario. Moreover, we are interested in multimodal sequences that represent a semantic concept sequence with the constraint that not all elements are present in both modalities. For instance, one modality sequence (text lines of digits) is represented by ‘2 4 5’, and the other modality (spoken words) is represented by ‘four six five’. Also, the association problem differs from the traditional setup, in which the association is fixed via a pre-defined coding scheme of the classes (e.g., 1-of-K scheme) before training. We explain the difference between common approaches for multimodal machine learning and our problem setup in Section 1.1.


In this work, we investigate the benefits of exploiting the alignment between the elements that are common in the multimodal sequence while still agreeing on a similar representation via the coding scheme. Note that our work is an extension of Raue et al. (2015), where both modalities represent the same semantic sequence (no missing elements). Similarly to Raue et al. (2015), the model is implemented by two Long Short-Term Memories (LSTMs) whose output vectors are aligned in the time axis using Dynamic Time Warping (DTW) [Berndt & Clifford, 1994]. However, in this work, the modalities may have missing elements. Our contributions in this paper are the following:

  • We propose a novel model for the cognitive multimodal association task. Moreover, our model handles multimodal sequences where the semantic concepts can be in one or both modalities. In addition, a max pooling operation in the time-axis is added to the architecture for exploiting the cross-modality of the shared semantic concepts.

  • We evaluate the presented model in two scenarios. In the first scenario, the missing semantic concepts can be in any modality. In the second scenario, the semantic concepts are missing in only one modality. An example is the visual sequence ‘1 2 3 4 5 6’ and the audio sequence ‘two four’: the semantic concepts in the audio modality are shared with the visual sequence, whereas some semantic concepts in the visual sequence are missing in the audio sequence. In both cases, our model performs better than the model proposed by Raue et al. (2015).

This paper is organized as follows. We briefly describe Long Short-Term Memory networks, mainly sequence classification in unsegmented inputs, in Section 2. The original end-to-end model for object-word association is explained in Section 3. The proposed model for handling missing elements is presented in Section 4. A generated dataset of multimodal sequences with missing elements is described in Section 5. In Section 6, we compare the performance of the proposed extension against the original model and a single LSTM network trained on one modality in the traditional setup (pre-defined coding scheme).

Figure 1: Comparison of components between the traditional setup and our setup for associating two multimodal signals. Note that our task has an extra learnable component (relation between semantic concepts and their representations), whereas the traditional scenario is already pre-defined (red box). Moreover, the final goal is to agree on the same coding scheme for each modality.

1.1 Multimodal Tasks in Machine Learning

Machine Learning has been applied successfully to several scenarios where multimodal relation between input samples is exploited. In the following, we want to indicate the differences between previous multimodal tasks and our work.
Multimodal Feature Fusion:

The task is to combine features of different modalities for creating a better feature. In this manner, the generated feature exploits the best qualities of each modality. Recently, Deep Boltzmann Machines have been used to learn how to combine different modalities in an unsupervised setting [Srivastava & Salakhutdinov, 2012; Sohn et al., 2014].
Image Captioning:

The task is to generate a textual description given images as input. This can be seen as a machine translation from images to captions. Convolutional Neural Networks (CNNs) in combination with LSTMs have already been applied in this cross-modality scenario, which translates images to textual captions [Vinyals et al., 2014; Karpathy et al., 2014].
New Task - Cognitive Object-Word Association: In this work, we are interested in the cross-modality association between objects (visual) and words (audio). Moreover, our scenario is motivated by symbol grounding. With this in mind, two definitions are introduced for explanation purposes: semantic concepts (SeC) and symbolic features (SyF). We explain these definitions through the following example. A classification problem is usually defined by an input that is associated with a class (or a semantic concept), and this class is represented by a vector (or a symbolic feature). That relation is fixed and is chosen externally (outside of the network). In contrast, the presented task includes the relation between the class and its corresponding vector representation as a learned parameter. In the end, the model not only learns to associate objects and words but also learns the symbolic structure between the semantic concepts and the symbolic features. Figure 1 shows examples of each component and the differences between the traditional setup and our task. It can be seen that our task involves three components, whereas the traditional setup uses only two components (red box).

2 Long Short-Term Memory (LSTM)

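Long Short-Term Memory (LSTM) is a recurrent neural network that is capable of learning long sequences without the vanishing gradient problem [Hochreiter & Schmidhuber, 1997; Hochreiter, 1998]. This architecture adds the concept of gates and memory cells to Recurrent Neural Networks, which are defined by

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $x_t$ is the input vector at time $t$, $W$ and $b$ are the weight matrices and bias, respectively, and $i_t$, $f_t$, $o_t$, $c_t$, and $h_t$ are the input gate, forget gate, output gate, memory cell, and hidden state.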
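A single LSTM time step can be sketched in plain Python. This is a minimal sketch that assumes the common formulation with input, forget, and output gates and no peephole connections; the function names and the layout of `params` are illustrative assumptions and are not taken from the paper.

```python
import math


def sigmoid(v):
    """Logistic function used by the gates."""
    return 1.0 / (1.0 + math.exp(-v))


def affine(W_x, W_h, b, x, h_prev):
    """Compute W_x x + W_h h_prev + b with list-based matrices."""
    out = []
    for row_x, row_h, bias in zip(W_x, W_h, b):
        total = bias
        total += sum(w * xi for w, xi in zip(row_x, x))
        total += sum(w * hi for w, hi in zip(row_h, h_prev))
        out.append(total)
    return out


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step. `params` (hypothetical layout) maps each of
    'i', 'f', 'o', 'c' to a tuple (W_x, W_h, b)."""
    i = [sigmoid(v) for v in affine(*params["i"], x, h_prev)]  # input gate
    f = [sigmoid(v) for v in affine(*params["f"], x, h_prev)]  # forget gate
    o = [sigmoid(v) for v in affine(*params["o"], x, h_prev)]  # output gate
    g = [math.tanh(v) for v in affine(*params["c"], x, h_prev)]  # candidate
    # New memory cell: keep part of the old cell, add part of the candidate.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # New hidden state: gated, squashed memory cell.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate to 0, so the memory cell is simply halved at each step, which is a convenient sanity check.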

LSTM has been successfully applied to several scenarios, such as image captioning [Karpathy et al., 2014], texture classification [Byeon et al., 2015], and machine translation [Sutskever et al., 2014]. In this work, we are interested in a particular method that enables LSTM to perform sequence classification on unsegmented input samples. This has been applied mainly in one-dimensional tasks, e.g., speech recognition [Graves et al., 2006] and OCR [Breuel et al., 2013]. In more detail, Graves et al. (2006) introduced Connectionist Temporal Classification (CTC). Their idea was to add an extra class (the blank class b) to the set of classes for aligning two sequences. One sequence is the sequence of LSTM output vectors, and the other sequence is obtained by a forward-backward propagation of probabilities similar to Hidden Markov Models (HMM) [Rabiner, 1989].

The CTC-forward-backward step requires two recursive variables, forward ($\alpha$) and backward ($\beta$), for generating the target vector. Similar to HMM training, the motivation is to exploit the past context and future context at time $t$. An LSTM is trained by Backpropagation Through Time (BPTT) [Werbos, 1990], and the loss function is defined by


The final step of the CTC training is to predict the label sequence given an unknown input sequence. This step is called decoding, and two methods have been proposed: Best Path Decoding and Prefix Search Decoding. Please refer to the original paper for more information [Graves et al., 2006].
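Best Path Decoding can be sketched in a few lines: take the most active class at every timestep, merge consecutive repetitions, and drop the blank class. The sketch below assumes that the network output is a list of per-timestep score vectors and that the blank class has index 0; both the function name and these conventions are illustrative assumptions.

```python
def best_path_decode(outputs, blank=0):
    """Greedy CTC decoding of a list of per-timestep score vectors."""
    # Most active class at each timestep.
    path = [max(range(len(frame)), key=frame.__getitem__) for frame in outputs]
    decoded = []
    previous = None
    for label in path:
        # Keep a label only when it starts a new run and is not the blank.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded
```

A blank between two identical labels separates them, so the path `blank, 1, 1, blank, 1, 2, 2, blank` decodes to the label sequence `1, 1, 2`.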

3 Multimodal Symbolic Association

In Section 1, we mentioned that this work is an extension of Raue et al. (2015). They introduced a Symbolic Association scenario in which their model learns to associate multimodal sequences and learns the semantic binding between Semantic Concepts (SeC) and their vectorial representations (SyF). The initial assumption is that two different modalities (visual and audio) represent exactly the same semantic sequence (a semantic sequence is a set of semantic concepts). In other words, there is a one-to-one relation between the representations of the same semantic concepts in both modalities, i.e., visual and audio. For example, the semantic concept sequence ‘1 2 3’ is visually represented by a text line of digits and auditorily represented by the waveforms of the spoken words ‘one two three’. Furthermore, the model is implemented using two parallel LSTM networks and an EM-style training rule. The model exploits recent results of LSTM in segmentation and classification for one-dimensional sequences, and the EM training rule is applied for learning the agreement under the proposed constraints.

The multimodal scenario is defined by two parallel multimodal sequences (visual and audio), and each modality represents the same semantic sequence. More formally, and are the input vectors for visual and audio modalities, respectively. In addition, the semantic sequence is a sequence of semantic concepts where k is the number of semantic concepts in the sequence, and each semantic concepts is selected from a vocabulary where is the size of the vocabulary. Moreover, two bidirectional LSTM are defined for each modality and with output vector size .

Initially, each input sequence is fed to its respective LSTM. Consequently, the output of each LSTM is post-processed for finding the most likely symbolic features for each semantic concept in the sequence. For instance, the semantic concept ‘duck’ can be represented by the index ‘4’ (via a one-hot coding vector). With this in mind, two sets of weighting concept vectors are defined, one assigned to each LSTM (more details in Section 3.1). Afterwards, the LSTM output of each modality, in combination with the found representation, is fed to the CTC-forward-backward step (cf. CTC layer in Section 2). Up to this step, both LSTM networks have been forward propagated independently in each modality. To exploit the multimodal sequence, the CTC-forward-backward steps of both modalities are aligned with each other in the time axis by Dynamic Time Warping (DTW) [Berndt & Clifford, 1994] (more details in Section 3.2). Therefore, the latent variables of one modality can be used to train the other modality, and vice versa. Figure 2 illustrates the training algorithm with an example.

Figure 2: General overview of the original model proposed by Raue et al. (2015) and the contributions of this paper. In this work, we include a combination module that merges the forward-backward step from one LSTM and the aligned forward-backward step from the other LSTM.

3.1 Statistical Constraint for Semantic Binding

In this symbolic association scenario, one important constraint is that the semantic concepts are not bound to vectorial representations before training. As mentioned, the vocabulary is a set of semantic concepts. With this in mind, a set of concept vectors is defined for learning the mapping between the semantic concepts and the output vectors. Note that two or more concepts cannot have the same representation. This component is trained with an EM-style algorithm. For explanation purposes, it is described considering only one LSTM; however, it can be applied to the two LSTM networks independently.

E-step predicts the mapping between the semantic concepts in the sequence and the symbolic representation given the LSTM output and the concept vectors. This is defined by

$$\hat{z}_c = \frac{1}{T} \sum_{t=1}^{T} \mathrm{power}(y_t, \gamma_c)$$

where $y_t$ is the LSTM output vector at time $t$, $\gamma_c$ is the concept vector, $T$ is the number of timesteps of the sequence, and $\mathrm{power}$ is the element-wise power operation between $y_t$ and $\gamma_c$. Then, a matrix $\hat{Z}$ is assembled by concatenating $\hat{z}_1, \ldots, \hat{z}_C$. This matrix can be used for determining the mapping between each semantic concept and its representation

$$[e_1, \ldots, e_C] = g(\hat{Z})$$

where $g$ is a row-column elimination, and $e_1, \ldots, e_C$ are column vectors of a permutation of the identity matrix. For simplicity, the column vector $e_i$ can represent the j-th identity vector, where i and j may or may not be the same. In other words, a column vector $e_i$ with $i \neq 1$ can represent the 1-st identity vector (e.g., $[1, 0, \ldots, 0]^T$). The row-column elimination procedure ranks all values in the matrix. Next, the position (col, row) of the maximum value is found, which determines the row-th identity column vector for $e_{col}$. For example, if the maximum value is found at (2, 5), the corresponding vector $e_2$ is the 5-th identity vector. Finally, all values of the selected column and the selected row are set to zero. This column-row elimination is applied C times. As a result, the vectors $e_1, \ldots, e_C$ are the mapping between semantic concepts (columns) and their representations (rows).
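The row-column elimination described above can be sketched as a greedy one-to-one assignment. The code below is a minimal sketch under stated assumptions: the scores are stored as `scores[concept][representation]`, the score of a concept is taken as the time average of the element-wise power between the LSTM outputs and its concept vector (the averaging is an assumption), and already assigned rows and columns are tracked with sets instead of being overwritten with zeros. All function names are hypothetical.

```python
def e_step_scores(outputs, concept_vectors):
    """Time-averaged element-wise power y_t ** gamma_c for each concept c."""
    n_steps = len(outputs)
    scores = []
    for gamma in concept_vectors:
        acc = [0.0] * len(gamma)
        for y in outputs:
            for k, (y_k, g_k) in enumerate(zip(y, gamma)):
                acc[k] += y_k ** g_k
        scores.append([a / n_steps for a in acc])
    return scores  # scores[c][k]: concept c, representation k


def row_column_elimination(scores):
    """Greedily bind each concept to a distinct representation index."""
    free_concepts = set(range(len(scores)))
    free_reps = set(range(len(scores[0])))
    mapping = {}
    while free_concepts and free_reps:
        # Largest remaining value; sorted() keeps the scan order deterministic.
        c, k = max(
            ((c, k) for c in sorted(free_concepts) for k in sorted(free_reps)),
            key=lambda ck: scores[ck[0]][ck[1]],
        )
        mapping[c] = k           # concept c is represented by index k
        free_concepts.remove(c)  # eliminate the column of this concept
        free_reps.remove(k)      # eliminate the row of this representation
    return mapping
```

Because every chosen row and column is removed from further consideration, two concepts can never end up with the same representation, which is the constraint stated in Section 3.1.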

M-step updates the concept vectors given the LSTM output and the target statistical distribution. Hence, the cost function is defined by


where is a column vector of the identity matrix that represent the semantic concept, is the learning rate, and is the derivative w.r.t

3.2 Dynamic Time Warping (DTW)

This module exploits the fact that both modalities represent the same semantic concepts. In addition, this component converts from one modality to the other in the time axis, which is possible because of the monotonic behavior of LSTM networks in one dimension. To this end, the output sequences of the CTC-forward-backward training of both networks are aligned against each other based on Dynamic Time Warping (DTW) [Berndt & Clifford, 1994]. The DTW matrix is calculated with the following recurrence

$$DTW[i, j] = d(i, j) + \min\{DTW[i-1, j-1],\; DTW[i-1, j],\; DTW[i, j-1]\}$$

where $d(i, j)$ is the Euclidean distance between the output vectors at timestep $i$ of one network and at timestep $j$ of the other network. Afterwards, the loss function of one modality can use the other modality as a target, and vice versa. This is defined by


where and are LSTM output vector at time and , and are the CTC-forward-backward steps, and is the alignment function from time of one modality to time of the other modality via DTW path.
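The alignment step can be illustrated with a standard DTW implementation. The sketch below assumes the usual three-neighbor step pattern and a Euclidean local distance between per-timestep vectors; the function names are hypothetical, and the returned path of index pairs plays the role of the alignment function between the two modalities.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_path(seq_a, seq_b):
    """Return (total cost, path), where path is a list of 0-based
    (index in seq_a, index in seq_b) pairs. Both inputs must be non-empty."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # D has an extra border row and column so that D[0][0] starts the path.
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            D[i][j] = d + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    # Backtrack from the end to the start along the cheapest predecessors.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (D[i - 1][j - 1], i - 1, j - 1),
            (D[i - 1][j], i - 1, j),
            (D[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return D[n][m], path
```

The path is monotonic in both indices, so a timestep of one modality can be mapped to the timestep (or timesteps) of the other modality that it is paired with.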

4 Handling Missing Elements

In this paper, we are interested in multimodal association inspired by the symbol grounding problem for the case in which the multimodal sequences share some, but not all, of their semantic concepts. Hence, we have updated the problem definition (cf. Section 3). There are two input sequences, one for the visual modality and one for the audio modality, each with its own number of timesteps. In addition, each input sequence is associated with its own semantic sequence, and the two semantic sequences can differ. As a result, the semantic sequences have some semantic concepts that are shared between modalities. In other words, there is no one-to-one relation as in the original model. In addition, we want to exploit the shared semantic concepts via max pooling. Our contribution is to find the best representation for updating the LSTM weights, which mainly concerns the alignment in the time axis (Section 3.2). As a consequence, a new loss function is proposed




where $\max$ is the element-wise maximum operation. The intuition is to give each shared semantic concept the best representation in the multimodal sequence. In contrast, the non-shared elements keep the distribution obtained from the CTC-forward-backward step.
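The role of the max operation can be sketched as follows. The sketch assumes that the forward-backward outputs of each network are available as lists of per-timestep vectors and that a DTW path is given as (own timestep, other timestep) pairs. The function name `combine_targets`, the data layout, and the choice of the first aligned timestep when several are aligned to the same one are assumptions made for illustration, not the implementation used in the paper.

```python
def combine_targets(fb_own, fb_other, path):
    """Element-wise maximum between one modality's forward-backward
    vectors and the aligned vectors of the other modality."""
    # Map each own timestep to the first timestep it is aligned with.
    aligned = {}
    for t_own, t_other in path:
        aligned.setdefault(t_own, t_other)
    combined = []
    for t, own in enumerate(fb_own):
        other = fb_other[aligned[t]]
        combined.append([max(a, b) for a, b in zip(own, other)])
    return combined
```

Where both modalities are confident about the same class, the larger activation is kept; where the other modality has nothing to contribute, its aligned vector is small and the own distribution passes through unchanged.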

5 Experimental Design

5.1 Datasets

We generated several multimodal datasets where the elements of the sequence are missing in one or both modalities, but the relative order between the elements is the same. For example, a visual semantic concept sequence can be represented by a text line of digits “2 4 7”, and an audio semantic concept sequence can be represented by “two seven”. In this case, we assumed a simplified scenario of symbol grounding, where the continuity of semantic concepts is different in each modality. The visual component is a horizontal arrangement of isolated objects, and the audio component consists of the spoken semantic concepts of some elements of the visual component, and vice versa. We want to point out that the visual component is similar to a panorama view. The procedure for generating the multimodal datasets is explained below.

Generating Semantic Sequences: Two scenarios are considered for generating the semantic sequences for each modality: missing elements in both modalities and missing elements in one modality. For the first scenario, we generated a sequence of ten semantic concepts. Then, we randomly removed between zero and five elements from the sequence of each modality. As a result, two different sequences for two different modalities were obtained with a few common elements between them. For the second scenario, we followed a similar procedure. In that case, one sequence always has ten semantic concepts and the other sequence has missing elements. In addition, our vocabulary has 30 semantic concepts in Spanish: oso, bote, botella, bol, caja, carro, gato, queso, cigarrillo, gaseosa, bebida, pato, cara, comida, hamburguesa, higiene, liquido, loción, cebolla, pimentón, pera, redondo, sanduche, cuchara, té, teléfono, tomate, florero, vehículo, madera.
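The generation of the semantic sequences can be sketched as follows. This is a minimal sketch: the helper names are hypothetical, and drawing the ten concepts with replacement from the vocabulary is an assumption, since the sampling procedure is not specified above.

```python
import random


def drop_elements(sequence, n_missing, rng):
    """Remove n_missing random elements while preserving the order."""
    n_keep = len(sequence) - n_missing
    keep = sorted(rng.sample(range(len(sequence)), n_keep))
    return [sequence[i] for i in keep]


def make_pair(vocabulary, rng, length=10, max_missing=5):
    """First scenario: elements can be missing in both modalities."""
    base = [rng.choice(vocabulary) for _ in range(length)]
    visual = drop_elements(base, rng.randint(0, max_missing), rng)
    audio = drop_elements(base, rng.randint(0, max_missing), rng)
    return base, visual, audio
```

For the second scenario, one modality would keep `base` unchanged and the other would be obtained with `drop_elements(base, n, rng)` for a fixed number `n` of missing elements.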

Visual Component: We used a subset of 30 objects from COIL-100 [Nene et al., 1996], which is a standard dataset of 100 isolated objects. Each isolated object has 72 views at different angles. Each selected object was converted to gray scale and rescaled to 32 x 32 pixels. Then, one random view of the corresponding isolated object was selected for each semantic concept in the sequence, and all objects were stacked horizontally. In addition, random noise was added to the final image. The odd angles were used for training, and the even angles were used for testing.

Audio Component: We recorded each semantic concept two times from twelve different subjects who are Spanish native speakers (five female and seven male speakers). Afterwards, the isolated semantic concepts were concatenated for creating audio sequences. The voices were divided into nine voices for training (four females and five males) and three voices for testing (one female and two males).

Training and Testing Multimodal Datasets: We generated three different datasets for evaluating our model. The first dataset has missing elements in both modalities. While the second dataset has ten semantic concepts in the visual component, the audio modality has a fixed number of missing elements. In other words, all audio sequences only have one missing element if we compare against the visual component. The third dataset has a similar idea with respect to the second dataset. In this case, we are testing an audio sequence with ten semantic concepts, but the visual component has a fixed number of missing elements. All of the three multimodal datasets have 50,000 sequences for training and 30,000 sequences for testing. One example of the dataset is shown in Figure 3.

Figure 3: Example of the multimodal dataset. It can be observed that only three elements are shared on both modalities.

5.2 Input Features and LSTM setup

We did not apply any pre-processing step for the visual component. In contrast, the audio component was converted to Mel-Frequency Cepstral Coefficients (MFCC) using the HTK toolkit. The audio representation is a vector of 123 components: a Fourier filter-bank with 40 coefficients (plus energy), including the first and second derivatives. Afterwards, all audio and visual components were normalized to have zero mean and unit standard deviation.

In addition, the proposed extension was compared against the original model [Raue et al., 2015]. We also compared the extension against an LSTM with a CTC layer and a pre-defined coding scheme. The parameters of the visual LSTM were: 40 memory cells, learning rate 1e-4, and momentum 0.9. The audio LSTM had 100 memory cells, and its learning rate and momentum were the same as in the visual LSTM. In addition, the learning rate in the statistical constraint was set to 0.001.

6 Results and Discussion

As mentioned previously, the assumption of the original model was to represent the same semantic concept sequence in both modalities. In other words, a one-to-one relation exists between modalities. In contrast, in this work, our assumption is more challenging because the semantic concept in one modality can be or cannot be present in the other modality. We evaluate the multimodal association task using Association Accuracy (AAcc), which is defined by the following equation


where is the length of the longest common sequence, and are the output classification of each modality, and are the ground-truth labels of each modality, and

is the number of elements in the dataset. In other words, we are evaluating the association between the common elements. Our model not only learns the association but also learns to classify each modality. With this in mind, we also reported the Label Error Rate (LER) as a performance metric, which is defined by


where is the output classification, is the ground-truth, and is the edit distance between the output classification and the ground-truth. In addition, we selected randomly 10,000 sequences from the training set and 3,000 sequences from the testing. We did this selection five times and reported the average results.
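The two building blocks of these metrics, the edit distance used by the Label Error Rate and the length of the longest common sequence used by the Association Accuracy, can be sketched as follows. The normalization of the Label Error Rate by the total number of ground-truth labels is one common convention and is an assumption here; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def label_error_rate(predictions, targets):
    """Total edit distance divided by the total number of target labels (%)."""
    errors = sum(edit_distance(p, t) for p, t in zip(predictions, targets))
    n_labels = sum(len(t) for t in targets)
    return 100.0 * errors / n_labels


def lcs_length(a, b):
    """Length of the longest common subsequence of two label sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

For the visual sequence “2 4 7” and the audio sequence “two seven”, mapped to the labels `[2, 4, 7]` and `[2, 7]`, `lcs_length` returns 2, which is the number of shared elements over which the association is evaluated.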

Model Association Label Error Rate (%)
Accuracy (%) visual audio
LSTM + CTC (baseline)
Original Model (Raue et al., 2015)
Our work
Table 1: Association Accuracy (%) and Label Error Rate (%) from the multimodal dataset that has missing elements in both modalities. It can be seen that the original model performs worse than the proposed combination. Furthermore, the presented extension reached similar results to LSTM (trained for the easier classification task, and not for association) and under some conditions reaches better results (audio component).

Table 1 summarizes the performance of the LSTM trained with a pre-defined coding scheme, the original model, and the presented extension. These results lead to two observations. First, the proposed extension handles missing elements in multimodal sequences better than the original model. It can be inferred that the max operation keeps the strongest representation of the common semantic concepts between modalities. Note that these representations are used for updating the weights in the backward step. Second, the proposed extension reaches results similar to those of the standard LSTM. In this case, the LSTM was trained on each modality independently. As a reminder, we mentioned two setups for classification tasks: the traditional setup and the setup used in this work. We want to point out that the visual LSTM boosts the performance on audio sequences compared to the standard LSTM. As a result, our model reaches a lower Label Error Rate on the audio sequences than the standard LSTM trained only on audio sequences.

Figure 4: Example of several LSTM outputs and their corresponding cost matrix. Note that the semantic concepts common to both modalities have the same coding scheme representation. The DTW cost matrices show that the alignment (red line) handles missing elements.

Another outcome of this work is the conformity of the symbolic structure in both modalities, even with missing elements. Figure 4 shows examples of the coding scheme agreement. It can be observed that both LSTM networks learn to segment and classify the object-word relation in unsegmented multimodal sequences. Moreover, the common concepts in both modalities are represented by a similar symbolic feature and located at the right position in the sequence. For example, the semantic concept “redondo” (first element in the visual component and second element in the audio component) is represented by the index “27” in both modalities. (There are some cases in which one semantic concept is represented by a different coding vector in each network. However, both networks correctly retrieve the same concept regardless of the different coding scheme.) Note that not only the common elements but also the missing elements are classified correctly.

Figure 5: Association Accuracy (%) of two scenarios. One modality has ten semantic concepts and the other modality has a fixed number of missing elements. The presented model (blue solid line) outperforms the original model (red dotted line) regardless of the modality and number of missing elements.
Figure 6: Label Error Rate (%) of two scenarios. Note that the performance of the network applied to the audio modality reduces its error with respect to the original model.

In addition to the considerations made so far, we were also interested in the robustness of the presented model against the number of missing elements. Thus, we generated several datasets where one modality has ten semantic concepts and the other has a fixed number of missing elements from the ten semantic concepts. Figure 5 shows the Association Accuracy of the original model and the presented model for handling missing elements. First, the performance of the original model (red dotted line) decreases when the number of missing elements is increased, in both modalities. These results were expected because the original model relies on a one-to-one relation between modalities. Second, the presented model (blue solid line) shows a better performance than the original model (red dotted line) in both modalities. Thus, we may conclude that the presented model does not reduce its performance even if 50% of the elements are missing in one of the modalities. In more detail, Figure 6 shows that the cross-modality learning reduces the Label Error Rate of the network applied to the audio modality.

7 Conclusions

In summary, we have presented a solution inspired by the symbol grounding problem for the object-word association problem. Additionally, the model relies on multimodal sequences (visual and audio) where the semantic elements can be present in one or both modalities. Further work is planned for more realistic scenarios where the visual component is not clearly segmentable. Moreover, we are interested in extending the word-association problem to a two-dimensional image and speech. With this in mind, we will incorporate a visual attention mechanism in synchronization with speech. Finally, human language development relies on how abstract concepts are associated with the real world through the sensory input, and the scenario of the symbol grounding problem can be seen as simple. However, many questions still remain open [Needham et al., 2005; Steels, 2008].


  • [Andersen et al., 1993] Andersen, E. S., Dunlea, A., & Kekelis, L. 1993. The impact of input: language acquisition in the visually impaired. First Language, 13(37), 23–49.
  • [Berndt & Clifford, 1994] Berndt, D. J., & Clifford, J. 1994. Using Dynamic Time Warping to Find Patterns in Time Series, 359–370.
  • [Breuel et al., 2013] Breuel, T., Ul-Hasan, A., Al-Azawi, M., & Shafait, F. 2013. High-performance OCR for printed English and Fraktur using LSTM networks. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, 683–687.
  • [Byeon et al., 2015] Byeon, W., Breuel, T. M., Raue, F., & Liwicki, M. 2015. Scene labeling with LSTM recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3547–3555.
  • [Gershkoff-Stowe & Smith, 2004] Gershkoff-Stowe, L., & Smith, L. B. 2004. Shape and the first hundred nouns. Child Development, 75(4), 1098–114.
  • [Graves et al., 2006] Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. 2006. Connectionist temporal classification. In Proceedings of the 23rd International Conference on Machine Learning - ICML ’06, 369–376, New York, New York, USA. ACM Press.
  • [Harnad, 1990] Harnad, S. 1990. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1), 335–346.
  • [Hochreiter, 1998] Hochreiter, S. 1998. The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 06(02), 107–116.
  • [Hochreiter & Schmidhuber, 1997] Hochreiter, S., & Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
  • [Karpathy et al., 2014] Karpathy, A., Joulin, A., & Li, F.-F. 2014. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems, 1889–1897.
  • [Nakamura et al., 2011] Nakamura, T., Araki, T., Nagai, T., & Iwahashi, N. 2011. Grounding of word meanings in latent Dirichlet allocation-based multimodal concepts. Advanced Robotics, 25(17), 2189–2206.
  • [Needham et al., 2005] Needham, C. J., Santos, P. E., Magee, D. R., Devin, V., Hogg, D. C., & Cohn, A. G. 2005. Protocols from perceptual observations. Artificial Intelligence, 167(1), 103–136.
  • [Nene et al., 1996] Nene, S., Nayar, S., & Murase, H. 1996. Columbia Object Image Library (COIL-100).
  • [Rabiner, 1989] Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.
  • [Raue et al., 2015] Raue, F., Byeon, W., Breuel, T., & Liwicki, M. 2015. Symbol Grounding in Multimodal Sequences using Recurrent Neural Network. In Workshop Cognitive Computation: Integrating Neural and Symbolic Approaches at NIPS 15.
  • [Sohn et al., 2014] Sohn, K., Shang, W., & Lee, H. 2014. Improved multimodal deep learning with variation of information. In Advances in Neural Information Processing Systems, 2141–2149.
  • [Spencer, 2000] Spencer, P. E. 2000. Looking without listening: is audition a prerequisite for normal development of visual attention during infancy? Journal of Deaf Studies and Deaf Education, 5(4), 291–302.
  • [Srivastava & Salakhutdinov, 2012] Srivastava, N., & Salakhutdinov, R. R. 2012. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, 2222–2230.
  • [Steels, 2008] Steels, L. 2008. The symbol grounding problem has been solved, so what’s next? Symbols, Embodiment and Meaning. Oxford University Press, Oxford, UK, 223–244.
  • [Sutskever et al., 2014] Sutskever, I., Vinyals, O., & Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112.
  • [Vinyals et al., 2014] Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. 2014. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555.
  • [Werbos, 1990] Werbos, P. J. 1990. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.
  • [Yu & Ballard, 2004] Yu, C., & Ballard, D. H. 2004. A multimodal learning interface for grounding spoken language in sensory perceptions. ACM Transactions on Applied Perception (TAP), 1(1), 57–80.