Online Spatial Concept and Lexical Acquisition with Simultaneous Localization and Mapping

by Akira Taniguchi, et al.

In this paper, we propose an online learning algorithm based on a Rao-Blackwellized particle filter for spatial concept acquisition and mapping. We previously proposed a nonparametric Bayesian spatial concept acquisition model (SpCoA). Here, we propose a novel method (SpCoSLAM) that integrates SpCoA and FastSLAM in the theoretical framework of the Bayesian generative model. The proposed method can simultaneously learn place categories and lexicons while incrementally generating an environmental map. Furthermore, the proposed method adds scene image features and a language model to SpCoA. In the experiments, we tested online learning of spatial concepts and environmental maps in a novel environment for which the robot did not have a map. We then evaluated the results of online learning of spatial concepts and lexical acquisition. The experimental results demonstrated that, by using the proposed method, the robot was able to incrementally learn the relationships between words and places in the environmental map more accurately.



I Introduction

Robots coexisting with humans and operating in various environments are required to adaptively learn and use the spatial concepts and vocabulary related to different places. However, unlike object concepts, the target domain of a spatial concept can be unclear and can differ according to the user and environment. Therefore, it is difficult to manually design spatial concepts in advance, and it is desirable for robots to autonomously learn spatial concepts based on their own experiences.

The related research fields of semantic mapping and place categorization [1, 2] have attracted considerable interest in recent years. However, most of these studies have treated the semantics of places and mapping by simultaneous localization and mapping (SLAM) [3] as separate, independent methods. In addition, the semantics of places, place categories, and names of places could only be learned from pre-set values. In this paper, we propose a novel unsupervised Bayesian generative model and an online learning algorithm that can simultaneously learn spatial concepts and an environmental map from multimodal information. The proposed method can automatically and sequentially perform place categorization and learn unknown words without prior knowledge.

Fig. 1: Overview of online learning of spatial concepts and an environmental map; We aim to develop a method that enables mobile robots to learn spatial concepts, a lexicon and an environmental map sequentially from interaction with an environment and human, even in an unknown environment without prior knowledge.

Taniguchi et al. [4] proposed a method that integrated ambiguous speech-recognition results with the self-localization method for learning spatial concepts. In addition, Taniguchi et al. [5] proposed the nonparametric Bayesian spatial concept acquisition method (SpCoA) based on an unsupervised word-segmentation method known as latticelm [6]. On the other hand, Ishibushi et al. [7] proposed a self-localization method that exploits image features using a convolutional neural network (CNN) [8]. These methods [4, 5, 7] cannot cope with changes in the names of places and the environment because they use batch learning algorithms. In addition, these methods cannot learn spatial concepts from unknown environments without a map, i.e., the robot needs to have a map generated by SLAM beforehand. Therefore, in this paper, we develop an online algorithm that can sequentially learn a map and spatial concepts that integrate positions, speech signals, and scene images.

FastSLAM [9, 10] provides an online algorithm for efficient self-localization and mapping using a Rao-Blackwellized particle filter (RBPF) [11]. In this paper, we introduce a grid-based FastSLAM algorithm into the generative model for spatial concept acquisition. The graphical model of SpCoA integrates spatial lexical acquisition into Monte Carlo localization (MCL), a particle-filter-based self-localization method, so SpCoA can be extended naturally to SLAM. Therefore, we assume that the robot can learn vocabulary related to places and a map sequentially.

One of the important problems of our research is unsupervised lexical acquisition. There are research efforts on incremental spatial language acquisition through robot-to-robot interaction [12, 13]. However, these studies [12, 13] did not consider lexical acquisition through human-to-robot speech interactions (HRSI). For online unsupervised lexical acquisition by HRSI, it is necessary to deal with the problems of phoneme recognition errors and word segmentation of uttered sentences containing errors. SpCoA reduced phoneme recognition errors of word segmentation by using the weighted finite-state transducer (WFST)-based unsupervised word segmentation method latticelm [6]. Araki et al. [14] performed a pseudo-online algorithm using the nested Pitman–Yor language model (NPYLM) [15]. However, these studies [5, 14] have reported that word segmentation of speech recognition results including errors causes over-segmentation [16]. In this paper, we will improve the accuracy of speech recognition by updating the language models sequentially.

We assume that the robot has not acquired any vocabulary in advance, and can recognize only phonemes or syllables. We represent the spatial area of the environment in terms of a position distribution. Furthermore, we define a spatial concept as a place category that includes place names, scene image features, and the position distributions corresponding to those names.

The goal of this study is to develop a robot that learns spatial concepts incrementally from multimodal information obtained while moving in the environment. The main contributions of this paper are as follows.

  • We propose an online algorithm based on RBPF for spatial concept acquisition. The proposed method integrates SpCoA and FastSLAM in the theoretical framework of the Bayesian generative model.

  • We demonstrate that a robot without a pre-existing lexicon or map can learn spatial concepts and an environmental map incrementally.

II Online Spatial Concept Acquisition

II-A Overview

An overview of the proposed method is shown in Fig. 1. The proposed method, SpCoSLAM, performs online spatial concept acquisition together with simultaneous localization and mapping. It can sequentially learn spatial concepts for unknown environments and unexplored regions without maps. In addition, it can mutually complement the uncertainty of information by using multimodal information. A pseudo-code for the online learning is given in Algorithm 1. The procedure of SpCoSLAM for each step is described as follows. Steps 2 to 6 are performed for each particle.

  1. A robot gets WFST speech recognition results of the user’s speech signals using a language model of the previous step. (line 3 in Algorithm 1)

  2. The robot gets the observation likelihood by performing a sample motion model and a measurement model of FastSLAM. (line 5-10)

  3. The robot performs unsupervised word segmentation latticelm [6] using WFST speech recognition results. (line 11)

  4. The robot gets latent variables of spatial concepts by sampling. The details of this process are described in Section II-E. (line 12)

  5. The robot gets the marginal likelihood of observation data as the importance weight. (line 13-15)

  6. The robot updates an environmental map. (line 16)

  7. The robot estimates the set of parameters of spatial concepts from data and sampled values. (line 17)

  8. The robot selects the language model of the particle with the maximum weight for use in the next step. (line 20-21)

  9. The robot performs resampling of particles according to weights. (line 22-25)

Fig. 2: Graphical model representation of SpCoSLAM; It expresses multimodal place categorization, lexical acquisition and SLAM as one Bayesian generative model. Gray nodes indicate observation variables.
Symbol                                                              Description
$x_t$                                                               Self-position of a robot
$z_t$                                                               Sensor data (depth data)
$u_t$                                                               Control data
$f_t$                                                               Image feature
$y_t$                                                               Speech signal
$S_t$                                                               Sequence of words (word segmentation result)
$C_t$                                                               Index of spatial concepts
$i_t$                                                               Index of position distributions
$m$                                                                 Environmental map
$\pi$                                                               Multinomial distribution of index of spatial concepts
$\phi_l$                                                            Multinomial distribution of index of position distribution
$\mu_k$, $\Sigma_k$                                                 Position distribution (mean vector, covariance matrix)
$\theta_l$                                                          Multinomial distribution of image feature
$W_l$                                                               Multinomial distribution of the names of places
$LM$                                                                Language model (word dictionary)
$AM$                                                                Acoustic model for speech recognition
$\alpha,\beta,\gamma,\chi,\lambda$, $m_0,\kappa_0,V_0,\nu_0$        Hyperparameters of prior distributions
TABLE I: Each element of the graphical model of SpCoSLAM

II-B Definition of generative model and graphical model

Figure 2 shows the graphical model of SpCoSLAM and Table I lists each variable of the graphical model. We describe the formulation of the generation process represented by the graphical model as follows:


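where $\mathrm{DP}()$ represents the Dirichlet process, $\mathrm{Mult}()$ is the multinomial distribution, $\mathrm{Dir}()$ is the Dirichlet distribution, $\mathcal{IW}()$ is the inverse-Wishart distribution, and $\mathcal{N}()$ is the Gaussian distribution.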

Equation (6) is approximated using unigram rescaling [17], as shown in (15), where the approximation sign marked UR represents the approximation by unigram rescaling.


where $B_t$ denotes the number of words in the sentence.

Then, the probability distribution for (14) can be defined as follows:


II-C Formulation of the speech recognition and the unsupervised word segmentation

The 1-best speech recognition and the WFST speech recognition are represented as follows:


where denotes the speech recognition result of WFST format, which is a word graph representing the speech recognition results. The unsupervised word segmentation of WFST by latticelm [6] is represented as follows:

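1: procedure SpCoSLAM
3:     get WFST speech recognition results using the language model of the previous step
4:     for each particle do
5-10:       apply the sample motion model and the measurement model of FastSLAM (inner loop at lines 7-9)
11:         perform unsupervised word segmentation (latticelm) on the WFST speech recognition results
12:         sample the latent variables of spatial concepts
13-15:      compute the marginal likelihood of the observation data as the importance weight
16:         update the environmental map
17:         estimate the set of parameters of spatial concepts
19:     end for
20-21: select the language model of the particle with the maximum weight
22:     for each particle do
23:         draw a particle with probability proportional to its weight
24:         add it to the new particle set
25:     end for
26:     return the new particle set
27: end procedure
Algorithm 1 Online learning algorithm of SpCoSLAM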
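
The resampling loop at lines 22-25 of Algorithm 1 can be illustrated with a minimal sketch of multinomial resampling. It assumes that each particle is an arbitrary Python object with an associated non-negative weight. The function name `resample` and the `rng` argument are hypothetical and are not part of the original algorithm listing.

```python
import random


def resample(particles, weights, rng=None):
    """Draw len(particles) particles with probability proportional to weight.

    Hypothetical helper: the source only states that particles are drawn
    according to their weights, not how the draw is implemented.
    """
    if len(particles) != len(weights):
        raise ValueError("particles and weights must have the same length")
    if sum(weights) <= 0:
        raise ValueError("at least one weight must be positive")
    rng = rng or random.Random()
    # random.choices samples with replacement according to the given weights.
    return rng.choices(particles, weights=weights, k=len(particles))


# Toy usage: all of the weight sits on the second particle.
new_particles = resample(["p0", "p1", "p2"], [0.0, 1.0, 0.0], rng=random.Random(0))
```

Practical particle filters often use low-variance (systematic) resampling instead. Multinomial resampling is used here only because it mirrors the "draw with probability, add to set" structure of the listing.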

II-D Online spatial concept acquisition and mapping

Here, we describe the derivation of formulas for the online algorithm. The online learning algorithm of the proposed method can be derived by introducing sequential update equations for estimating the parameters of the spatial concepts into the formulation of FastSLAM based on RBPF. The proposed method assumes the grid-based FastSLAM 2.0 algorithm [9, 10]. Algorithm 1 is the online learning algorithm of SpCoSLAM. An advantage of using a particle filter is that parallel processing can be applied easily, because each particle can be calculated independently.

In the formulation of FastSLAM, the joint posterior distribution is factorized as follows:


This factorization represents a decomposition into two calculations: the mapping and self-localization by RBPF.
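
For reference, the standard Rao-Blackwellized factorization used by grid-based FastSLAM can be written as follows. The notation is assumed here rather than taken from the source: $x_{0:t}$ is the trajectory of self-positions, $m$ the map, $z_{1:t}$ the sensor data, and $u_{1:t}$ the control data.

```latex
\begin{equation*}
p(x_{0:t}, m \mid z_{1:t}, u_{1:t})
  = \underbrace{p(m \mid x_{0:t}, z_{1:t})}_{\text{mapping}}\;
    \underbrace{p(x_{0:t} \mid z_{1:t}, u_{1:t})}_{\text{self-localization by particle filter}}
\end{equation*}
```

The second factor is approximated by the particles, and the first factor is computed analytically for each particle given its trajectory.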

In the formulation of SpCoSLAM, the joint posterior distribution can be factorized into the probability distributions of a language model, a map, the set of model parameters of spatial concepts, and the joint distribution of the trajectory of self-position and the set of latent variables. We describe the joint posterior distribution of SpCoSLAM as follows:


where $h$ denotes the set of hyperparameters. Note that the speech signal is not observed at all times. In this paper, the proposed method is equivalent to FastSLAM at time steps when the speech signal is not observed.

The particle filter algorithm uses sampling importance resampling (SIR). We describe the importance weight for each particle as follows:


where the particle index is $r$. The number of particles is $R$. Henceforth, equations are also calculated for each particle $r$, but the subscripts representing the particle index are omitted.

We describe the target distribution as follows:


We describe the proposal distribution as follows:


The weight is represented by (22), (23), and (24) as follows:


We assume the proposal distribution at time $t$ as follows:


Then, is equivalent to the proposal distribution of FastSLAM 2.0.

The term for the indices $C_t$ of spatial concepts and $i_t$ of position distributions is the marginal distribution with respect to the set of model parameters. This distribution can be calculated by a formula equivalent to collapsed Gibbs sampling. We describe the equation for sampling $C_t$ and $i_t$ simultaneously as follows:


The details of (27) are described in Section II-E.

We approximate the term for the word sequence $S_t$ by speech recognition using the language model and by unsupervised word segmentation using the WFST speech recognition results as follows:

In the formulation of (21), it is desirable to estimate the language model for each particle. However, in this case, each teaching utterance would require running speech recognition a number of times equal to the number of data multiplied by the number of particles. In order to reduce the computational cost, we use the language model of the particle with the maximum weight for speech recognition in the next step.

Finally, the importance weight is represented as follows:


This equation is obtained by multiplying the weight at the previous time by the marginal likelihoods for the sensor data $z_t$, the image features $f_t$, and the word sequence $S_t$.
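
The multiplicative weight update can be sketched numerically as below. This is an illustrative sketch only: it assumes that the per-particle marginal likelihood terms have already been computed elsewhere, and it works in the log domain to avoid underflow. The function names and the toy likelihood values are hypothetical.

```python
import math


def update_log_weights(prev_log_weights, log_likelihood_terms):
    """Add each particle's log marginal likelihood terms to its previous log weight.

    log_likelihood_terms[r] holds the log marginal likelihood terms of
    particle r (one term per observation modality).
    """
    return [
        log_w + sum(terms)
        for log_w, terms in zip(prev_log_weights, log_likelihood_terms)
    ]


def normalize(log_weights):
    """Convert log weights to normalized weights using the log-sum-exp trick."""
    peak = max(log_weights)
    unnormalized = [math.exp(log_w - peak) for log_w in log_weights]
    total = sum(unnormalized)
    return [w / total for w in unnormalized]


# Two particles with equal previous weights and three likelihood terms each.
prev = [math.log(0.5), math.log(0.5)]
terms = [
    [math.log(0.2), math.log(0.5), math.log(0.1)],
    [math.log(0.2), math.log(0.5), math.log(0.3)],
]
weights = normalize(update_log_weights(prev, terms))
```

In this toy case the second particle explains the third observation three times better than the first, so it receives three times the normalized weight.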

II-E Simultaneous sampling of indices $C_t$ and $i_t$

The proposed method uses the Chinese restaurant process (CRP) [18], which is one of the constructive representations of the Dirichlet process (DP). We describe the distribution of the index $C_t$ of spatial concepts using the CRP representation as follows:


where denotes the number of data allocated to the -th spatial concept in all data up to the time . The number of data is .
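
The CRP prior described above can be illustrated with a short sketch of the textbook form: an existing cluster is chosen in proportion to the number of data already allocated to it, and a new cluster in proportion to the concentration parameter. The function name and the variable names (`counts`, `alpha`) are assumptions for illustration. The exact indexing of the counts in the original equation is not recoverable from the text.

```python
def crp_probabilities(counts, alpha):
    """Return CRP prior probabilities for each existing cluster and a new one.

    counts[l] is the number of data already allocated to cluster l;
    alpha is the concentration parameter of the Dirichlet process.
    The last entry of the result is the probability of opening a new cluster.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    total = sum(counts) + alpha
    probs = [n / total for n in counts]
    probs.append(alpha / total)
    return probs


# Four data points so far: three in the first concept, one in the second.
probs = crp_probabilities([3, 1], alpha=1.0)
```

With these toy counts, the first concept has prior probability 3/5, and the second concept and a new concept each have 1/5.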

We describe the distribution of the index $i_t$ of position distributions by the CRP representation as follows:


where denotes the number of data allocated to the -th position distribution in data allocated to the -th spatial concept.

Therefore, the joint prior distribution of $C_t$ and $i_t$ is represented as follows:


The probability of words is represented as follows:


where denotes the number of types of words, i.e., the number of dimensions of the multinomial distribution of the names of places and denotes the total number of words of the -th dimension allocated to the -th multinomial distribution of the names of the places in words .

The probability of image features is represented as follows:


where denotes the number of dimensions of image features, denotes the total number of image features of the -th dimension allocated to the -th multinomial distribution of image features in image features , and .
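
Both the word probability and the image-feature probability above are built from counts allocated to a multinomial distribution with a Dirichlet prior. A minimal sketch of the standard collapsed (Dirichlet-multinomial) predictive is given below. It assumes a symmetric Dirichlet prior, and the function names and the `concentration` argument are hypothetical. The sequential count update in `sequence_probability` is one common choice and may differ from the exact form used in the paper.

```python
def predictive_probability(counts, symbol, concentration):
    """Posterior predictive of one symbol under a symmetric Dirichlet prior.

    counts[g] is the number of past observations of dimension g that were
    allocated to this multinomial distribution.
    """
    dims = len(counts)
    return (counts[symbol] + concentration) / (sum(counts) + dims * concentration)


def sequence_probability(counts, sequence, concentration):
    """Probability of a sequence of symbols, updating counts as each is seen."""
    running = list(counts)  # copy so the caller's counts are not modified
    prob = 1.0
    for symbol in sequence:
        prob *= predictive_probability(running, symbol, concentration)
        running[symbol] += 1
    return prob


# Toy vocabulary of three word types with past counts 2, 0, and 1.
counts = [2, 0, 1]
p_first_word = predictive_probability(counts, 0, concentration=0.1)
p_two_words = sequence_probability(counts, [0, 0], concentration=0.1)
```

The same functions apply to image features by treating each feature dimension as a symbol and the feature value as a count.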

The probability of self-position of the robot is described as follows:


where the function denotes the multivariate Student’s t-distribution [19]. Then, the posterior parameters in (35) are represented as follows:


where and are the number of data and the set of position data, respectively, allocated to the position distribution of in data up to the time .
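
As a worked reference, the textbook posterior update for a Gaussian with a Gaussian-inverse-Wishart prior, and the resulting Student's t predictive, are written out below. All symbols are assumed notation rather than recovered from the source: $m_0, \kappa_0, V_0, \nu_0$ are the prior hyperparameters, $x_1, \dots, x_n$ are the $n$ position data allocated to the position distribution, $\bar{x}$ is their mean, and $d$ is the dimension of a position.

```latex
\begin{align*}
\kappa_n &= \kappa_0 + n, \qquad \nu_n = \nu_0 + n, \\
m_n &= \frac{\kappa_0 m_0 + n\bar{x}}{\kappa_n}, \\
V_n &= V_0 + \sum_{j=1}^{n} x_j x_j^{\top}
       + \kappa_0 m_0 m_0^{\top} - \kappa_n m_n m_n^{\top}, \\
p(x \mid x_{1:n}) &= \mathrm{St}\!\left(x \,\middle|\, m_n,\;
       \frac{(\kappa_n + 1)\, V_n}{\kappa_n\,(\nu_n - d + 1)},\;
       \nu_n - d + 1\right).
\end{align*}
```

When no data are allocated ($n = 0$), the posterior parameters reduce to the prior hyperparameters, which gives the predictive used for a newly created position distribution.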

From the above, (27) can be expressed as follows:

III Experiments

We performed experiments for online learning of spatial concepts from a novel environment. In addition, we performed evaluations of place categorization and lexical acquisition related to place. We compare the performance of four methods as follows:

  (A) SpCoSLAM

  (B) Online SpCoA based on RBPF

  (C) Online SpCoA

  (D) SpCoA (Batch learning) [5]

Methods (A), (B), and (C) performed online learning algorithms based on the CRP representation. Methods (B), (C), and (D), which are based on SpCoA, neither updated the language model nor used image features. Method (D) performed Gibbs sampling based on a weak-limit approximation [20] of the stick-breaking process (SBP) [21], i.e., upper limits were set on the numbers of spatial concepts and position distributions. In the batch learning (D), we performed Gibbs sampling for 100 iterations.

Fig. 3: Learning results of each position distribution in a generated map; Ellipses denoting the position distributions are drawn on the map at steps 15, 30, and 50. The colors of the ellipses were determined randomly. Each ellipse is labeled with its index number.

III-A Online learning

We conducted experiments for online spatial concept acquisition in a real environment. We extended the gmapping package, which implements the grid-based FastSLAM 2.0 [9, 10] in the Robot Operating System (ROS). We used an open dataset (albert-b-laser-vision) containing a rosbag file in which the odometry, laser range data, and vision data were recorded. This dataset was obtained from the Robotics Data Set Repository (Radish) [22]. The authors thank Cyrill Stachniss for providing this data. Because the dataset did not include speech signal data, we prepared Japanese speech signal data corresponding to the movement of the robot in the dataset. The number of teaching places was 10, and there were nine place names. The teaching utterances included 10 types of phrases. The total number of utterances was 50. The employed microphone was a SHURE PG27-USB. The speech recognition system used Julius dictation-kit-v4.3.1-linux (GMM-HMM decoding) [23]. The initial word dictionary of the Julius system contained 115 Japanese syllables. The unsupervised word segmentation system used latticelm [6]. We used the deep learning framework Caffe [24] for CNNs as an image feature extractor. We used a pre-trained CNN, i.e., Places205-AlexNet trained on 205 scene categories of the Places Database [25]. The map resolution was 0.05 m/grid. The number of particles and the hyperparameters were set so that all methods in the comparison were tested under the same conditions.

Fig. 3 shows the position distributions in the environmental maps at steps 15, 30, and 50. The upper part of this figure shows, for each position distribution, an example of the corresponding image, the correct phoneme sequence of the name of the place, and the three most probable words estimated by the word probability distribution at that step. Fig. 3 thus shows how the spatial concepts are acquired while the map is built sequentially. Details of the online learning experiment can be seen in the video attachment.

III-B Estimation accuracy of spatial concepts

We compare the matching rate for the estimated index $C_t$ of the spatial concept of each teaching utterance and the classification results of correct answers by a person. In this experiment, the evaluation metric uses the normalized mutual information (NMI), which is a measure of the degree of similarity between two clustering results. The estimated index $i_t$ of the position distributions is also evaluated in the same manner. In addition, we evaluate the estimated number of spatial concepts and position distributions by using the estimation accuracy rate (EAR). The EAR was calculated as follows:


where is the correct number and is the estimated number.
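
The two metrics can be illustrated with a small sketch. Both definitions here are assumptions, since the source equations did not survive extraction: NMI is normalized by the geometric mean of the two entropies (other normalizations exist), and EAR is taken to be one minus the relative error of the estimated number, clipped at zero. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same data."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI normalized by the geometric mean of the entropies (assumed variant)."""
    if len(a) != len(b) or not a:
        raise ValueError("labelings must be non-empty and of equal length")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case: a labeling with a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / math.sqrt(h_a * h_b)


def ear(true_count, estimated_count):
    """Estimation accuracy rate (assumed form): 1 - relative error, floored at 0."""
    return max(1.0 - abs(true_count - estimated_count) / true_count, 0.0)


same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
close_estimate = ear(10, 9)
far_estimate = ear(10, 25)
```

NMI is invariant to renaming of cluster indices, which is why it suits the comparison of estimated indices with human labels. Under the assumed EAR form, an estimate more than twice the true number scores zero.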

Table II lists the evaluation-value averages calculated using the metrics NMI and EAR at step 50. Fig. 4 shows the average of the NMI values in 10 trials by online learning. For both the index $C_t$ of spatial concepts and the index $i_t$ of position distributions, the NMI values tended to rise at the beginning. The NMI values of $C_t$ were similar for methods (A), (B), and (C). In the NMI values for $i_t$, the proposed method (A) showed higher values than the other methods after step 30. A likely reason for the clustering results of spatial concepts is that, in online lexical acquisition, the word segmentation results cannot be obtained stably when the training dataset is small. We consider that stable words can be obtained by further increasing the number of training steps. Fig. 5 shows the average of the number of spatial concepts and the number of position distributions in 10 trials by online learning. The average values of the estimated results of method (D) were , . True data was determined by a user based on teaching data. The experimental results show that the proposed method (A) was closer to the true data than the other methods for both the number of spatial concepts and the number of position distributions.

Fig. 4: Accuracy rates of the estimation results for (a) index of spatial concepts and (b) index of position distributions
Fig. 5: Estimation results for (a) the number of spatial concepts and (b) the number of position distributions

Method                    NMI (spatial concepts)   NMI (position distributions)   EAR (spatial concepts)   EAR (position distributions)
(A) SpCoSLAM              0.347                    0.744                          0.913                    0.964
(B) Online SpCoA (RBPF)   0.314                    0.716                          0.341                    0.682
(C) Online SpCoA          0.348                    0.699                          0.344                    0.770
(D) SpCoA [5]             0.805                    0.856                          0.000                    0.690
TABLE II: Evaluation values of NMI and EAR for each method

III-C Comparison of the number of segmented words

We examine whether a phoneme sequence including the name of a place is properly segmented. Fig. 6 shows the number of segmented words. The morphological segmentation (purple line) was obtained by segmenting the utterances into Japanese morphemes using MeCab, an off-the-shelf Japanese morphological analyzer that is widely used for natural language processing. The phrase segmentation (yellow line) is the number of words obtained when segmenting only before and after the name of the place, i.e., we assume that a phrase other than the name of the place is one word. Table III presents examples of the word segmentation results of the four methods. Method (A) was similar to the phrase segmentation. On the other hand, methods (B) and (C) showed results of over-segmentation. In addition, the average number of segmented words of method (D) was 391.4, i.e., it was similar to methods (B) and (C) at step 50. The results indicate that method (A) mitigated the problem of over-segmentation by updating the language model sequentially.

Fig. 6: Number of segmented words
English         “We come to the end of corridor.”   “The faculty laboratory is here.”        “This place name is the break room.”
Morpheme        ikidomarinikimashita                kyouiNkeNkyuushitsuwakochiradesu         konobasyononamaewakyuukeijo
Phrase          ikidomarinikimashita                kyouiNkeNkyuushitsuwakochiradesu         konobasyononamaewakyuukeijo
(A)             aaerikidomarinikeiwasuta            kyoiiNiNteNkyushitsuwaqgochigadesu       ukonomasyonamaewaakyuuqkirijo
(B), (C), (D)   pikidomaenikimasya                  kyooiNteNkyushisuwakochigadesu           konobasyononamaewakyuukeijo
TABLE III: Examples of word segmentation results of uttered sentences. denotes a word segment position

III-D Place recognition using a speech signal

When the robot hears a user’s speech signal including the name of a place, the robot estimates a position indicated by the uttered sentence. The user says “** ni iqte.” (which means “Go to **.” in English). The estimation of a position was calculated as follows:


In this experiment, (43) was approximated by using the speech recognition results from 1-best to 10-best as follows: