MuMMER: Socially Intelligent Human-Robot Interaction in Public Spaces

by Mary Ellen Foster, et al.
University of Glasgow

In the EU-funded MuMMER project, we have developed a social robot designed to interact naturally and flexibly with users in public spaces such as a shopping mall. We present the latest version of the robot system developed during the project. This system encompasses audio-visual sensing, social signal processing, conversational interaction, perspective taking, geometric reasoning, and motion planning. It combines all these components in an overarching framework using the Robot Operating System (ROS) and has been deployed in a shopping mall in Finland, where it interacts with customers. In this paper, we describe the system components, their interplay, and the resulting robot behaviours and scenarios provided at the shopping mall.




Introduction

In the EU-funded MuMMER project, we have developed a socially intelligent interactive robot designed to interact with the general public in open spaces, using SoftBank Robotics’ Pepper humanoid robot as the primary platform [13]. The MuMMER system provides an entertaining and engaging experience that enriches human-robot interaction. Crucially, our robot exhibits behaviour that is socially appropriate and engaging by combining speech-based conversational interaction with non-verbal communication and motion planning. To support this behaviour, we have developed and integrated new methods from audiovisual scene processing, social-signal processing, conversational AI, perspective taking, and geometric reasoning.

Figure 1: The MuMMER robot system interacting with a customer in the Ideapark shopping mall, September 2018.

The primary MuMMER deployment location is Ideapark, a large shopping mall in Lempäälä, Finland. The MuMMER robot system has been taken to the shopping mall several times for short-term co-design activities with the mall customers and retailers [18, 19]; the full robot system has been deployed for short periods in the mall in September 2018 (Figure 1), May 2019, and June 2019, and has been installed for a long-term, three-month deployment as of September 2019.

The demo system supports a range of behaviours covering a variety of functional and entertainment tasks that are appropriate for a shopping-mall setting, including guidance to various locations within the mall, small-talk, and playing quiz games with customers. The activities during the deployment have included a number of data collection studies with real users: recording of customer interaction with the robot in guidance situations, sound localisation and automatic speech recognition in the noisy mall environment, and tests for AI-based conversation and localisation and navigation based on a partial 3D model of the mall and a complete semantic model.

In the remainder of this paper, we outline the technical contributions in each of the main MuMMER component areas: audiovisual sensing, social signal processing, conversational interaction, human-aware robot motion planning, knowledge representation and decision. At the end, we describe the details of the deployed robot system.

Audio-visual sensing

For MuMMER, the main task of audio-visual perception is sensing people in general: that is, maintaining a representation of the persons around the robot, with dedicated attention to people who are likely to interact with it, or who are (or have been) interacting with it. This requires several audio-visual algorithms to detect, track, and re-identify people, to detect their non-verbal behaviors and activities, and to predict their positions and behaviours even when they are not seen. At the same time, the representation of people needs to be defined and shared with other modules which are responsible for inferring other knowledge about people (for instance, to define a person’s goal in the interaction).

Figure 2: The perception system tracks and re-identifies people leaving the field of view, and extracts other features: speaker turns, head pose, visual focus of attention and nods.

For visual tracking, we first detect people with convolutional pose machines (CPM) [6], which provide accurate locations of the body joints (nose, eyes, shoulders, etc.). The output of the algorithm is almost perfect when people are in the foreground of the image, at distances of up to 3 or 5 meters depending on the resolution, which matches our use case of an entertainment robot in a shopping mall. On top of the CPM, we use OpenHeadPose [5], which makes use of the heat maps of the CPM to estimate the head pose of the person. Then, we perform head pose tracking [20] to maintain a consistent identity across adjacent frames (Figure 2). The head pose tracker is a particle filter based mainly on color and face cues. As faces are tracked, we store OpenFace features [1], which are computed on an aligned face. When a new tracklet is created, its OpenFace features are compared with the previously accumulated features, and the new tracklet is assigned the identity that receives the most votes.
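The vote-based re-identification step can be illustrated with a minimal sketch. It assumes a nearest-neighbour vote over Euclidean feature distances with a rejection threshold; the function names, the `max_distance` parameter and the two-dimensional toy features are hypothetical and are not taken from the MuMMER implementation.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def reidentify(tracklet_features, gallery, max_distance=0.8):
    """Return the stored identity with the most votes, or None for a new person.

    gallery maps identity -> list of previously accumulated feature vectors.
    Each feature of the new tracklet votes for the identity that owns its
    nearest stored feature, provided that feature is close enough.
    The threshold max_distance is an assumed, illustrative value.
    """
    votes = Counter()
    for feat in tracklet_features:
        best_id, best_dist = None, float("inf")
        for identity, stored in gallery.items():
            for ref in stored:
                d = euclidean(feat, ref)
                if d < best_dist:
                    best_id, best_dist = identity, d
        if best_id is not None and best_dist <= max_distance:
            votes[best_id] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Toy gallery with two known people (real face features are high-dimensional).
gallery = {
    "person_1": [[0.0, 0.0], [0.1, 0.0]],
    "person_2": [[1.0, 1.0], [0.9, 1.1]],
}
new_tracklet = [[0.95, 1.0], [1.05, 0.9], [0.2, 0.1]]
assigned = reidentify(new_tracklet, gallery)
```

In this toy case two of the three features vote for `person_2`, so the new tracklet inherits that identity; a tracklet whose features are far from everything in the gallery returns `None` and would be treated as a new person.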

For sound localization, we use a multi-task neural network (NN) which jointly performs speech/non-speech detection and sound source localisation [15, 17], applied on top of the 4-channel microphone array embedded on the robot. The NN takes as input 4-channel audio transformed into the frequency domain, and it outputs the likelihood values for the two tasks. Thanks to a semi-automated and synthetic data collection procedure taking advantage of the robotic platform, as well as the use of a weak supervision learning approach, it is possible to quickly collect data to learn the models for a new sensor [16]. The fusion between the visual and audio parts is done by assigning the detected speech to the person who is standing in the given direction.
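As an illustration of this audio-visual fusion rule, the sketch below assigns a detected speech direction to the tracked person whose azimuth is closest to it. The robot-centred coordinate frame, the 15-degree tolerance and all names are assumptions made for the example, not details of the deployed system.

```python
import math

def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)

def assign_speech(sound_azimuth_deg, people, max_error_deg=15.0):
    """Return the id of the person closest in azimuth to the sound, or None.

    people maps person id -> (x, y) position in an assumed robot frame
    (x forward, y to the left, azimuth measured from the x axis).
    """
    best_id, best_err = None, max_error_deg
    for person_id, (x, y) in people.items():
        azimuth = math.degrees(math.atan2(y, x))
        err = angular_difference(sound_azimuth_deg, azimuth)
        if err <= best_err:
            best_id, best_err = person_id, err
    return best_id

people = {"person_1": (1.5, 0.0), "person_2": (1.0, 1.0)}
speaker = assign_speech(40.0, people)
```

Here the sound arrives from 40 degrees, `person_2` stands at 45 degrees and is therefore labelled as the speaker; speech from a direction where nobody is tracked is left unassigned.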

Finally, although a close-range (up to 1.2m) gaze sensing module is available and can be applied to one selected person using a self-calibrated approach [32], as a compromise between computation and robustness we instead compute the visual focus of attention of each person based on the head pose [31]. The algorithm can reliably estimate the object the person is looking at (either the robot, the other persons, the targets, the shops, or the tablet embedded in the robot). This is a preliminary step towards identifying the addressee, and is also used in the context of perspective taking to determine whether the human has looked in the direction in which the robot pointed [28].

Social signal processing

Figure 3: Social state estimator visualiser output, displaying all relevant information for fine-tuning.
Figure 4: Gesture variations. Each row shows the same gesture using different parameters.

For Social Signal Processing, we focus on two primary tasks: fusing the provided audio-visual sensing data for social state estimation, and synthesising appropriate social signals for the robot to use when communicating with users. While detecting, tracking, and (re)identifying users, as well as detecting their primary non-verbal behaviours and activities, provide the basic signals, the multi-modal fusion of these signals allows for a more accurate and deeper understanding of the underlying social state, including gaining personality impressions from the user. The estimated social state is then made available to inform the planning of the robot’s subsequent actions: which users to converse with and how, and how the robot should move and behave (gestures) in the presence of the user(s).

Social state estimation

On the fusion side, the main function of the social state estimator is to determine which user the robot should initiate interaction with. We used the underlying assumption that the robot should initiate interaction with the user perceived to be the most willing to interact, which we took to be the user paying the most attention to the robot. We assume that the user paying the most attention to the robot is the user who is looking most directly at the robot and who is situated closest to it.

To this end, the social state estimator aggregates audio-visual sensing data about the head pose of users, whether the users are looking at the robot and/or the screen on the robot, and the distance between the users’ head and the robot. The head pose data of the users are used to calculate the (Euclidean) distance between the head pose of the users and three centroids derived from clustering/classifying previously recorded lab and deployment data. This distance is then normalised to a value between zero and one, and used as a probability. The distances between the users and the robot are used as a penalty, and normalised between zero and one as a probability in such a way that users further away from the robot are penalised more than users closer to the robot.

This results in four probabilities, two taken directly from the audio-visual sensing data, and two derived from it. These four probabilities are then fused into one attention probability by calculating their (weighted) average. Choosing which user the robot should interact with is done by comparing the attention probability against a configurable minimum attention probability threshold, and then selecting the user with the highest attention probability.

To prevent immediately re-initiating an interaction with a user that the robot has just interacted with, the social state estimator also monitors the actions of the planning and dialogue components. The social state estimator then maintains a list of the users that the robot is, and has been, interacting with, and applies a penalty to their attention probability while they are interacting and for a short time afterwards.
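The selection logic described above can be sketched as follows. This is a minimal illustration under stated assumptions: the cue names, the equal weights, the linear distance normalisation, the 0.5 threshold and the multiplicative penalty for recently served users are all hypothetical choices, since the actual parameter values were tuned on lab and deployment data.

```python
def distance_score(distance_m, max_distance_m=4.0):
    """Map a distance to [0, 1] so that users further away score lower."""
    clipped = min(max(distance_m, 0.0), max_distance_m)
    return 1.0 - clipped / max_distance_m

def attention_probability(cues, weights):
    """Fuse the per-cue probabilities into one value by weighted average."""
    total = sum(weights[name] for name in cues)
    return sum(weights[name] * value for name, value in cues.items()) / total

def select_user(users, weights, min_attention=0.5, recent=(), recent_penalty=0.5):
    """Pick the user with the highest attention probability above the threshold.

    Users listed in recent (current or recent interlocutors) have their
    probability scaled down by recent_penalty (an assumed penalty scheme).
    """
    best_id, best_p = None, min_attention
    for user_id, cues in users.items():
        p = attention_probability(cues, weights)
        if user_id in recent:
            p *= recent_penalty
        if p >= best_p:
            best_id, best_p = user_id, p
    return best_id

weights = {"looks_at_robot": 1.0, "looks_at_screen": 1.0,
           "head_pose": 1.0, "proximity": 1.0}
users = {
    "a": {"looks_at_robot": 0.9, "looks_at_screen": 0.5,
          "head_pose": 0.8, "proximity": distance_score(1.0)},
    "b": {"looks_at_robot": 0.4, "looks_at_screen": 0.2,
          "head_pose": 0.5, "proximity": distance_score(3.0)},
}
chosen = select_user(users, weights)
```

With these toy values user `a` is selected. If `a` has just finished an interaction and is passed in `recent`, the penalty pushes their score below the threshold and no user is selected until someone else pays enough attention.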

The social state estimator is fully configurable by a set of parameters, with the initial parameter settings determined from extensive recorded lab data. The parameters were further fine-tuned during deployment to provide the most accurate and applicable social state estimates. To facilitate fine-tuning the parameters of the social state estimator, a visualiser is provided to display all relevant features (Figure 3).

Social signal generation

For the synthesis side, a repertoire of non-verbal social signals, including gestures and sounds, has been developed for the robot, available to be used in conjunction with moving and interacting with the users. The non-verbal behaviour of an embodied agent is at least as communicative as its verbal behaviour [34], and in a noisy environment such as a mall it may even be more important, so understanding and controlling the robot’s non-verbal signals is crucial. Examples of some robot gesture variants are shown in Figure 4.

In a series of perception experiments, we have examined how manipulating gesture parameters affects users’ subjective responses to the robot as well as their perception of the robot’s personality. These studies have found several clear relationships: for example, manipulating the amplitude and speed significantly affected users’ perception of the Extraversion and Neuroticism of the robot, while the attributed personality also affected users’ subjective reactions to the robot [8]. In addition, it was found that while the majority of users preferred a robot that they perceived to have a similar personality to their own, a significant minority preferred a robot whose personality was perceived to be different from their own [7].

We are currently integrating a finer-grained method of gesture control based on sentiment [10], as well as a set of affectively generated artificial sounds [14], with the goal of further enhancing the robot’s expressiveness.

Conversational interaction

The MuMMER system focuses on enabling an agent to combine a task-based dialogue system with chat-style open-domain social interaction, to fulfil the required tasks while at the same time being natural, entertaining, and engaging to interact with. The presented work is based on the “Alana” conversational framework, a finalist of the Amazon Alexa Prize challenge in both 2017 [24] and 2018 [9]. Alana was initially developed for the Amazon Echo as an open-domain social chatbot. For the needs of this project, Alana acts as the core module for every dialogue interaction between the user and any other module. This means that whenever a module needs either to notify the user verbally or to get the user’s feedback, Alana handles this task. In this way the conversation throughout the interaction is more contextually relevant and easier to maintain. Since the robot needs to engage in social dialogue as well as to complete tasks, Alana was enriched with so-called task bots to conversationally execute and monitor behaviours on a physical agent (Figure 5) [25].

In order to enable the functionality described above, a new Natural Language Understanding (NLU) module, HERMIT NLU, has been implemented and integrated into the Alana system. It is able to deal with social chit-chat and also to extract the necessary information from commands to start tasks. HERMIT NLU [33] is thus used to decide whether the task bot is triggered and to extract the required parameters for tasks, such as the name of the shop someone is looking for. While standard chatbots mostly rely on NLU that works on shallow semantic representations (e.g., intents + slots), task-based applications require richer characterisations. In line with [11], we promote the idea that the user’s intent can be represented through the combination of existing theories that capture different dimensions of the overall problem, namely Dialogue Acts and Frame Semantics. Existing approaches to NLU for dialogue systems are based on formal languages designed around the targeted domain. However, it has been widely demonstrated that statistical approaches generalise more robustly across lexical and domain variability [2]. We thus use a deep learning architecture based on a hierarchy of self-attention mechanisms and BiLSTM encoders followed by CRF tagging layers to perform multi-task learning over the aforementioned semantic dimensions [26]. The system effectively learns how to predict Dialogue Acts, Frames, and Frame Elements in a sequence labelling scheme, starting from a corpus of annotated sentences which we are currently developing.

Figure 5: Architecture of the Dialogue system. The blue parts on the left represent the task management and execution system, the green parts on the right represent Alana as the dialogue system. The Bot Ensemble contains social chat bots and the task bot which is able to trigger tasks and handle communication between the task and the user. The yellow middle part is the task specific dialogue management system (Arbiter), Text-To-Speech (TTS) and Automatic Speech Recognition (ASR).

After a task has been identified, executing it on a robot usually includes physical actions which, unlike dialogue actions, are not instantaneous and require a finite amount of time to complete. While the robot is executing such an action, the user might want to continue the conversation, or give new instructions. In order to support such multi-threaded dialogue management, in which tasks are interleaved with general chitchat and with other tasks, we build on the ideas presented in [23]. To this end, the execution system introduced in [12] has been extended to use so-called recipes that define the dialogue and physical actions to execute in order to achieve the given goal [25]. The execution framework described therein has been redesigned to support multi-threaded execution, and an arbitration process has been put in place to manage the currently running tasks on the execution side and in the Alana system. This lets tasks be started, stopped, and paused at any time, with appropriate feedback to the user. If a task has been suspended by another action, it will be resumed after the new action finishes, and any open questions will be re-raised to prompt the user.
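The suspend-and-resume behaviour can be pictured with a toy arbiter. The sketch assumes a single active task with suspended tasks kept on a stack; the class, its method names and the example question are hypothetical and do not reflect the interfaces of the MuMMER execution framework or of Alana.

```python
class Arbiter:
    """Toy arbiter: one active task, suspended tasks kept on a stack."""

    def __init__(self):
        self.active = None
        self.suspended = []
        self.log = []  # Records what would be executed or said, in order.

    def start(self, name, open_question=None):
        """Start a task, suspending whichever task is currently active."""
        if self.active is not None:
            self.suspended.append(self.active)
            self.log.append(f"suspend {self.active['name']}")
        self.active = {"name": name, "open_question": open_question}
        self.log.append(f"start {name}")

    def finish(self):
        """Finish the active task and resume the most recently suspended one."""
        if self.active is None:
            return
        self.log.append(f"finish {self.active['name']}")
        self.active = self.suspended.pop() if self.suspended else None
        if self.active is not None:
            self.log.append(f"resume {self.active['name']}")
            if self.active["open_question"]:
                # Re-raise the pending question to prompt the user again.
                self.log.append(f"ask {self.active['open_question']}")

arbiter = Arbiter()
arbiter.start("quiz", open_question="Which answer do you choose?")
arbiter.start("route_guidance")
arbiter.finish()
```

After the guidance task finishes, the quiz becomes active again and its pending question is asked once more, mirroring the re-raising of open questions described above.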

Route guidance supervision

One of the core tasks for the MuMMER robot in the mall is guiding users to specific locations, by pointing at places and explaining the route to the desired location. The task is triggered when a human asks for a location. A supervision system based on Jason [4], a BDI agent-oriented framework, handles the execution through Jason reactive plans. Throughout the task, the robot supervises the execution and, depending on what goes wrong, has multiple possible responses. For example, it is able to handle nominal scenarios of route guidance while also taking into account contingencies such as the human’s lack of visibility of the direction, his/her ability or inability to take the stairs, his/her understanding of the message, etc. Finally, if at some point the human is not perceived for a certain time, the robot ends the task, assuming that the human has left.

Route computing and route verbalization

The entire description of the route, from the search for the best route to the final destination to the verbalization of this route, is based on the SSR (Semantic Spatial Representation) [29]. This representation is used to describe the topology of an indoor environment as well as semantic information (type of stores or items sold by stores) in a single ontology. This ontology is managed by Ontologenius [30], a lightweight open-source ROS-compatible package which stores semantic knowledge, reasons with it, and shares that information with all the other system components.
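To make the idea of route search over a topological description concrete, the following sketch runs a breadth-first search on a small hand-written adjacency map and turns the result into a sentence. It does not use the SSR ontology or the Ontologenius API; the place names, the graph and the sentence template are invented for the example.

```python
from collections import deque

# Hypothetical topology: each place lists the places directly reachable from it.
TOPOLOGY = {
    "central_square": ["corridor_a", "corridor_b"],
    "corridor_a": ["central_square", "pharmacy"],
    "corridor_b": ["central_square", "escalator"],
    "escalator": ["corridor_b", "bookshop"],
    "pharmacy": ["corridor_a"],
    "bookshop": ["escalator"],
}

def find_route(graph, start, goal):
    """Breadth-first search: returns the path with the fewest hops, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph.get(path[-1], []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

def verbalise(path):
    """Turn a path into a simple spoken instruction."""
    steps = [name.replace("_", " ") for name in path[1:-1]]
    goal = path[-1].replace("_", " ")
    if not steps:
        return f"The {goal} is next to where we are."
    return f"Go through {', then '.join(steps)}, and you will reach the {goal}."

route = find_route(TOPOLOGY, "central_square", "bookshop")
sentence = verbalise(route)
```

A semantic layer, such as the type of store or the items it sells, could be added by attaching attributes to the nodes and resolving the goal from a request like "a place that sells books" before the search starts.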

Geometric reasoning

Geometric reasoning uses Underworlds [22, 28], a lightweight framework for cascading spatio-temporal situation assessment in robotics. It represents the environment as real-time distributed data structures containing scene graphs (for the representation of 3D geometries). Underworlds supports cascading representations: the environment is viewed as a set of worlds that can each have different spatial granularities, and may inherit from each other. It also provides a set of high-level client libraries and tools to introspect and manipulate the environment models. Based on a 3D model of the mall (Figure 6), it maintains what the robot knows about the scene as well as alternative world states. These states represent the estimation of the human’s beliefs about the scene. It also provides the symbolic relations among entities with stamped predicates (e.g. [] or [] when speaks and looks at (given by perception)).

Figure 6: Visualization of the visibility grids of a landmark on the 3D model of the central square of the Ideapark shopping center.

Motion planning

The navigation of the robot is implemented using the ROS navigation stack, with navfn as the global planner and a Timed Elastic Band (TEB) [27] planner as the local planner. For MuMMER, the local planner was modified to take humans into account during planning, inspired by [21], resulting in a new planner called Social TEB (S-TEB). This algorithm is able to plan and execute trajectories while satisfying the robot’s kinematic constraints, avoiding static and moving non-human obstacles, and respecting social constraints with regard to the perceived humans. The planner ensures the safety of humans by re-planning a local plan at each control loop.

SVP planner

Although the target robot location in the mall is in a large square, several elements of the environment can block the visibility of landmarks that are important for properly understanding the route to take. The purpose of the SVP (Shared Visual Perspective) planner [35] is therefore to find a position to which the human will have to go in order to observe an element of the environment such as a passage, a staircase or a store. To do this, a visibility grid is computed for each possible landmark, as shown in Figure 6. Having determined a good position for the human, the planner also determines a good position for the robot, so as to obtain a human-robot-landmark configuration that allows the robot both to point at the landmark and to look at the human.


A long-term deployment (three months from September 2019) will allow the study of customer behaviours around a helpful and entertaining robot over an extended period of time. This section gives details on the hardware and software that are being employed in the final deployment, as well as the scenarios that are supported.


Setup

The fully integrated MuMMER system consists of several hardware components to allow the computation to be performed on the appropriate platforms.

The robot we are using is an updated, custom version of the Pepper platform, which is equipped with an Intel D435 camera and an NVIDIA Jetson TX2 in addition to the traditional sensors that are found on the previous versions of the robot. We use the Robot Operating System (ROS) to enable the communication between the processing nodes. All the streams (audio, video, robot states) are sent to a remote laptop which performs all the computation. The laptop has an NVIDIA RTX 2080 graphics card (for the deep learning part) and 12 CPU cores. The perception algorithms process the Intel images at a resolution of for the detection and tracking parts, and at a resolution of for the re-identification part, which enables fast tracking and a good re-identification quality with OpenFace. The 4 microphone streams are processed at a frequency of  Hz, and the full perception system delivers the output at 10 fps.

To transcribe the user’s speech signal we use the Google Automatic Speech Recognition (ASR) API, which receives an enhanced audio signal from a delay-and-sum beamformer steered towards the location of the speaking person as determined by the audio-visual sensing. A dedicated ROS node streams the audio to the ASR, which returns in real time an incrementally updated string transcribing the utterance. Using silence to mark the end of speech, this transcription is enhanced using the context of the sentence to provide a more coherent result. Finally, the text output of the Google ASR is sent to the Alana framework to perform the dialogue task, through the arbitration module as explained above.
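The principle of a delay-and-sum beamformer can be shown with a short sketch: each channel is delayed so that sound from the chosen direction lines up across microphones, and the aligned channels are averaged. The example assumes a far-field source, a planar array, integer-sample delays and a toy two-microphone geometry with a 1 kHz sample rate; none of these values come from the deployed system.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, at room temperature

def steering_delays(mic_positions, azimuth_deg, sample_rate):
    """Delays (in samples) to apply to each channel for a far-field source.

    A microphone lying further along the source direction hears the wave
    earlier, so its channel must be delayed more to align with the others.
    """
    az = math.radians(azimuth_deg)
    direction = (math.cos(az), math.sin(az))
    advance = [(x * direction[0] + y * direction[1]) / SPEED_OF_SOUND
               for x, y in mic_positions]
    latest = min(advance)
    return [round((a - latest) * sample_rate) for a in advance]

def delay_and_sum(channels, delays):
    """Delay each channel by its integer delay and average the aligned samples."""
    length = len(channels[0])
    output = []
    for n in range(length):
        acc = 0.0
        for signal, delay in zip(channels, delays):
            idx = n - delay
            if 0 <= idx < length:
                acc += signal[idx]
        output.append(acc / len(channels))
    return output

# Toy set-up: two microphones 0.343 m apart along x, source straight ahead.
mics = [(0.0, 0.0), (0.343, 0.0)]
delays = steering_delays(mics, azimuth_deg=0.0, sample_rate=1000)
channels = [
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],  # pulse reaches microphone 0 at sample 3
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],  # and microphone 1 one sample earlier
]
enhanced = delay_and_sum(channels, delays)
```

Signals arriving from the steered direction add coherently, whereas sound from other directions is misaligned after the delays and is attenuated by the averaging, which is why the beamformer output is a cleaner input for speech recognition in a noisy mall.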

The system is deployed in two languages, English and Finnish. Because of the vast linguistic differences between the two languages, the two versions have been kept separate, and a whole interaction takes place in either one or the other. Due to the complexity of the NLU module, in the Finnish version of the system the user’s utterance is translated into English using the Google Translate API. The result of this translation is sent to the Alana conversational framework and goes through the NLU pipeline described in detail in [9]. In the English version of the system, Alana then returns the reply to be verbalised. Due to the relatively poor performance of Google Translate when translating English into Finnish (as remarked upon by our Finnish partners), the Finnish version of Alana has a much reduced set of bots in its ensemble (see Figure 5). These bots mainly return answers based on templates that have been translated into Finnish beforehand.

Scenarios deployed

As a proof of concept, a real-time autonomous system has been built to integrate all the components described in the sections above. The following types of interactions can be triggered by the user:


Chat

The staple of the interaction is social dialogue [9]. During all other modes of interaction, the user can always fall back on simply chatting to the robot, irrespective of whether it is currently executing a goal/task (e.g. the user requires guidance to a specific shop) or not. For example, the user might approach the robot and start discussing various topics. At specific points throughout this conversation, the system might explain its capabilities to the user in order to recover from a conversational stalemate or simply to make them aware that it can also help them find their way around the mall (see below).


Quiz

In this scenario, multiple-choice questions are asked by the robot, and the human replies by stating the number of the answer they think is correct.

Route Description (Dialogue only)

When the human asks how to get to a specific shop, the robot gives him/her the route description. In this “simplest” form, the system uses only verbal interaction. In particular, the route description is presented merely as a string of synthesised text.

Route Guidance (Dialogue + Pointing)

In this version, the robot guides the human to specific locations in the mall by pointing at places and explaining the route to the desired location. To do so, the robot first computes positions from which the human will be able to see what the robot is pointing at. Then, the robot navigates to its position (this part is optional), expecting that the human will join it once it has stopped, and checks the human’s visibility. Then, the robot explains to the human how to reach the destination. Following a human-human guidance study [3], the robot points first in the direction of the location and then at the access point (a corridor, stairs or an escalator) to go through to reach the location. While pointing, the robot verbalizes the route. Finally, the robot checks that the human knows how to reach the goal and leaves open the possibility of repeating the explanation if needed. Throughout the task, the robot supervises the execution and adapts accordingly.

All these modes of interaction can be interleaved at the user’s discretion. This means, for example, that during the quiz the user could revert to social dialogue. If they do so, the system might occasionally try to bring them back to the quiz by re-raising the last question. The same holds true if the person chooses to abandon a route guidance task before it has finished.


The MuMMER project has built a fully autonomous entertainment robot to perform HRI scenarios in a shopping mall, in which the main goals are entertaining interaction (quiz, chat) and route guidance. The system runs in real time by offloading the heavy deep learning computation to a remote laptop, the ASR to the Google platform, and the Alana conversational AI system to a remote server. This system enables a natural interaction with the participants; it has been tested in real conditions for several short sessions, and as of September 2019 is fully deployed for a three-month long-term user study.

Further work towards large-scale deployment could include software optimizations to run more components on the robot itself, and to reduce the lag which sometimes exists between the human speech and the robot reply.


This research has been partially funded by the European Union’s Horizon 2020 research and innovation program under grant agreement no. 688147 (MuMMER).


  • [1] B. Amos, B. Ludwiczuk, and M. Satyanarayanan (2016)

    OpenFace: a general-purpose face recognition library with mobile applications

    Technical report CMU-CS-16-118, CMU School of Computer Science. Cited by: Audio-visual sensing.
  • [2] E. Bastianelli, D. Croce, A. Vanzo, R. Basili, and D. Nardi (2016) A discriminative approach to grounded spoken language understanding in interactive robotics. In

    Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence

    IJCAI’16, pp. 2747–2753. Cited by: Conversational interaction.
  • [3] K. Belhassein, A. Clodic, H. Cochet, M. Niemelä, P. Heikkilä, H. Lammi, and A. Tammela (2017-12) Human-Human Guidance Study. Technical report Technical Report 17596, LAAS. External Links: Link Cited by: item Route Guidance (Dialogue + Pointing).
  • [4] R. H. Bordini, J. F. Hübner, and M. Wooldridge (2007) Programming multi-agent systems in agentspeak using jason (wiley series in agent technology). John Wiley & Sons, Inc., USA. External Links: ISBN 0470029005 Cited by: Route guidance supervision.
  • [5] Y. Cao, O. Canévet, and J. Odobez (2018) Leveraging convolutional pose machines for fast and accurate head pose estimation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, Cited by: Audio-visual sensing.
  • [6] Z. Cao, T. Simon, S. Wei, and Y. Sheikh (2017) Realtime multi-person 2d pose estimation using part affinity fields. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    pp. 7291–7299. Cited by: Audio-visual sensing.
  • [7] B. Craenen, A. Deshmukh, M. E. Foster, and A. Vinciarelli (2018-08) Do we really like robots that match our personality? the case of big-five traits, godspeed scores and robotic gestures. In Proceedings of the 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Vol. . Cited by: Social signal generation.
  • [8] B. G.W. Craenen, A. Deshmukh, M. E. Foster, and A. Vinciarelli (2018) Shaping gestures to shape personalities: the relationship between gesture parameters, attributed personality traits, and Godspeed scores. In Proceedings of the 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 699–704. Cited by: Social signal generation.
  • [9] A. C. Curry, I. Papaioannou, A. Suglia, S. Agarwal, I. Shalyminov, X. Xu, O. Dušek, A. Eshghi, I. Konstas, V. Rieser, et al. (2018) Alana v2: entertaining and informative open-domain social dialogue using ontologies and entity linking. Alexa Prize Proceedings. Cited by: Conversational interaction, item Chat, Setup.
  • [10] A. Deshmukh, M. E. Foster, and A. Mazel (2019-10) Contextual non-verbal behaviour generation for humanoid robot using text sentiment. In Proceedings of the 28th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Cited by: Social signal generation.
  • [11] M. Dinarelli, S. Quarteroni, S. Tonelli, A. Moschitti, and G. Riccardi (2009-03) Annotating spoken dialogs: from speech segments to dialog acts and frame semantics. In Proceedings of SRSL 2009, the 2nd Workshop on Semantic Representation of Spoken Language, Athens, Greece, pp. 34–41. External Links: Link Cited by: Conversational interaction.
  • [12] C. Dondrup, I. Papaioannou, J. Novikova, and O. Lemon (2017) Introducing a ROS based planning and execution framework for human-robot interaction. In Proceedings of the 1st ACM SIGCHI International Workshop on Investigating Social Interactions with Artificial Agents, ISIAA 2017, pp. 27–28. External Links: Link, Document Cited by: Conversational interaction.
  • [13] M. E. Foster, R. Alami, O. Gestranius, O. Lemon, M. Niemelä, J. Odobez, and A. K. Pandey (2016) The MuMMER project: engaging human-robot interaction in real-world public spaces. In Social Robotics, Cham, pp. 753–763. Cited by: Introduction.
  • [14] H. Hastie, P. Dente, D. Küster, and A. Kappas (2016) Sound emblems for affective multimodal output of a robotic tutor: a perception study. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp. 256–260. External Links: Document Cited by: Social signal generation.
  • [15] W. He, P. Motlicek, and J. Odobez (2018-05) Deep neural networks for multiple speaker detection and localization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 74–79. External Links: Document, ISSN 2577-087X Cited by: Audio-visual sensing.
  • [16] W. He, P. Motlicek, and J. Odobez (2019-05) Adaptation of multiple sound source localization neural networks with weak supervision and domain-adversarial training. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 770–774. External Links: Document, ISSN 2379-190X Cited by: Audio-visual sensing.
  • [17] W. He, P. Motlicek, and J. Odobez (2018) Joint localization and classification of multiple sound sources using a multi-task neural network. In Proceedings of Interspeech 2018, pp. 312–316. External Links: Document, Link Cited by: Audio-visual sensing.
  • [18] P. Heikkilä, H. Lammi, and K. Belhassein (2018) Where can I find a pharmacy? - human-driven design of a service robot’s guidance behaviour. In Proceedings of PubRob 2018. Cited by: Introduction.
  • [19] P. Heikkilä, M. Niemelä, G. Sarthou, A. Tammela, A. Clodic, and R. Alami (2019) Should a robot guide like a human? a qualitative four-phase study of a shopping mall robot. In International Conference on Social Robotics (ICSR). Cited by: Introduction.
  • [20] V. Khalidov and J. Odobez (2017-02) Real-time multiple head tracking using texture and colour cues. Idiap-RR Technical Report Idiap-RR-02-2017, Idiap. External Links: Link Cited by: Audio-visual sensing.
  • [21] H. Khambhaita and R. Alami (2017-12) Viewing Robot Navigation in Human Environment as a Cooperative Activity. In International Symposium on Robotics Research (ISRR 2017), Puerto Varas, Chile, 18 pp. External Links: Link Cited by: Motion planning.
  • [22] S. Lemaignan, Y. Sallami, C. Wallbridge, A. Clodic, T. Belpaeme, and R. Alami (2018-10) UNDERWORLDS: Cascading Situation Assessment for Robots. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems, Madrid, Spain. External Links: Link Cited by: Geometric reasoning.
  • [23] O. Lemon, A. Gruenstein, A. Battle, and S. Peters (2002) Multi-tasking and collaborative activities in dialogue systems. In Proceedings of the 3rd SIGdial Workshop on Discourse and Dialogue - Volume 2, pp. 113–124. External Links: Link, Document Cited by: Conversational interaction.
  • [24] I. Papaioannou, A. Cercas Curry, J. Part, I. Shalyminov, X. Xu, Y. Yu, O. Dušek, V. Rieser, and O. Lemon (2017) Alana: social dialogue using an ensemble model and a ranker trained on user feedback. In 2017 Alexa Prize Proceedings. Cited by: Conversational interaction.
  • [25] I. Papaioannou, C. Dondrup, and O. Lemon (2018) Human-robot interaction requires more than slot filling - multi-threaded dialogue for collaborative tasks and social conversation. In FAIM/ISCA Workshop on Artificial Intelligence for Multimodal Human Robot Interaction, pp. 61–64. Cited by: Conversational interaction, Conversational interaction.
  • [26] A. Rastogi, R. Gupta, and D. Hakkani-Tur (2018-07) Multi-task learning for joint language understanding and dialogue state tracking. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, Melbourne, Australia, pp. 376–384. External Links: Link Cited by: Conversational interaction.
  • [27] C. Rösmann, F. Hoffmann, and T. Bertram (2017) Integrated online trajectory planning and optimization in distinctive topologies. Robotics and Autonomous Systems 88, pp. 142–153. Cited by: Motion planning.
  • [28] Y. Sallami, S. Lemaignan, A. Clodic, and R. Alami (2019-10) Simulation-based physics reasoning for consistent scene estimation in an HRI context. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019), Macau, China. Note: To appear Cited by: Audio-visual sensing, Geometric reasoning.
  • [29] G. Sarthou, R. Alami, and A. Clodic (2019-06) Semantic Spatial Representation: a unique representation of an environment based on an ontology for robotic applications. In SpLU-RoboNLP 2019, Minneapolis, United States, pp. 50–60. External Links: Link Cited by: Route computing and route verbalization.
  • [30] G. Sarthou, R. Alami, and A. Clodic (2019-10) Ontologenius: A long-term semantic memory for robotic agents. In RO-MAN 2019, New Delhi, India. External Links: Link Cited by: Route computing and route verbalization.
  • [31] S. Sheikhi and J.M. Odobez (2015-11) Combining dynamic head pose and gaze mapping with the robot conversational state for attention recognition in human-robot interactions. Pattern Recognition Letters 66, pp. 81–90. Cited by: Audio-visual sensing.
  • [32] R. Siegfried, Y. Yu, and J.-M. Odobez (2017-11) Towards the use of social interaction conventions as prior for gaze model adaptation. In 19th ACM International Conference on Multimodal Interaction (ICMI), Glasgow. Cited by: Audio-visual sensing.
  • [33] A. Vanzo, E. Bastianelli, and O. Lemon (2019-09) Hierarchical multi-task natural language understanding for cross-domain conversational AI: HERMIT NLU. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, Stockholm, Sweden. Note: To appear Cited by: Conversational interaction.
  • [34] A. Vinciarelli, M. Pantic, and H. Bourlard (2009-11) Social signal processing: survey of an emerging domain. Image and Vision Computing 27 (12), pp. 1743–1759. External Links: Link, Document Cited by: Social signal generation.
  • [35] J. Waldhart, A. Clodic, and R. Alami (2019-10) Reasoning on Shared Visual Perspective to Improve Route Directions. In 2019 28th IEEE International Conference on Robot & Human Interactive Communication, New Delhi, India. External Links: Link Cited by: SVP planner.