Quadrupedal Robotic Guide Dog with Vocal Human-Robot Interaction

by Kavan Mehrizi, et al.
berkeley college

Guide dogs play a critical role in the lives of many people; however, training them is a time- and labor-intensive process. We are developing a method that allows an autonomous robot to physically guide humans using direct human-robot communication. The proposed algorithm will be deployed on a Unitree A1 quadrupedal robot, which will autonomously navigate the person to their destination while communicating with them through a speech interface compatible with the robot. This speech interface uses cloud-based services, Amazon Polly and Google Cloud, as the text-to-speech and speech-to-text engines, respectively.




I Introduction

The training and maintenance of a traditional guide dog present challenges to the elderly, frail, and visually impaired. Each guide dog has to be trained individually in a time- and labor-intensive process, and the skills gained by one dog cannot be transferred to another. In addition, guide dogs may get ill or need to retire, which forces the user to find a replacement dog that may not be a good match [1]. An autonomous robot that could lead people in need of assistance through a multi-floor building would ease the burdens that come with a traditional guide dog. Most previous robotic guides are cumbersome: either their bulky size limits how well they can maneuver in narrow and complex spaces, or they rely on physical interaction between the robot and the user, who holds a leash or rigid arm, without any way for the user to give verbal commands such as rerouting or stopping the robot [2]–[4]. In addition, none of these guide robots can guide and navigate in multi-floor situations. In early 2021, Xiao et al. successfully used a quadrupedal robot to guide a subject; however, the model relied solely on physical interaction through a leash, and the person being led had no way to communicate directly with the robot [5]. A small quadrupedal robot that has a leash and can also speak to, and listen for commands from, the person being guided would solve these issues. We seek to accomplish this by using a Unitree A1 quadrupedal robot [6] to autonomously navigate a visually impaired person through a multi-floor environment, with algorithms that support a custom wake-up word and communicate with the user via text-to-speech (TTS) and speech-to-text (STT) cloud services.

II Methodology

The robot communicates vocally with the user and understands the user through text-to-speech and speech-to-text services. We first found basic open-source code [7] that integrates Amazon Polly, a cloud service that lets the robot speak to the user directly: a string of text is sent to Amazon Web Services, which submits it to Amazon Polly to generate an audio stream. That audio stream is then retrieved from Amazon Polly and played through a speakerphone installed on the robot. We then made the code compatible with the robot’s infrastructure, which relies on the Robot Operating System (ROS). For the robot to understand what the user is saying, we use the Google Cloud Speech-to-Text Application Programming Interface (API), which takes audio data from a source and converts it into a string of text. To use this API, we found open-source code on GitHub that is compatible with ROS and configured it for the robot’s infrastructure [8]. Audio data for Google Cloud is captured by the speakerphone on the robot. The resulting string of text is returned to the STT algorithm, which checks whether the customizable wake-up word has been said. If not, the algorithm ignores what was said and resumes listening. When the wake-up word is said, the string is sent to a word dictionary function that searches the text for keywords and holds preset coordinates for each keyword. Once it has determined where the user wants to go, the algorithm publishes those coordinates to the navigation goal node. The STT algorithm also publishes a string of text to TTS so that the robot can respond to the user. The robot’s navigation subscribes to that STT publisher and creates a path to the target point.
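The wake-up word check and keyword dictionary lookup described above can be sketched in plain Python. This is a minimal illustration, not the code deployed on the robot: the names `parse_command`, `Command`, and `LOCATIONS`, the coordinate values, and the choice to return `None` when no keyword is found are all assumptions, and the ROS publishing step is left out so that the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

WAKE_WORD = "hey a1"  # customizable wake-up word (value taken from the Results section)

# Hypothetical keyword -> (x, y) map coordinates; the real values are not given in the paper.
LOCATIONS: Dict[str, Tuple[float, float]] = {
    "lab": (4.0, 2.5),
    "office": (-1.5, 6.0),
}


@dataclass(frozen=True)
class Command:
    keyword: str                # matched location keyword
    goal: Tuple[float, float]   # coordinates to hand to navigation
    reply: str                  # text to hand to the TTS engine


def normalize(text: str) -> str:
    """Lowercase the transcript and replace punctuation with spaces."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return " ".join(cleaned.split())


def parse_command(transcript: str,
                  wake_word: str = WAKE_WORD,
                  locations: Dict[str, Tuple[float, float]] = LOCATIONS
                  ) -> Optional[Command]:
    """Return a Command only if the wake-up word precedes a known keyword."""
    text = f" {normalize(transcript)} "   # pad so only whole words match
    wake = f" {normalize(wake_word)} "
    start = text.find(wake)
    if start == -1:
        return None                       # no wake-up word: ignore the speech
    remainder = text[start + len(wake) - 1:]  # keep the leading space
    for keyword, goal in locations.items():
        if f" {keyword} " in remainder:
            return Command(keyword, goal, f"Okay, navigating to the {keyword}.")
    return None                           # wake-up word heard, but no known keyword


if __name__ == "__main__":
    for heard in ("Hey A1, take me to the lab.", "Take me to the office."):
        print(repr(heard), "->", parse_command(heard))
```

In a ROS deployment, the `goal` field would be published to the navigation goal node and the `reply` field to the TTS node; adding a destination only requires a new entry in the dictionary.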

III Results

We tested the speech interface in simulation using a navigation map, shown in Figure 1, that the robot would build with its onboard lidar camera. In this simulation, the user said to the robot, “Hey A1, take me to the lab.” The speech interface successfully heard the user’s command and converted it into a string of text. It then published the preset coordinates of the laboratory from the dictionary to the navigation goal node. The robot’s navigation subscribed to that node and created a path to the goal location, shown in Figure 2. Finally, the robot responded to the user by saying, “Okay, navigating to the lab.” The user then said, “Take me to the office.” Although this sentence could be a command, the speech interface correctly ignored it because the user did not use the wake-up word, which was set to “Hey A1.” The robot’s navigation was not affected and no response was given. Only when the user repeated the sentence with the wake-up word did the algorithm recognize it as a valid command. The speech interface then sent the coordinates of the office to the robot’s navigation pipeline, which created the new path shown in Figure 3.

Fig. 1: Simulated navigation map showing initial position in the purple circle.
Fig. 2: Navigation map after receiving navigation goal coordinates from speech interface. The green figures portray the final position of the user and robot, while the green line is the path created to that final position goal.
Fig. 3: Navigation map after receiving new navigation goal coordinates from speech interface.

IV Discussion and Future Work

These results show that the TTS and STT engines were successfully integrated with our algorithm, which communicated with both engines and with the robot’s infrastructure. The robot ignored all irrelevant speech, sending the string of text to the dictionary function only when the wake-up word was said. Unlike previous robots with a speech interface, ours supports a custom wake-up word and does not rely on an Amazon Echo device [2]. We were able to set a navigation goal solely by speaking a command to the robot. Our previous work, while having a leash, relied on an external computer to input commands, so the user could not communicate with the robot themselves [5]. This work improves the user experience by allowing explicit interaction in addition to the implicit interaction provided by a leash.

We are currently developing the guide dog robot further so that it can operate an elevator, which would allow multi-floor navigation. To support this, we are restructuring the robot’s navigation to take floors into consideration. Multi-floor operation also means that the speech interface must send coordinates that indicate which floor the navigation goal is on. The speech interface will be extended with more commands, such as telling the robot to stop at its current position, and will give the user instructions when needed. We also need to optimize the speech interface further, for example by making it easier to add new commands and new locations to the algorithm.
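One way the planned floor-aware goals and easier location entry could look is sketched below. This is purely illustrative and does not describe the authors’ design: `FloorGoal`, `LocationBook`, `needs_elevator`, and the example coordinates are all hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class FloorGoal:
    """A navigation goal that also records which floor it is on (hypothetical)."""
    x: float
    y: float
    floor: int


class LocationBook:
    """Keyword -> goal table that can be extended at run time."""

    def __init__(self) -> None:
        self._goals: Dict[str, FloorGoal] = {}

    def add(self, keyword: str, x: float, y: float, floor: int) -> None:
        # Keywords are stored in lowercase so lookups are case-insensitive.
        self._goals[keyword.lower()] = FloorGoal(x, y, floor)

    def lookup(self, keyword: str) -> Optional[FloorGoal]:
        return self._goals.get(keyword.lower())


def needs_elevator(current_floor: int, goal: FloorGoal) -> bool:
    """True when the goal lies on a different floor than the robot."""
    return goal.floor != current_floor


if __name__ == "__main__":
    book = LocationBook()
    book.add("Cafeteria", 3.0, 1.0, floor=2)   # made-up coordinates
    goal = book.lookup("cafeteria")
    print(goal, "elevator needed from floor 1:", needs_elevator(1, goal))
```

Keeping the floor alongside the coordinates lets the navigation layer decide when an elevator ride is required without changing how the speech interface matches keywords.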

V Conclusion

A speech interface makes it simpler for the user to send commands to the robot. We developed and tested a speech interface algorithm that communicates with the TTS and STT engines as well as with the robot’s navigation pipeline. The main advantages of this work are that the wake-up word can be customized, because the speech interface is our own, and that custom commands can be created fairly easily by adding them to the word dictionary. The speech interface can also be integrated with a leash on a maneuverable robot.


This work was supported by the Hopper Dean Foundation and National Science Foundation Award #1757690. The Transfer-to-Excellence program is sponsored by the National Science Foundation and the Center for Energy Efficient Electronics (NSF #0939514). I would like to thank my mentor, Zhongyu Li, for his guidance and support throughout this experience. I also want to thank my Principal Investigator, Koushil Sreenath, for giving me the opportunity to be a part of his research group. I would also like to thank Nicole McIntyre, Tony Vo Hoang, Sam Mountain, Gary Yang, and the Hybrid Robotics Group for their constant support.