
Smart speaker design and implementation with biometric authentication and advanced voice interaction capability

Advancements in semiconductor technology have reduced the dimensions and cost of chipsets while improving their performance and capacity. In addition, advances in AI frameworks and libraries make it possible to accommodate more AI at the resource-constrained edge of consumer IoT devices. Sensors are nowadays an integral part of our environment, providing continuous data streams for building intelligent applications. An example is a smart home with multiple interconnected devices. In such smart environments, for convenience and quick access to web-based services and personal information such as calendars, notes, emails, reminders, and banking, users link third-party skills or skills from the Amazon store to their smart speakers. Several smart home products, such as smart security cameras, video doorbells, smart plugs, smart carbon monoxide monitors, and smart door locks, are also interlinked with a modern smart speaker via custom skill addition. Since smart speakers are linked to such services and devices via the smart speaker user's account, they can be used by anyone with physical access to the smart speaker via voice commands, compromising the user's data privacy, home security, and other interests. The recently launched Tensor Cam's AI Camera, Toshiba's Symbio, and Facebook's Portal are camera-enabled smart speakers with AI functionalities. Although they are camera-enabled, they have no authentication scheme beyond calling out the wake word. This paper provides an overview of the cybersecurity risks smart speaker users face due to the lack of an authentication scheme and discusses the development of a state-of-the-art camera-enabled, microphone array-based modern Alexa smart speaker prototype to address these risks.



1 Introduction

Recent advancements in technology (particularly IoT and AI) are having a great impact on our day-to-day life [sudharsan2020avoid] [sudharsan2022ris] [sudharsan2020adaptive]. In a smart home scenario, multiple smart devices are interlinked and work in collaboration with each other to serve a common goal [sudharsan2020edge2train] [sudharsan2021toward] [sudharsan2021globe2train]. Smart speakers are one such class of smart devices, widely adopted by everyday users and becoming an integral part of smart homes. The AI assistants built into recent smart speakers can understand voice commands and control the complex integrated systems of a smart home. While voice commands provide an easy mechanism to interact with complex systems, they also introduce a security risk: control of these systems is handed to any user who has access to the smart speaker and can deliver voice commands. There is a strong need to introduce biometrics-based authentication mechanisms for smart speakers to strengthen the security of integrated systems without compromising the rich user experience. Given the lack of reliability of existing voice authentication systems, the ideal solution is to introduce additional authentication techniques.

When a person claims to be the registered smart speaker user, there is a need to provide a factor to prove "the user is who she says she is". This factor can be something the user knows (a PIN or password), something the user has (a physical token), or something the user is (biometrics). Biometric authentication is best suited since the factor is part of the user, which makes the smart speaker's authentication process hands-free. Voice authentication analyzes the user's voice to verify identity based on the user's unique vocal attributes. It is ideal for hands-free usage of standalone devices like smartphones, smart speakers, and voice-based systems in automobiles, since its integration is cost-effective, familiar and convenient for most users, less invasive (contactless), and more hygienic. Its downsides are that it is not as accurate as other biometric modalities [awarebiometrics_2018_voice], that it requires an additional liveness detection system, and that background noise impacts voice matching performance [awarebiometrics_2018_voice]. Biometric authentication solutions such as Knomi [awarebiometrics_2017_knomi] provide a family of biometric matching and liveness detection algorithms that use both face and voice for authentication. Likewise, Sensory's TrulyHandsfree [a2014_trulysecure] uses proprietary face recognition, voice recognition, and biometric fusion algorithms, leveraging computer vision, speech processing, and machine learning, to provide on-device, almost instantaneous authentication. The SDKs of such multi-modal authentication systems are suited to building applications for smartphones and tablets, not for smart speakers, because of the latter's low hardware specifications [sudharsan2021ml] [sudharsan2021porting] [sudharsan2021sram].

The recently launched Tensor Cam's AI Camera, Toshiba's Symbio, and Facebook's Portal are camera-enabled smart speakers with AI functionalities [sudharsan2020rce] [sudharsan2021tinyml] [sudharsan2021enabling]. Although they are camera-enabled, they have no authentication scheme beyond calling out the wake word. The modern Alexa smart speaker discussed in this paper is constructed from off-the-shelf hardware components (Raspberry Pi, ReSpeaker v2, Raspberry Pi camera, regular speaker). A biometrics-based authentication system for such Alexa smart speakers is designed by adding a camera module and introducing face recognition algorithms. The face recognition algorithm was trained using a deep neural network that can detect and identify human faces for authentication. Additionally, it identifies and recognizes faces during the human gaze, waking up Alexa only when a known face is recognized. To provide seamless, full-duplex user-Alexa interaction, a microphone array with an on-board chip hosting DSP-based speech algorithms was selected and used to capture, process, and provide a noise-suppressed voice feed to Alexa. Our proof-of-concept prototype demonstrates a rich user experience for interacting with smart speakers by providing an extra layer of authentication and facilitating improved voice interaction with the device.

2 Cybersecurity risks due to lack of authentication schemes in smart speakers

Users start interacting with a regular Alexa smart speaker by waking up the Alexa AI voice assistant by calling out the "Alexa" wake word, followed by regular dialogue-based interaction. Currently, a few Alexa devices support voice profiles [a2019_amazoncom] to provide a personalized interaction experience with the supported features. For this, the user trains Alexa using their voice, then links the trained voice with the corresponding Alexa user account. But this feature is only voice-based user identification rather than authentication. First, this existing voice-biometric feature is limited to a few Alexa-supported features and does not act as a voice-biometric authentication method for the whole smart speaker system. Second, it has been shown that a similar voice might be able to fool Amazon's and Google's voice recognition [gebhart_2017_fooling], and Google itself warns, while the user sets up voice recognition for the first time, that a similar voice might be able to access their information. According to a guide to the security of voice-activated smart speakers, an ISTR Special Report published in 2017 [a2017_a], and other similar research articles, the following are a few cybersecurity risks that smart speaker users are exposed to in the absence of a user authentication scheme.

  1. The curious child attack: There is always a risk that a child makes a purchase via voice commands to the smart speaker without the knowledge of the linked account's owner.

  2. The mischievous neighbor's tale: A neighbor wanting to cause mischief could send commands to the smart speaker at ultrasonic frequencies, which cannot be heard by humans but can be detected by smart speakers.

  3. "This parrot keeps trying to buy food by speaking to Alexa" [charlton_2018_this]: A parrot managed to successfully add items such as strawberries, a light bulb, and a kettle to its owner's online shopping cart. Such activities could be avoided by using a PIN, but the parrot could potentially learn and repeat the PIN too.

  4. Talking television troubles: Simply watching television or listening to the radio can wake up and interact with the smart speaker.

  5. Physical access: Anyone proximate to the smart speaker can wake it up, interact with it, and extract information from the actual user's calendar, reminders, and other linked applications.

  6. Biometric-override attack [feng_2017_continuous]: An attacker can inject voice commands [panjwani_crowdsourcing] by replaying a previously recorded clip of the victim's voice or by impersonating the victim's voice.

  7. Malicious commands: Someone can generate malicious commands that are heard as garbled sounds by human ears while smart speakers interpret them as commands. Such commands can be embedded in online videos or TV advertisements to attack devices [alanwar_2017_echosafe]. As smart speakers are always listening, they are susceptible to such security attacks [sudharsan2021edge2guard] by devices that can generate malicious audio; audio from a television news broadcast once triggered Amazon Echo devices to place orders for dollhouses [liptak_2017_amazons].

To address these issues, one possible existing method is to provide voice-biometrics-based authentication for crucial third-party applications such as calendars, email, and banking that are linked to Alexa. This can be done by integrating a third-party voice biometric API such as ArmorVox [a2019_welcome]. In doing so, however, the raw voice file captured by the smart speaker is exposed to the third-party API, which raises privacy and data security concerns. Moreover, these approaches do not provide an authentication method for the whole smart speaker system and still leave the system exposed to risks. Since most state-of-the-art smart speakers have no authentication method, they are largely ineffective at alleviating the issues above. The prototype developed and described in this paper interacts with the Alexa API using a noise-suppressed audio feed captured from a microphone array and, in addition, performs biometrics (facial recognition) based system wakeup on top of calling out the Alexa wake word. The importance of biometrics-based authentication for smart speakers was discussed in this section; the development of such a biometrics-enabled smart speaker prototype is discussed in the following sections.

3 Related Work

VAuth [feng_2017_continuous] is proposed for continuous authentication of voice assistants to defend against the threats caused by the open nature of the smart speaker's voice channel. VAuth is a separate embedded system adopted on wearable devices such as eyeglasses, earphones/earbuds, and necklaces. It senses the body-surface vibrations of the user and matches them with the speech signal received by the voice assistant's microphone. Although VAuth achieved 97% detection accuracy, it is not feasible to charge, maintain, and carry a separate embedded system attached to the user's body just to authenticate a smart speaker. Daon's IdentityX [a2019_daon] is a multi-modal, vendor-agnostic identity services platform that provides additional biometrics-based authentication using a smartphone, but only while using financial services apps via Alexa. This process involves a secondary gadget (a smartphone) and is still not an authentication scheme for the entire smart speaker system, leaving Alexa exposed to the risks discussed in Section 2. EchoSafe [alanwar_2017_echosafe] is a sonar-based defense against attacks that occur via malicious voice commands from nearby devices during periods when the user is away. When the user sends a critical command to the smart speaker, an audio pulse is emitted from the smart speaker, followed by post-processing to determine whether the user is present in the room. The authors claim the EchoSafe system detects the user's presence during critical commands with 93.13% accuracy. EchoSafe addresses only attacks via malicious voice commands and is not suited for the other vulnerabilities.

4 Overview of biometric authentication and speech algorithm based smart speaker

The first objective of this work is to provide a face-biometrics-based authentication scheme for the entire smart speaker system. To this end, a camera module is added to the smart speaker prototype as shown in Fig. 2. The lack of authentication schemes in regular smart speakers leaves an open door for anyone in the vicinity to access the user's private information. Since the prototype discussed in this paper has a camera module and is equipped with face-recognition-based Alexa wakeup scripts, it provides an extra layer of authentication. As shown in Fig. 1, the registered user first gazes at the camera to authenticate to the system, then calls out the Alexa wake word and starts the regular dialogue-based interaction with Alexa. Section 4.3 discusses the algorithms involved in waking up the system when a known face gazes at the prototype. The second objective is to capture and provide high-quality, noise-suppressed voice input to Alexa to achieve seamless, full-duplex user-Alexa speech interaction. For this, the ReSpeaker v2 microphone array is used rather than a single microphone, since it can segregate speech from noise. This mic array also has an inbuilt high-performance processor loaded with on-chip advanced DSP (Digital Signal Processing) based speech algorithms, which enable users to interact with Alexa from five meters or further away from the smart speaker, interact while walking around the room, etc. This mic array's role and the benefits of using it to capture, process, and provide voice input for Alexa are discussed in Section 4.2. The third objective is to improve the user experience by making sure the smart speaker is not activated accidentally when the wake word is not called out, and that the Alexa wake word is spotted in the input audio streams with high accuracy. For this, a third-party wake-word engine, discussed in Section 4.5, is integrated with the Alexa Voice Service C++ SDK, discussed in Section 4.4.

Figure 1: High-level system diagram of Alexa smart speaker prototype with biometrics-based wakeup

4.1 Hardware components of the smart speaker prototype

This modern smart speaker prototype is constructed from a commercial off-the-shelf advanced microphone array with inbuilt DSP (ReSpeaker v2), a camera module (Raspberry Pi camera), and a regular speaker interfaced to a single-board computer, as shown in Fig. 1.

Figure 2: Hardware prototype of smart speaker system
  1. Selection of single-board computer: BeagleBone Black, Orange Pi 3, LattePanda 2G/32G, and Banana Pi M4 are the SBCs of our interest. The Raspberry Pi 3 Model B+ (the Pi 4 was not yet released) was chosen considering its form factor, price-performance balance, low power consumption, compatibility with off-the-shelf devices, and community-created guides, tutorials, and support. As illustrated in Fig. 1, Python scripts leveraging external libraries are deployed on this Raspberry Pi Linux SBC. These scripts are responsible for waking up the Alexa Sample App when a known face is recognized in the captured live frames.

  2. Selection of camera unit: For real-time computer vision applications, the Raspberry Pi Camera V2 is preferred since it is capable of 1080p 30 fps video encoding and 5 MP stills. Since the camera is connected directly to the GPU via the CSI connector as shown in Fig. 1, there is only a small impact on the Pi's CPU, leaving it available for other processing. Most cost-effective web cameras do not have built-in encoding like the Pi camera; hence, web cameras consume additional CPU resources, reducing the overall performance of the system.

  3. Selection of microphone array: The microphone is a crucial part of a smart speaker system. Since we require pre-processing of sound using speech algorithms, the focus is on a microphone array with built-in advanced DSP algorithms. ReSpeaker v2, Matrix Creator, PS3 Eye, the Conexant 4-mic development kit, MiniDSP UMA-8, and Microsemi AcuEdge ZLK38AVS are the microphone arrays of our interest. ReSpeaker v2 has a good success rate for hot-word detection as distance increases, tested in a silent room, a room with white noise, and a room with background music [rouchon_2017_benchmarking]. The PS3 Eye has an edge over the ReSpeaker v2, but the ReSpeaker v2 was chosen for this project because the Raspberry Pi camera with the CSI interface has better support in the OpenCV environment than the PS3 Eye camera. The second reason is that the ReSpeaker has a pixel ring of 12 RGB LEDs, which can be used for visual feedback in addition to the speaker unit.

4.2 Speech algorithms based microphone array for advanced voice interaction capability

The firmware on the XVF-3000 chip (present on the ReSpeaker v2 hardware) produces six channels of microphone output over USB to the Linux system. Channel zero contains audio processed by the advanced DSP algorithms. Channels one to four contain the raw data from the microphone corresponding to each channel number. Channel five provides raw audio that is a combination of the raw signals from all four microphones on the ReSpeaker v2. A high-level illustration of this mic array's role is shown in Fig. 3. Here, the audio feed from channel zero is used for wake-word spotting and is also fed as voice input to Alexa. The benefits of using the ReSpeaker v2 with Alexa are listed below.
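Picking out channel zero from the six-channel USB capture can be sketched as a simple de-interleaving step. This is an illustrative reconstruction, not the prototype's actual script: the function name, buffer, and the assumption that samples arrive interleaved per frame are ours.

```python
import numpy as np

def extract_processed_channel(interleaved, num_channels=6, target=0):
    """De-interleave a multi-channel sample buffer and return one channel.

    `interleaved` is a 1-D array ordered [ch0, ch1, ..., ch5, ch0, ...],
    which is how multi-channel USB audio captures are typically delivered.
    On the ReSpeaker v2, channel 0 carries the DSP-processed audio that is
    used for wake-word spotting and as Alexa's voice input.
    """
    frames = np.asarray(interleaved).reshape(-1, num_channels)
    return frames[:, target]

# Two fake frames of 6-channel audio: channel k holds samples k and k + 10.
buf = np.array([0, 1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15])
processed = extract_processed_channel(buf)        # channel 0 samples: 0, 10
combined = extract_processed_channel(buf, target=5)  # channel 5 samples: 5, 15
```

The same reshape trick yields any of the four raw microphone channels by changing `target`, which is how the raw feeds could be compared against the processed one.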

Figure 3: Flow diagram to illustrate microphone array’s role: Capture and provide speech algorithm processed voice feed to Alexa Voice Service

  1. Far-field voice capture: Wake up and interact with the smart speaker by capturing and processing raw microphone inputs at distances of up to five meters or further.

  2. USB Audio Class 1.0 (UAC 1.0): USB audio is used to send digital audio from the Raspberry Pi to the digital-to-analog converter (DAC) built into the ReSpeaker v2. Class 1.0 can carry up to 24-bit/96 kHz hi-res files. This bypasses the Raspberry Pi's internal sound card and lets the USB DAC play Alexa's audio responses with much better quality.

  3. Twelve programmable RGB LED pixel ring: The RGB LED pixel ring on the ReSpeaker v2 is used to visually indicate the direction of arrival of the speech signal (the source). The pixel ring library addresses the LED pixels via the USB interface to change color and brightness according to the main program's requirements.

  4. Digital Signal Processing algorithms on the ReSpeaker v2:

    1. Beamforming: All MEMS microphones have an omnidirectional pickup response, i.e., their response is the same for sound coming from any direction around the microphone. A directional response, or beam pattern, can be formed by configuring multiple microphones in an array, enabling us to detect and track the position of the smart speaker user's voice across the room. As the user interacts with the smart speaker and walks around the room, the angle of the microphone beam adjusts automatically to track their voice. Hence, it is possible to point towards the user's direction and suppress noise or reverberation from other directions.

    2. Noise suppression: In acoustic beamforming, the spatial relationship of the microphones in the array achieves active noise suppression and control. If the direction of the sound source relative to the microphone array is known, an acoustic beamformer can be designed to pass signals coming from the source of interest and filter out sound picked up from other directions. This approach to microphone array noise reduction is most applicable when one person's voice needs to be heard while multiple people are talking. Noise suppression removes both stationary (point-noise) and non-stationary background sounds.

    3. De-reverberation: In any room, one's voice reverberates (reflects) off hard surfaces, e.g. a window or TV screen. De-reverberation removes these reflections and cleans up the voice signal.

    4. Acoustic Echo Cancellation: While interacting with electronic devices, users in some cases hear their own voice (sometimes with a significant delay); this experience is known as acoustic echo. Controlling and canceling acoustic echo is essential for voice-based systems such as smart speakers. For example, if the smart speaker user is watching a film on a TV at low volume and simultaneously gives voice input to the smart speaker, the microphones capture both the user's voice and the sound of the film (the acoustic echo). This acoustic echo is canceled from the voice input so that text can be extracted from the captured audio with better accuracy.
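The beamforming and noise-suppression ideas above can be illustrated with a minimal delay-and-sum beamformer. This is a textbook sketch, not the proprietary DSP running on the XVF-3000: the integer-sample delays, array geometry, and toy signal are our assumptions.

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Delay-and-sum beamformer with integer-sample delays.

    `signals` is a (num_mics, num_samples) array; `delays[i]` is the
    arrival delay (in samples) of the target wavefront at microphone i
    for the chosen look direction. Advancing each channel by its delay
    aligns the target wavefronts, so speech from the look direction adds
    coherently while off-axis sounds add incoherently and are attenuated.
    """
    num_mics = signals.shape[0]
    out = np.zeros(signals.shape[1])
    for sig, d in zip(signals, delays):
        out += np.roll(sig, -d)  # advance channel by its delay to align
    return out / num_mics

# A short wavefront that reaches mic 0 first, then mics 1..3 one sample
# apart each (a source off to one side of a linear 4-mic array).
target = np.array([0.0, 1.0, 0.0, -1.0, 0.0, 0.0, 0.0, 0.0])
mics = np.stack([np.roll(target, d) for d in range(4)])

aligned = delay_and_sum(mics, delays=[0, 1, 2, 3])    # recovers `target`
unaligned = delay_and_sum(mics, delays=[0, 0, 0, 0])  # smeared, lower peak
```

Steering the `delays` toward a different direction would instead smear the user's voice and reinforce whatever arrives from that direction, which is the mechanism behind the automatic beam tracking described above.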

4.3 Biometric authentication based Alexa wakeup

As illustrated in Fig. 5, when the face recognition script runs, faces are detected in the live frames captured from the Pi camera, and a 128-d face embedding is computed via a deep metric network for each detected face. The computed 128-d face embedding is then compared with a database of already-computed face encodings of registered faces to recognise faces in the live frame. Once a known face is recognised, the script wakes up Alexa and, simultaneously, the ReSpeaker's RGB LED pixel ring provides visual feedback to the user by turning green. Before running the face recognition script, a sub-script, as shown in Fig. 4, has to be run to encode 128-d vectors for the faces in the dataset (a directory with .jpg files of faces) and store the encodings in a .pickle file, which is later used as a database (while running the main face recognition script) to compare against faces detected in live frames and check for a match. Since the Pi has limited computational power, memory, and GPU, its resources have to be left free for other scripts to run. Hence, more powerful algorithms such as Eigenfaces and LBPs (Local Binary Patterns), which can achieve frame rates greater than 10 FPS, were not used.
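The matching step of the pipeline above can be sketched as follows. The deep metric network that produces the 128-d embeddings is assumed external (dlib-style networks are commonly used for this, with 0.6 as a typical Euclidean-distance threshold); the function name and toy vectors are illustrative, not the prototype's actual code.

```python
import numpy as np

def recognise(embedding, database, tolerance=0.6):
    """Compare a 128-d face embedding against enrolled encodings.

    `database` maps a registered name to its enrolled 128-d vector.
    A face is accepted when the Euclidean distance to the closest
    enrolled vector falls below `tolerance`; otherwise the face is
    treated as unknown and Alexa stays asleep. Returns the matched
    name, or None for an unknown face.
    """
    best_name, best_dist = None, float("inf")
    for name, enrolled in database.items():
        dist = np.linalg.norm(embedding - enrolled)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < tolerance else None

# Toy database: one enrolled user with a synthetic 128-d encoding.
rng = np.random.default_rng(1)
alice = rng.standard_normal(128) * 0.05
db = {"alice": alice}

print(recognise(alice + 0.01, db))  # a near-identical probe matches: alice
print(recognise(rng.standard_normal(128), db))  # unrelated vector: None
```

In the prototype's flow, a non-None result would trigger the Alexa wakeup and turn the pixel ring green, while None leaves the system idle.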

Figure 4: Flow diagram for preparing database of faces
Figure 5: Flow diagram of face recognition script

4.4 Alexa Voice Service C++ SDK

In parallel with setting up the smart speaker hardware and deploying the scripts, the Pi has to be registered as a device in the Amazon developer console and a security profile created. The detailed step-by-step instructions for the cloud-side setup are given in [alexa_2019_alexaavsdevicesdk]; the path of the generated config.json is provided while building the Alexa AVS Sample App from its SDK. This C++-library-based AVS Device SDK enables us to integrate Alexa into the smart speaker prototype. The smart speaker's interaction with AVS is performed using this Alexa Sample App, which is built for the Raspberry Pi from the official SDK. Before proceeding with the Alexa AVS C++ SDK, the Python version of the Alexa Voice Service app [respeaker_2018_respeakeralexa] was tested with the Raspberry Pi and ReSpeaker v2, and the following results were observed.

1. After interacting with Alexa for some time, Alexa's voice turned blurred and muffled; this was resolved only by restarting the Pi.

2. After spotting the Alexa wake word, there is a short delay (approx. 0.5 seconds) before audio is streamed to the Alexa cloud.

As mentioned in [alexa_2019_alexaavsdevicesdk] and shown in Fig. 6, multiple components comprise the C++ AVS SDK, through which the audio data flows. Initially, signal processing algorithms are applied to the input and output audio channels to produce processed, clear audio. If the raw audio data from the four microphones of the ReSpeaker were provided as input, this third-party audio signal processor would combine them and provide a single audio stream to the next component in the architecture. Here, however, we already provide a single-channel audio stream processed by the DSP on the XVF-3000 chip of the ReSpeaker v2. The remaining subparts of the architecture perform their functionality as described in [alexa_2019_alexaavsdevicesdk] and Fig. 6. Snowboy and Sensory [a2014_home] are two third-party wake-word engines, one of which has to be part of the SDK build to spot the Alexa wake word in the input streams and provide hands-free interaction. Both engines were tested with this ReSpeaker v2-based Alexa smart speaker setup. The Snowboy wake-word engine was selected and used as a plugin when building the Alexa AVS Sample App, since it consumes less than 8% of the Raspberry Pi's CPU and had better success at wake-word detection.

Figure 6: Data flow between components of AVS Device C++ SDK [alexa_2019_alexaavsdevicesdk]
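The prototype's two-step wakeup, in which a recognised face arms the system and only then does a spotted wake word open the audio stream to the cloud, can be sketched as a small gate. This is our illustrative logic, not the SDK's actual interface; the class name and the 30-second re-authentication window are assumptions.

```python
class WakeGate:
    """Gate cloud audio streaming behind both face auth and the wake word.

    A known face (from the face recognition script) arms the speaker for
    a limited window; a spotted "Alexa" wake word then opens the stream.
    A wake word alone, or a stale authentication, keeps the stream shut.
    """

    def __init__(self, auth_timeout=30):
        self.auth_timeout = auth_timeout  # seconds a face auth stays valid
        self.authed_at = None

    def on_face_recognised(self, now):
        self.authed_at = now  # arm the speaker at this timestamp

    def should_stream(self, wake_word_spotted, now):
        armed = (self.authed_at is not None
                 and now - self.authed_at <= self.auth_timeout)
        return armed and wake_word_spotted

gate = WakeGate()
print(gate.should_stream(True, now=0))   # False: wake word alone is not enough
gate.on_face_recognised(now=0)
print(gate.should_stream(True, now=5))   # True: authenticated and wake word
print(gate.should_stream(True, now=60))  # False: authentication expired
```

In the actual prototype the "armed" signal comes from the face recognition script waking the Alexa Sample App, and the wake-word side is handled by the engine compiled into the SDK build.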

4.5 Snowboy wake-word engine to spot the Alexa wake-word

The Snowboy engine ensures the smart speaker is not activated accidentally when the wake word is not called out. The accuracy of wake-word detection engines is measured by plotting false alarms per hour (the number of false positives) against the miss-detection rate (the percentage of wake-word utterances an engine incorrectly rejects). The ROC curves of four different wake-word detection engines are shown in Fig. 7. Here, the Snowboy wake-word engine has the lowest miss-detection rate and is more accurate than the other engines. The reasons for integrating wake-word engines with voice-based AI assistants and smart speakers are listed below [alirezakenarsari_2018_yet].

1. Privacy: Microphones do not have to listen all the time.

2. Cost: Streaming data to the cloud all the time is impractical and expensive.

3. Power consumption: Voice assistants run on smartphones, wearables, and smart speakers, where maximum standby time is expected.

Figure 7: ROC curves for popular wake-word engines [picovoice_2018_picovoicewakewordbenchmark]
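The two axes of such ROC curves, false alarms per hour and miss-detection rate, can be computed from a labeled test run as follows. This is a sketch; the event format and the toy numbers are our assumptions, not data from the benchmark in Fig. 7.

```python
def wake_word_metrics(events, audio_hours):
    """Compute one ROC operating point for a wake-word engine.

    `events` is a list of (was_wake_word, detected) pairs from a labeled
    test run, and `audio_hours` is the total duration of non-wake-word
    audio played to the engine. Returns (false alarms per hour,
    miss-detection rate in percent), matching the two ROC axes.
    """
    false_alarms = sum(1 for truth, det in events if det and not truth)
    detections = [det for truth, det in events if truth]
    misses = detections.count(False)
    miss_rate = 100.0 * misses / len(detections) if detections else 0.0
    return false_alarms / audio_hours, miss_rate

# Toy run: 20 wake-word utterances (2 missed) plus 3 false triggers
# over 6 hours of background audio.
events = [(True, True)] * 18 + [(True, False)] * 2 + [(False, True)] * 3
print(wake_word_metrics(events, audio_hours=6.0))  # (0.5, 10.0)
```

Sweeping the engine's sensitivity and recomputing this pair at each setting traces out one of the curves shown in Fig. 7.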

5 Conclusion and Discussion

Over 50% of the Irish population is expected to own a regular smart speaker by 2023 [taylor_2019_over], and it is predicted that smart speaker ownership will overtake that of tablets globally by 2021 [taylor_2019_over]. Likewise, camera-enabled smart speakers will soon replace regular smart speakers and become an integral part of our daily life. This paper provided an overview of the cybersecurity risks faced by smart speaker users due to the lack of an authentication scheme and discussed the development of a state-of-the-art camera-enabled, microphone array-based modern Alexa smart speaker prototype that addresses these risks with biometrics-based system wakeup and microphone array-based interaction. Since this smart speaker prototype is a camera-enabled Linux-based system, it is capable of hosting custom skills that perform audio processing and computer vision-based tasks when requested by the user. We plan to extend our existing work to multiple use cases requiring voice commands, such as smart enterprises (online meetings) and other digital voice assistants [sudharsan2019ai] [sudharsan2019microphone] [sudharsan2021owsnet].


This publication has emanated from research supported by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/16/RC/3918 (Confirm), and SFI/12/RC/2289_P2 (Insight) co-funded by the European Regional Development Fund.