The payments industry has witnessed the introduction of several innovative payment solutions in the last decade. Examples include contactless payment cards, smartphone and smartwatch payment applications (e.g., Apple Pay, Google Pay, Samsung Pay), and mobile terminals (e.g., Square terminal). While cash and cards remain the two most prominent payment methods, such innovative solutions are increasingly being created and used for in-store and online payments. A relatively new focus area for the payments industry is in-vehicle payments, where the goal is to enable the driver (and/or passengers) to pay in a seamless and secure manner.
I-A Existing Solutions
Several vehicle manufacturers have integrated digital assistants, such as Siri and Alexa, into infotainment systems that can be used for making voice-based in-vehicle payments. In fact, Amazon recently piloted in-vehicle payments using Alexa. In addition, some newer vehicles are “connected” and have integrated payment capabilities in the head unit. Currently, such systems are closed-loop and typically only enable payments for the owner of the vehicle (and not for passengers in an open-loop manner). These systems also do not help owners of older vehicles that lack these advanced capabilities. Alternatively, smartphone apps can be used for making in-vehicle payments. While they are a viable option for passengers, smartphone usage can be quite distracting and is not recommended for drivers.
I-B Proposed Solution
Given the limitations of existing payment solutions, we design an open-loop system that uses face and voice biometrics for enabling in-vehicle payments (Fig. 2). The users of the system enroll their face and voice templates once in a mobile app, and then pair their mobile device with a plug-and-play device (i.e., a dashcam) that is mounted in the vehicle. To initiate payment, a passenger invokes the dashcam with a trigger phrase, for example, “Hey DashCam”, and then issues a command, such as “Pay for gas at pump 5”. On hearing the trigger phrase, the dashcam takes a picture of the passengers sitting in the vehicle using the in-cabin camera, and records the audio of the command. The dashcam then initiates a privacy-preserving biometric comparison protocol with each of the connected mobile devices of the driver and/or passengers. Both face and voice biometrics are used to determine which passenger issued the command and wants to pay. If a unique payer is determined, payment is initiated via the mobile device of the payer.
We collected data from 20 different subjects at two different sites using a commercially available dashcam (Fig. 1), and evaluated open-source biometric recognition algorithms to show the feasibility of our system. We also developed an Android-based prototype of our system using open-source software packages to show the utility of DashCam Pay for facilitating in-vehicle payments.
I-C Payment Use Cases
The proposed system can be useful in a variety of payment scenarios (Fig. 3). Below, we discuss the benefits of our system for key use cases.
I-C1 Drive-throughs
Cash or cards are still heavily used at drive-throughs for making payments. The typical process can be quite inconvenient given that the driver has to reach for their wallet to retrieve cash or a payment card, and hand it to a merchant representative. Some merchants, such as Starbucks, use payment models where users pay via an app. However, these apps may still require the customer to present their mobile device to an employee. In our system, the customer simply uses a voice command and this additional step is eliminated.
I-C2 Toll booths
Road tolls are assessed by stopping at a booth, by registration of a radio-transmitting device placed in the vehicle, or by mail after a camera at the toll road takes a photo of the vehicle’s license plate. With DashCam Pay, the driver could make a payment while approaching the toll booth using a voice command. The license plate number is registered in the dashcam during initial installation. Then, the vehicle’s location is used to determine to which authority the payment should be sent.
I-C3 Parking
A variety of modern parking solutions exist, including multi-space meters, pay-by-phone, pay-and-display, and others. However, all of these require the customer to interact with either a physical meter machine or their mobile device. With DashCam Pay, the customer could simply give a voice command with either their license plate number or space number, as required by the location. This identifier is then sent along with payment to the relevant parking authority (as identified by the device’s location).
I-C4 Gas stations
Our solution allows the customer to skip the step of inserting a payment card at a gas station. Instead, the customer can give a voice command that includes the pump number. The merchant is again determined by the device’s location.
I-C5 Retail stores
Many options exist for making payments at a point-of-sale terminal in a retail store, including traditional swipe or chip cards, contactless cards, and near-field communication (NFC) payments from mobile devices. Our system could be deployed in a similar spirit at payment terminals in retail locations. These terminals would be equipped with a camera and microphone, and would allow customers to make in-store payments without interacting with their mobile device or payment card.
I-D Key Contributions
The major contributions of this work are:
- A multi-biometric system based on face and voice that enables secure and seamless in-vehicle payments.
- An open-loop realization of this system that is not tied to any vehicle manufacturer or to any proprietary payment application; the fundamental system design is reusable for other payment scenarios such as in-store payments.
- A privacy-preserving biometric authentication protocol that compares encrypted biometric data locally between a vehicle-mounted device and nearby mobile devices, driven by a dynamic biometric template gallery based on passengers currently sitting in the vehicle.
II Payer Identification
Our three key objectives were to build a system that is (1) seamless, (2) secure, and (3) responsive in real time. With these in mind, the system design includes the following modules:
- Trigger phrase detection: allowing the user to invoke the dashcam in a hands-free manner by speaking a specific phrase (e.g., “Hey DashCam…”)
- Face recognition: allowing (a) enrollment of a user’s face on the user’s personal mobile device, and (b) real-time face recognition on the dashcam device to identify the potential payers from the passengers with mobile devices connected to the dashcam (whether in advance at connection time, periodically throughout the ride, or at the time a voice command is issued)
- Speech recognition: for the parsing and execution of the user’s voice commands
- Speaker recognition: allowing (a) enrollment of the user’s voice on the user’s personal mobile device, and (b) real-time speaker recognition on the dashcam device to determine which of the potential payers identified by face recognition is the payer
- Wireless communication interface: facilitating communication between the users’ personal devices and the dashcam device (such as BLE or Wi-Fi Direct)
- Payment network interface: connecting the users’ devices to the payment network for the release of payment funds to the merchant
- Cryptography module: for generating and managing homomorphic encryption keys to enable privacy-preserving biometric comparison
II-A Feasibility Study
To assess the feasibility of performing biometric recognition in a real-life, in-vehicle scenario, we collected data using a commercial dashcam device (Vantrue N2 Pro Uber Dual Dash Cam, which uses a Sony Exmor IMX323 sensor with four infrared LEDs for dual 1920x1080p visible and infrared spectrum video).
|#Subjects||#Vehicles||Visible rec.||Infrared rec.||Audio rec.|
|20||5||4 mins.||2 mins.||2 mins.|
The data consists of audio and video streams from 20 subjects in five different vehicles at two different sites, captured in both visible and infrared spectrum. An example of a captured video frame is shown in Fig. 1. The dashcam’s default settings were used, whereby it automatically switches to infrared wavelength video recording when low lighting levels are detected. For each subject, approximately four minutes of visible spectrum and two minutes of infrared spectrum video were captured. Audio data was captured simultaneously, with scripted verbal commands comprising approximately two minutes of total audio per subject.
Each subject was recorded while giving a number of voice commands for four different use cases: fuel, toll, parking and fast food. These commands were constructed specifically for ease of use, and to ensure that the user includes the necessary information for each proposed use case (e.g., parking space number, order number). Example sentences include:
- “Hey DashCam, pay for parking at space number 5208.”
- “Hey DashCam, pay for order number 120.”
- “Hey DashCam, pay for toll.”
- “Hey DashCam, pay for gas at pump six.”
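As an illustration of how such scripted commands could be turned into a structured payment request after speech recognition, the following is a minimal sketch. The patterns, the `PaymentCommand` type, and the `parse_command` function are hypothetical names introduced here for illustration; they are modeled only on the example sentences above and are not taken from the prototype.

```python
import re
from typing import NamedTuple, Optional


class PaymentCommand(NamedTuple):
    use_case: str               # "parking", "order", "gas", or "toll"
    identifier: Optional[str]   # space, order, or pump number, if the command has one


# Hypothetical patterns modeled on the example sentences; a real grammar may differ.
_PATTERNS = [
    ("parking", re.compile(r"pay for parking at space(?: number)? (\w+)")),
    ("order", re.compile(r"pay for order(?: number)? (\w+)")),
    ("gas", re.compile(r"pay for gas at pump(?: number)? (\w+)")),
    ("toll", re.compile(r"pay for toll")),
]

# Spoken single digits are mapped to numerals so "pump six" and "pump 6" agree.
_NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def parse_command(transcript: str) -> Optional[PaymentCommand]:
    """Return the parsed command, or None if the transcript is not a valid command."""
    # Lowercase, replace punctuation with spaces, and collapse whitespace.
    text = " ".join(re.sub(r"[^\w\s]", " ", transcript.lower()).split())
    if not text.startswith("hey dashcam"):
        return None
    for use_case, pattern in _PATTERNS:
        match = pattern.search(text)
        if match:
            identifier = match.group(1) if match.groups() else None
            if identifier is not None:
                identifier = _NUMBER_WORDS.get(identifier, identifier)
            return PaymentCommand(use_case, identifier)
    return None
```

A fixed grammar of this kind keeps parsing predictable, at the cost of rejecting commands that deviate from the scripted phrasing.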
Next, we evaluated one state-of-the-art algorithm each for face detection, face recognition, trigger phrase detection, speech recognition, and speaker recognition to determine the feasibility of building the proposed system.
|Module||Algorithm||Result|
|Face Detection||MTCNN||TPR=99.1%|
|Face Recognition (1:N)||FaceNet||TPR=98.9%|
|Trigger Phrase Detection||Mycroft AI Precise||TPR=98.2%|
|Speech Recognition||DeepSpeech||WER=3.65%|
|Speaker Recognition (1:N)||COTS||TPR=98.4%|
Face Detection: A pre-trained multi-task convolutional neural network (MTCNN)-based face detector was used for the face detection experiments. Face locations of subjects sitting in the vehicle were manually labelled with a bounding box in each frame of a recorded video. Face detection performance was computed in aggregate over all frames in a video. If a face was detected in a location that does not overlap with a manually labelled bounding box, it was recorded as one false accept; if no face was detected within a labelled bounding box, it was recorded as one false reject. In our experiments, the MTCNN-based method produced both false accepts and false rejects on the collected data, with the true positive rate reported in Table II. Face detection failures were observed in captured frames with (i) extreme facial pose of subjects, or (ii) lack of proper illumination. Figs. 3(a) and 3(b) show failure examples. Note, however, that extreme facial pose is less likely to be encountered in practice at the time of recognition, as the subjects cooperate while interacting with the dashcam.
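The counting rule described above can be sketched as follows. This is a minimal illustration under one stated assumption: any region of positive-area overlap between a detected box and a labelled box counts as a match, since no overlap threshold is specified here. The function and type names are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Frame = Tuple[List[Box], List[Box]]      # (detected boxes, manually labelled boxes)


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if the two boxes share a region of positive area (assumed match rule)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def count_frame_errors(detections: List[Box], labels: List[Box]) -> Tuple[int, int]:
    """Return (false accepts, false rejects) for a single frame."""
    # A detection that overlaps no labelled box is one false accept.
    false_accepts = sum(
        1 for d in detections if not any(boxes_overlap(d, g) for g in labels)
    )
    # A labelled box with no overlapping detection is one false reject.
    false_rejects = sum(
        1 for g in labels if not any(boxes_overlap(d, g) for d in detections)
    )
    return false_accepts, false_rejects


def evaluate_video(frames: List[Frame]) -> Dict[str, float]:
    """Aggregate the per-frame counts over all frames of a video."""
    total_fa = total_fr = total_labels = 0
    for detections, labels in frames:
        fa, fr = count_frame_errors(detections, labels)
        total_fa += fa
        total_fr += fr
        total_labels += len(labels)
    tpr = 1.0 - total_fr / total_labels if total_labels else 0.0
    return {"false_accepts": total_fa, "false_rejects": total_fr, "tpr": tpr}
```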
Face Recognition: A pre-trained FaceNet model was used for the face recognition experiments. An image of each subject sitting in the vehicle, captured using their smartphone, was assumed to be enrolled in the gallery along with the subject identifier. Detected faces in each video frame were treated as probe images, and subject identifiers for the probe images were manually labelled. Small-scale identification experiments (1:N, where N is the number of subjects sitting in the vehicle) were performed to assess performance on the collected data. Identification performance was measured using the true positive identification rate (TPIR) at a fixed false positive identification rate (FPIR). In our experiments, FaceNet yielded the TPIR reported in Table II (98.9%). Fig. 3(b) shows a failure of face recognition in a low-lighting environment in the visible spectrum.
Trigger Phrase Detection: Mycroft AI Precise was used for the trigger phrase detection experiments. Positive examples (speech samples containing the trigger phrase) and negative examples (speech samples that do not contain it) were manually created from the data. Pre-trained models are available for certain trigger phrases. However, to train a custom detector for the phrase “Hey DashCam”, the audio data was split evenly into training and testing sets (10 subjects for training and 10 for testing). For each positive sample, five negative samples were used in the training set to reduce false positives. On the test data, the trained detector achieved the true positive rate reported in Table II (98.2%).
Speech Recognition: Mozilla’s open-source DeepSpeech implementation was used for the speech recognition experiments. Audio data extracted from the video streams was manually labelled based on the audio commands spoken by the subjects. Word error rate (WER), which measures the proportion of incorrectly recognized words, was used as the evaluation metric. The pre-trained model was first evaluated on the extracted audio data. Although the pre-trained model performs well on the LibriSpeech test-clean benchmark, its WER on the collected data was worse. The model frequently failed to recognize relevant words like “dashcam”, “gas”, and “parking”. Because this performance was below par, a portion of the audio data was used to fine-tune the pre-trained model. Cross-validation experiments (20 percent for fine-tuning, and 80 percent for evaluation) were conducted to measure performance after fine-tuning, and the average WER decreased as a result. The majority of the remaining errors were partial; for instance, “parking” was recognized as “parting”. Such errors were corrected using a dictionary of permitted words/phrases in a command. The following heuristic was used: if the edit distance between the detected word and the closest word in the dictionary was less than or equal to 2, the detected word was auto-corrected to the closest word in the dictionary. Using this approach, the overall WER was reduced to 3.65% (Table II).
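The WER metric and the dictionary-based correction heuristic can be sketched together, since both rest on the same edit distance. The vocabulary below is a hypothetical example built from the sample commands, not the dictionary used in the experiments, and the function names are illustrative.

```python
from typing import List, Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical dictionary of permitted words, taken from the example commands.
PERMITTED_WORDS = ["hey", "dashcam", "pay", "for", "parking", "at", "space",
                   "number", "order", "toll", "gas", "pump"]


def autocorrect(word: str, dictionary: List[str], max_distance: int = 2) -> str:
    """Replace a word with its closest dictionary entry if within max_distance."""
    closest = min(dictionary, key=lambda entry: edit_distance(word, entry))
    return closest if edit_distance(word, closest) <= max_distance else word


def correct_transcript(transcript: str, dictionary: List[str]) -> str:
    return " ".join(autocorrect(word, dictionary) for word in transcript.split())
```

For example, `correct_transcript("hey dashcam pay for parting at space number 5208", PERMITTED_WORDS)` maps “parting” back to “parking” while leaving the space number untouched. Note that a threshold of 2 is permissive for short words, so the dictionary should contain every permitted word, including short function words such as “at” and “for”.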
Speaker Recognition: A commercial off-the-shelf (COTS) system was used for the speaker recognition experiments. Three samples of an audio command from each subject were enrolled in the system. Small-scale identification experiments (1:N, where N is the number of subjects sitting in the vehicle) were performed to assess speaker recognition performance. Identification performance was measured using the true positive identification rate (TPIR) at a fixed false positive identification rate (FPIR). A TPIR of 98.4% was obtained (Table II).
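The identification metric used for both face and speaker recognition can be sketched as below. This follows the common open-set definition, which is an assumption here since the exact protocol is not spelled out: FPIR is the fraction of non-mated probes (subjects not in the gallery) whose best score reaches the threshold, and TPIR is the fraction of mated probes whose best-scoring gallery identity is correct and reaches the threshold. All names are illustrative.

```python
from typing import Dict, List, Optional, Tuple

# A probe is (true identity or None if not enrolled, similarity score per gallery identity).
Probe = Tuple[Optional[str], Dict[str, float]]


def top_match(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the gallery identity with the highest similarity and its score."""
    identity = max(scores, key=scores.get)
    return identity, scores[identity]


def fpir(probes: List[Probe], threshold: float) -> float:
    non_mated = [scores for truth, scores in probes if truth is None]
    hits = sum(1 for scores in non_mated if top_match(scores)[1] >= threshold)
    return hits / len(non_mated)


def tpir(probes: List[Probe], threshold: float) -> float:
    mated = [(truth, scores) for truth, scores in probes if truth is not None]
    hits = 0
    for truth, scores in mated:
        identity, score = top_match(scores)
        if identity == truth and score >= threshold:
            hits += 1
    return hits / len(mated)


def tpir_at_fpir(probes: List[Probe], target_fpir: float) -> Tuple[float, float]:
    """Pick the lowest observed top score whose FPIR meets the target; report TPIR there."""
    candidates = sorted({top_match(scores)[1] for _, scores in probes})
    for threshold in candidates:
        if fpir(probes, threshold) <= target_fpir:
            return threshold, tpir(probes, threshold)
    # No observed score is strict enough, so nothing is accepted.
    return float("inf"), 0.0
```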
Table II summarizes the evaluation results. Overall results show the feasibility of conducting in-vehicle payer identification using face and voice biometrics.
|Software package||Function||Advantages||Limitations|
|Kaldi||Trigger phrase detection, speech recognition, speaker recognition||Multifunctional, high accuracy||No Android library available|
|Mycroft AI Precise||Trigger phrase detection||Reported to have good performance, good performance in evaluation||No Android library available|
|Kitt.AI Snowboy||Trigger phrase detection||Reported to have good performance||Requires licensing for custom phrase model|
|Picovoice Porcupine||Trigger phrase detection||Reported to have good performance||Requires licensing for custom phrase model|
|FaceNet||Face recognition||Fast and accurate face detection and identification in evaluation, Android implementations available||Pre-trained models for mobile devices are too large for distribution (approx. 200MB)|
|Mozilla Project DeepSpeech||Speech recognition||High accuracy, pre-trained models can be easily fine-tuned, Android implementation supported||Cannot use fine-tuned models on Android (at time of prototype development)|
|Android SpeechRecognizer||Speech recognition||High accuracy, model files are downloaded as part of Android OS||Not available for iOS user devices or non-Android commercial dashcams|
|Alize||Speaker recognition||Android implementation supported||Poor performance in evaluation|
|COTS||Speaker recognition||Good performance in evaluation||Offline implementation not available|
III Prototype Development
Commercially available dashcams are not programmable, so we decided to implement the prototype as a software stack in Android, wherein one Android device is mounted in-cabin to act as the dashcam, and the others act as personal devices of the passengers (Fig. 5). The software packages that were considered for use are summarized in Table III.
III-A Wireless networking interface
One basic requirement for the system to work in real-time is a framework to connect the dashcam to any nearby mobile devices that have the payment app installed and configured. This connection needs to be both secure and seamless, occurring with no intervention from the user.
Given this requirement, we selected Google’s Nearby Connections API. This API uses both Bluetooth and Wi-Fi Direct to connect devices. The combination of modalities allows for the high speeds and versatility of Wi-Fi, with Bluetooth as a fallback. These technologies are also platform-agnostic, and can be implemented on any realization of the dashcam device, whether it be a mobile device or a Raspberry Pi-like device (via Android Things).
III-B Trigger phrase detection
The prototype implementation platform (Android) did not provide functionality for a third-party application (e.g., Mycroft AI Precise) to continuously access the device microphone. Therefore, we implemented trigger phrase detection using Google Assistant as a bridge. This allows the user to give a command such as “Hey Google, open DashCam Pay” or even “Hey Google, tell DashCam Pay to pay for gas at pump number four,” which would then trigger the same authentication and payment flow described above.
Software packages that could be useful for platforms other than Android are listed in Table III.
III-C Face recognition
FaceNet provided reasonable face recognition performance in our evaluation. Although the available model files for FaceNet are somewhat large in size (approx. 200MB), the model can be ported to the Android platform. Hence, it was used in our prototype. Note that other state-of-the-art models like SphereFace and CosFace can also be used.
III-D Speech recognition
The pre-trained DeepSpeech model did not provide adequate performance during our evaluation, so we had to fine-tune the model on the collected data. At the time of prototype development, however, DeepSpeech’s Android library did not support the use of custom models. Therefore, Kaldi and Android’s SpeechRecognizer class were considered as potential alternatives. However, no pre-built Android library was provided for Kaldi, and running the provided binaries on Android required root access. Given this, we chose Android’s SpeechRecognizer class for the prototype, as it did not require additional software, and has the ability to run offline in real time with state-of-the-art results. Note, however, that this library is Android-specific.
III-E Speaker recognition
Open-source SDKs like Kaldi and Alize were initially considered for speaker recognition. As mentioned earlier, for Kaldi, there is no pre-built Android library, and running the binaries on an Android device requires root access. In the case of Alize, an Android implementation exists, but the performance was unsatisfactory during the initial live testing of the Alize demo application for Android in a speaker verification scenario. Significant overlap was observed in the scores returned for the genuine and impostor verification attempts.
The initial system design included a fully offline authentication process. However, due to the above challenges with prototype platform compatibility and performance, we implemented speaker recognition in the prototype using a commercial off-the-shelf (COTS) speaker recognition API to demonstrate the system’s feasibility.
III-F Privacy-preserving biometric comparison
Our system creates, stores and compares each user’s biometric templates on local devices (namely, the user’s mobile device and the dashcam), without transmitting the templates to a remote server. Additionally, templates are stored and compared in the encrypted domain. Below we describe the different steps in the privacy-preserving biometric comparison protocol used in our system (Fig. 6).
III-F1 Enrollment
At the time of initial enrollment, a user captures their face and voice data on their personal mobile device. Specifically, the user is prompted to take one or more selfie images, and to record voice commands pertaining to different use cases, such as drive-through, parking and gas. Face and voice templates are generated from this data. Additionally, the user device generates a public-private key pair for a homomorphic encryption scheme (e.g., Paillier encryption), encrypts the face and voice templates using the public key, and stores the encrypted templates. Once enrolled, the user’s device becomes available for connecting to a dashcam over the wireless networking interface. The user also explicitly permits pairing their device to a dashcam device.
III-F2 Device discovery
While operating, a dashcam periodically searches for enrolled devices in proximity, and initiates a secure connection with each nearby enrolled device over the wireless networking interface. If the connection is successful, each device sends its encrypted templates and public key to the dashcam. This results in the creation of a dynamic template gallery on the dashcam.
When a voice command is given, the dashcam captures an audio recording of the command, as well as an image of the vehicle’s cabin. A voice template is generated from the command audio, and a face template is created from each detected face. The authentication is then performed in two steps. First, the templates generated from the detected faces are used to identify candidate payers from the passengers with mobile devices connected to the dashcam. Second, the voice template from the command is used to disambiguate between the candidate payers to determine who issued the command.
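The two-step logic above can be sketched in the clear as follows. This is only an illustration of the control flow: in the described system the comparisons run in the encrypted domain, whereas this sketch compares plaintext vectors. Cosine similarity, the threshold values, and all names (`EnrolledDevice`, `identify_payer`) are assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class EnrolledDevice:
    device_id: str
    face_template: Sequence[float]
    voice_template: Sequence[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_payer(gallery: List[EnrolledDevice],
                   detected_faces: List[Sequence[float]],
                   command_voice: Sequence[float],
                   face_threshold: float = 0.8,    # hypothetical value
                   voice_threshold: float = 0.8    # hypothetical value
                   ) -> Optional[str]:
    """Return the payer's device id, or None if no unique payer is found."""
    # Step 1: face templates narrow the gallery down to candidate payers.
    candidates = [
        device for device in gallery
        if any(cosine_similarity(face, device.face_template) >= face_threshold
               for face in detected_faces)
    ]
    # Step 2: the voice template disambiguates between the candidates.
    matches = [
        device.device_id for device in candidates
        if cosine_similarity(command_voice, device.voice_template) >= voice_threshold
    ]
    return matches[0] if len(matches) == 1 else None
```

Running face matching first means the voice comparison only has to separate the passengers actually seen in the cabin, rather than every connected device.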
Comparisons are performed in the encrypted domain using homomorphic encryption such that the encryption does not impact biometric matching performance, and encrypted similarity scores are generated. Alternatively, secure multi-party computation techniques can be used. Each encrypted similarity score is sent to the corresponding user’s device, where it is decrypted with the user’s private key. The user’s device compares the score to a predetermined threshold, and reports whether a match was found to the dashcam using a zero-knowledge proof. The zero-knowledge proof lets the dashcam verify the match without exposing the user’s biometric templates or private key.
Once the dashcam has verified the reported matches, and if exactly one matching payer is found, the dashcam instructs the payer’s mobile device to initiate the transaction. If no matching payer is found or more than one matching payer is found, a remedial action is taken. One remedial action when authentication fails, for instance, is to request the user to retry authentication. In the unlikely scenario that authentication fails after multiple attempts, another remedial action is to prompt the user to use an alternative method, or to call a particular number to finish the payment transaction.
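To make the encrypted comparison concrete, the following is a toy sketch of an inner-product similarity score computed over Paillier-encrypted template values. It is a simplified illustration under several assumptions, not the protocol used in the system: templates are quantized to small integers, the similarity is a plain inner product, the key uses two small well-known Mersenne primes (far too small for real use), and the zero-knowledge proof step is omitted. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import secrets
from typing import List, Tuple

# Toy primes (2^31 - 1 and 2^61 - 1); real keys must be far larger.
P = 2**31 - 1
Q = 2**61 - 1


def generate_keypair(p: int = P, q: int = Q) -> Tuple[int, Tuple[int, int]]:
    """Return (public modulus n, private key (lambda, mu)), using g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)
    return n, (lam, mu)


def encrypt(n: int, message: int) -> int:
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    # With g = n + 1, g^m mod n^2 equals 1 + m*n mod n^2.
    return ((1 + (message % n) * n) * pow(r, n, n_sq)) % n_sq


def decrypt(n: int, private_key: Tuple[int, int], ciphertext: int) -> int:
    lam, mu = private_key
    n_sq = n * n
    message = ((pow(ciphertext, lam, n_sq) - 1) // n * mu) % n
    # Map the upper half of the range back to negative integers.
    return message - n if message > n // 2 else message


def encrypted_dot(n: int, encrypted_template: List[int], probe: List[int]) -> int:
    """Dashcam side: combine ciphertexts so the result encrypts sum(t_i * x_i)."""
    n_sq = n * n
    result = encrypt(n, 0)
    for ciphertext, x in zip(encrypted_template, probe):
        # Raising Enc(t) to the power x yields Enc(t * x); multiplying adds plaintexts.
        result = (result * pow(ciphertext, x % n, n_sq)) % n_sq
    return result
```

In this sketch the user’s device would call `generate_keypair` and `encrypt` at enrollment, the dashcam would call `encrypted_dot` with the probe template it extracted, and the user’s device would call `decrypt` on the returned score and compare it with its threshold. The dashcam never sees the plaintext template or the score.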
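The decision rule described in this step can be summarized in a few lines. The attempt limit and the names below are hypothetical; the number of retries allowed before falling back is not specified here.

```python
from enum import Enum
from typing import List


class Action(Enum):
    INITIATE_PAYMENT = "initiate_payment"
    RETRY = "retry_authentication"
    FALLBACK = "use_alternative_method"


def next_action(matching_payers: List[str], attempt: int, max_attempts: int = 3) -> Action:
    """Choose the dashcam's next step after one authentication attempt.

    attempt is 1 for the first try; max_attempts is an assumed limit.
    """
    if len(matching_payers) == 1:
        return Action.INITIATE_PAYMENT
    # Zero matches or an ambiguous result: retry until the limit is reached.
    if attempt < max_attempts:
        return Action.RETRY
    return Action.FALLBACK
```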
In addition to the in-vehicle scenario, the proposed system can also be useful for other types of payment. These include both traditional payment scenarios (such as physical stores, as discussed in Section I-C5, and ATMs), as well as “situated payments”. Situated payments are those scenarios where a traditional point-of-sale (POS) terminal or cash register would not typically be installed, but payments may still be needed (e.g., a farmer’s market, vending machine, or, indeed, a vehicle). In the third quarter of 2019, Square reported a 25% year-over-year increase in gross payment volume, demonstrating that merchants are increasingly using these modern alternatives instead of a traditional cash register. This evolution in POS systems has recently been the focus of industry research as well. Below, we highlight some potential use cases.
- Hotels and homeshares: The system could be installed on a kiosk to pay for room service food, pay-per-view television media, snacks, etc.
- Vending machines: Existing systems, such as Stockwell cabinets, eliminate physical and tap-based payments for vending machine transactions. In these systems, users register a payment card in the vendor’s mobile application. The user presses a button in the mobile application to unlock the cabinet’s door, and takes the desired products. The machine detects which products have been taken, and debits the card of the user who unlocked the door. Our system could improve upon this by eliminating the mobile application interaction: the user could simply approach the door and speak a command to verify their identity and access the stocked products.
- ATMs: The system could be implemented at an ATM to supplement or replace a traditional card and PIN. The user’s device could be used in place of a debit card to initially claim an identity by connecting to the ATM using our framework. Then, the user’s face and voice can be used to validate their identity before providing funds.
- Self-driving taxis: As autonomous vehicle technology advances, we may someday be able to hail a self-driving taxi. When entering the vehicle, the passenger may not be known to the taxi company. However, with DashCam Pay installed in the taxi, the passenger’s device can connect to the dashcam to claim the passenger’s identity. Then, the passenger can quickly use their face and voice to verify themselves and provide payment. A passenger’s identity could also be associated with “favorite” destinations, such as their home or workplace, for convenient transportation.
V Conclusions and Future Work
We propose a system for conducting in-vehicle payments using face and voice biometrics. One particularly important feature of the proposed system is its open-loop nature. While proprietary in-vehicle systems can deliver payment capabilities, the proposed system streamlines the user experience. Any merchant or user can participate, without any dependency on payment technology included by a vehicle manufacturer. This alleviates the need for vehicle manufacturers to develop proprietary payment systems.
In the prototype, an Android smartphone was used to mimic a dashcam device. However, in practice, it would be preferable to install this system on a dedicated dashcam device. We invite the research community to integrate the system we present here in an open-source hardware device. Ideally, such a device should be designed with relatively inexpensive, off-the-shelf parts to enable widespread, low-cost deployment.
While open-source modules show the feasibility of realizing the proposed system, sourcing commercial software packages for face, speech and speaker recognition is recommended for a production system. This will ensure reasonable system performance and user experience in practice. In addition, the DashCam Pay mobile application needs to interface with merchants in real time in order to process payments. Furthermore, comprehensive end-to-end testing and large-scale real-world pilots are required to refine the system before deployment in production.