Conversational Intent Understanding for Passengers in Autonomous Vehicles

12/14/2018 · by Eda Okur, et al. · Intel

Understanding passenger intents and extracting relevant slots are important building blocks towards developing a contextual dialogue system responsible for handling certain vehicle-passenger interactions in autonomous vehicles (AV). When passengers give instructions to AMIE (Automated-vehicle Multimodal In-cabin Experience), the agent should parse such commands properly and trigger the appropriate functionality of the AV system. Our AMIE scenarios describe usages of and support for various natural commands for interacting with the vehicle. We collected a multimodal in-cabin dataset with multi-turn dialogues between the passengers and AMIE using a Wizard-of-Oz scheme. We explored various recent Recurrent Neural Network (RNN) based techniques and built our own hierarchical models to recognize passenger intents along with the relevant slots associated with the action to be performed in AV scenarios. Our experimental results achieved F1-scores of 0.91 for utterance-level intent recognition and 0.96 for slot extraction.


1 Introduction

Understanding passenger intents and extracting the relevant slots are key building blocks of a contextual dialogue system that handles vehicle-passenger interactions in autonomous vehicles (AV). When passengers give instructions to AMIE (Automated-vehicle Multimodal In-cabin Experience), the agent should parse such commands properly and trigger the appropriate functionality of the AV system. Our AMIE scenarios describe and support a variety of natural commands for interacting with the vehicle. To study these interactions, we collected a multimodal in-cabin dataset of multi-turn dialogues between passengers and AMIE using a Wizard-of-Oz scheme. Building on recent Recurrent Neural Network (RNN) based techniques, we developed hierarchical models that recognize passenger intents together with the relevant slots associated with the action to be performed in AV scenarios, reaching F1-scores of 0.91 for utterance-level intent recognition and 0.96 for slot extraction.

2 Methodology

Our AV in-cabin dataset includes 30 hours of multimodal data collected from 30 passengers (15 female, 15 male) in 20 rides/sessions. Ten passenger intent types are identified and annotated: Set/Change Destination, Set/Change Route (including turn-by-turn instructions), Go Faster, Go Slower, Stop, Park, Pull Over, Drop Off, Open Door, and Other (turn music/radio on/off, open/close window/trunk, change AC/temp, show map, etc.). Relevant slots are identified and annotated as: Location, Position/Direction, Object, Time-Guidance, Person, Gesture/Gaze (this, that, over there, etc.), and None. In addition to utterance-level intent types and their slots, word-level intent keywords are also annotated as Intent. We obtained 1260 unique utterances containing commands to AMIE from our in-cabin dataset. We expanded this dataset via Amazon Mechanical Turk and ended up with 3347 utterances containing intents. Intent and slot annotations were obtained on the transcribed utterances by majority vote among 3 annotators.
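For concreteness, the sketch below shows one plausible way to represent a single annotated utterance under this scheme; the example utterance, its tags, and the record layout are illustrative assumptions rather than entries from our dataset.

```python
# Illustrative (hypothetical) annotation record for one passenger utterance.
# Word-level tags mark slot types and intent keywords; the utterance-level
# label is one of the 10 AMIE intent types described above.
annotated_utterance = {
    "utterance": "could you stop the car next to the pharmacy over there",
    "tokens": ["could", "you", "stop", "the", "car", "next", "to",
               "the", "pharmacy", "over", "there"],
    # One tag per token: Intent keyword, a slot type, or O (None).
    "word_tags": ["O", "O", "Intent", "O", "Object", "Position/Direction",
                  "Position/Direction", "O", "Location", "Gesture/Gaze",
                  "Gesture/Gaze"],
    "intent": "Stop",      # utterance-level intent type
    "annotators": 3,       # final tags taken by majority vote
}
```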

For the slot filling and intent keyword extraction tasks, we experimented with seq2seq LSTM and GRU networks as well as Bidirectional LSTM/GRUs. The passenger utterance is fed into the Bi-LSTM network as a sequence of words, which an embedding layer transforms into word vectors. We also experimented with GloVe, word2vec, and fastText as pre-trained word embeddings. To prevent overfitting, a dropout layer is used for regularization. The best results are obtained with Bi-LSTMs and GloVe embeddings (6B tokens, 400K vocabulary, 100 dimensions).
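A minimal sketch of such a Bi-LSTM sequence tagger (in tf.keras) is given below; apart from the 100-dimensional GloVe embeddings mentioned above, all sizes (sequence length, vocabulary, hidden units, dropout rate, tag inventory) are assumed placeholders, and the pre-trained embedding matrix is assumed to be built separately from the GloVe file.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

MAX_LEN = 40        # assumed maximum utterance length (tokens)
VOCAB_SIZE = 10000  # assumed vocabulary size after word indexing
EMBED_DIM = 100     # GloVe 6B, 100-dimensional vectors (as noted above)
NUM_TAGS = 8        # assumed tag set: 7 slot labels (incl. None) + Intent keyword

# Assumed to be filled from the pre-trained GloVe file beforehand:
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM), dtype="float32")

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    # Word indices -> pre-trained word vectors
    layers.Embedding(
        VOCAB_SIZE, EMBED_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        mask_zero=True),
    # Bidirectional LSTM over the word sequence
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    # Dropout for regularization, as described above
    layers.Dropout(0.5),
    # Per-token softmax over slot / intent-keyword tags
    layers.TimeDistributed(layers.Dense(NUM_TAGS, activation="softmax")),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```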

For utterance-level intent detection, we experimented mainly with 5 models: (1) Hybrid: RNN + Rule-based, (2) Separate: Seq2one Bi-LSTM + Attention, (3) Joint: Seq2seq Bi-LSTM for slots/intent keywords & utterance-level intents, (4) Hierarchical + Separate, (5) Hierarchical + Joint. For (1), we extract intent keywords/slots (Bi-LSTM) and map them into utterance-level intent types (rule-based via term frequencies for each intent). For (2), we feed the whole utterance as the input sequence and the intent type as the single target. For (3), we experiment with joint learning models [1, 2, 5] where we jointly train word-level intent keywords/slots and utterance-level intents (adding <BOU>/<EOU> tokens, tagged with the utterance-level intent type, to the start/end of each utterance). For (4) and (5), we experiment with hierarchical models [6, 3, 4] where we extract intent keywords/slots first, and then feed only the predicted keywords/slots as a sequence into (2) and (3), respectively.
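To illustrate the joint formulation in (3), the sketch below shows one plausible way to build training sequences in which the added <BOU>/<EOU> positions carry the utterance-level intent as their target tag; the helper function and the example record are hypothetical, and the resulting sequences can be fed to the same kind of Bi-LSTM tagger sketched above. For the hierarchical variants (4) and (5), the word inputs would be replaced by the slot/intent-keyword tags predicted in the first stage.

```python
def make_joint_example(tokens, word_tags, utterance_intent):
    """Build input/target sequences for the joint seq2seq model:
    word-level slot/intent-keyword tags and the utterance-level intent
    are learned together by tagging the added <BOU>/<EOU> tokens with
    the utterance-level intent type."""
    inputs = ["<BOU>"] + tokens + ["<EOU>"]
    targets = [utterance_intent] + word_tags + [utterance_intent]
    return inputs, targets

# Hypothetical example (not from the dataset):
inputs, targets = make_joint_example(
    tokens=["pull", "over", "here"],
    word_tags=["Intent", "Intent", "Gesture/Gaze"],
    utterance_intent="PullOver",
)
# inputs  -> ['<BOU>', 'pull', 'over', 'here', '<EOU>']
# targets -> ['PullOver', 'Intent', 'Intent', 'Gesture/Gaze', 'PullOver']
```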

3 Experimental Results

The slot extraction and intent keywords extraction results are given in Table 1 and Table 2, respectively. Table 3 summarizes the results of various approaches we investigated for utterance-level intent understanding. Table 4 shows the intent-wise detection results for our AMIE scenarios with the best performing utterance-level intent recognizer.
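The exact evaluation code is not part of this summary; the sketch below shows one plausible way to compute such 10-fold cross-validation F1-scores at the token level with scikit-learn, where train_fn and predict_fn are hypothetical callables wrapping any of the taggers above and the weighted averaging is an assumption.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

def evaluate_10fold(utterances, tag_sequences, train_fn, predict_fn):
    """Assumed protocol: 10-fold CV with token-level F1-scores,
    weighted by label frequency and averaged over folds."""
    kf = KFold(n_splits=10, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in kf.split(utterances):
        # train_fn / predict_fn are hypothetical wrappers around a tagger
        model = train_fn([utterances[i] for i in train_idx],
                         [tag_sequences[i] for i in train_idx])
        y_true, y_pred = [], []
        for i in test_idx:
            y_true.extend(tag_sequences[i])
            y_pred.extend(predict_fn(model, utterances[i]))
        scores.append(f1_score(y_true, y_pred, average="weighted"))
    return float(np.mean(scores))
```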

Slot Type F1
Location 0.95
Position 0.95
Person 0.97
Object 0.89
Time Guidance 0.85
Gesture 0.92
None 0.98
AVERAGE 0.96
Table 1: Slot Extraction Results (10-fold CV)
Keyword Type F1
Intent 0.94
Non-Intent 0.99
AVERAGE 0.98
Table 2: Intent Keyword Extraction Results (10-fold CV)

Utterance-level Intent Detection Models F1
Hybrid-1: RNN + Rule-based (intent keywords) 0.86
Hybrid-2: RNN + Rule-based (intent keywords & slots) 0.90
Separate-1: Seq2one Bi-LSTM 0.88
Separate-2: Seq2one Bi-LSTM + Attention (with Context) 0.89
Joint: Seq2seq Bi-LSTM (intent keywords & slots & utterance-level intent types) 0.88
Hierarchical & Separate-1 0.90
Hierarchical & Separate-2 (Separate-1 + Attention) 0.90
Hierarchical & Joint 0.91
Table 3: Utterance-level Intent Recognition Results (10-fold CV)
AMIE Scenario / Intent Type / F1
Finishing the Trip Use-cases:
  Stop 0.92
  Park 0.94
  PullOver 0.96
  DropOff 0.95
Set/Change Destination/Route:
  Set/ChangeDest 0.89
  Set/ChangeRoute 0.88
Set/Change Driving Behavior/Speed:
  GoFaster 0.88
  GoSlower 0.88
Others (Door, Music, A/C, etc.):
  OpenDoor 0.95
  Other 0.82
AVERAGE 0.91
Table 4: Intent-wise Performance Results of Utterance-level Intent Recognition Models: Hierarchical & Joint (10-fold CV)

4 Conclusion

After exploring various recent Recurrent Neural Network (RNN) based techniques, we built our own hierarchical joint models to recognize passenger intents along with the relevant slots associated with the action to be performed in AV scenarios. Our experimental results outperformed certain competitive baselines and achieved overall F1-scores of 0.91 for utterance-level intent recognition and 0.96 for the slot extraction task.

References