Explainable AI in the context of autonomous systems, such as self-driving cars, has drawn broad interest from researchers. Recent studies have found that providing explanations for an autonomous vehicle’s actions has many benefits (e.g., increased trust and acceptance), but they place little emphasis on when an explanation is needed and how the content of an explanation changes with context. In this work, we investigate in which scenarios people need explanations and how the critical degree of an explanation shifts with situations and driver types. Through a user experiment, we ask participants to evaluate how necessary an explanation is and measure the impact on their trust in self-driving cars in different contexts. We also present a self-driving explanation dataset with first-person explanations and associated measures of necessity for 1103 video clips, augmenting the Berkeley Deep Drive Attention dataset. Additionally, we propose a learning-based model that predicts, in real time and from camera data, how necessary an explanation is for a given situation. Our research reveals that driver types and context dictate whether or not an explanation is necessary and what is helpful for improved interaction and understanding.
CCS Concepts: Human-centered computing → User studies; Computing methodologies → Computer vision; Computing methodologies → Temporal reasoning; Human-centered computing → Accessibility; Human-centered computing → User centered design
Artificial intelligence (AI) is becoming prevalent in everyday life, powering smart devices, personalizing assistants, and enabling autonomous vehicles. However, people find it difficult to accept, understand, or trust this technology [hengstler2016applied]. This is especially true in the self-driving car industry, where people are hesitant to hand over control of the steering wheel to AI [people_low_trust_av, 10.1145/3369457.3369493]. One reason for this distrust is that people are uncertain about how these sophisticated models make car control decisions. Given the ambiguity in the decision process, it is hard for passengers to tell whether the model makes the right judgment in the current situation. The consequences of this ambiguity include financial loss, legal issues, and even loss of human lives [accident].
To improve the understanding of the self-driving car decision process, researchers have explored different perspectives on explainable AI (XAI) in the autonomous vehicle domain [8631448, Gunningeaay7120]. For example, recent research has shown that introducing explanations for self-driving control decisions can increase trust [DU2019428]. That study found that people trust (and prefer) explanations presented before the car takes an action, compared to after-action, no explanation, and intervention-based explanations. Others have argued that explanations for autonomous vehicles should be adapted to context [trust_explain_AV].
To generate text-based descriptions, researchers collected the Berkeley DeepDrive Explanation Dataset, in which they extensively annotated every single car decision with explanations [BDDX]. However, existing research has over-emphasized the benefits of introducing explanations for self-driving cars and neglected the possibility that people might not need those explanations in certain scenarios. In other words, it is still unclear when it is necessary to introduce explanations about autonomous decisions, and whether we should address individuals differently.
In this work, we examine when and how explanations should be presented to users of autonomous vehicles. Specifically, we investigate which scenarios people need explanations for and how the critical degree of an explanation shifts with situations and driver types. We focus on text-based descriptions with varying content to assess what information and narrative are preferred. Our findings are validated in an online survey-based user experiment, in which subjects imagine themselves as passengers of the vehicles in driving video clips covering a variety of scenarios. For each video, we record each participant’s reported explanation necessity rating, attentiveness score, and preferred explanation content. During the post-survey, we collect participants’ responses on their driver types and general level of trust in autonomous vehicles.
Using the data collected, we aim to understand the relationship between driver types and the necessity of an explanation in particular contexts. In early tests, we observed disagreements among participants on the explanation necessity level. For instance, we found that people tend to agree that explanations are necessary in near-crash events, but there was no obvious agreement for ordinary or anomalous driving situations in aggregate. As we will show, examining factors like driver type (i.e., cautious, aggressive) and context uncovers a relationship linking the necessity of explanations with the scenario and driver type.
Building on this insight from the user study, we also present a data-driven model for estimating the explanation necessity level online. This model allows us to assess when and if an explanation is needed, and can be tuned to particular user preferences. To explore the explanation necessity level for more diverse scenarios, we present a self-driving explanation dataset that augments the Berkeley Deep Drive Attention dataset [BDDA]. Each of the 1103 video clips in the dataset is annotated with explanation content, the explanation time interval, and an associated measure of necessity. The explanation necessity score ranges from 0 to 1 and indicates how critical an explanation is for the given scenario. Our proposed model makes personalized, real-time predictions about whether or not an explanation is needed. The model takes a sliding window of camera video as input and outputs a binary decision on whether an explanation is necessary.
The proposed solution can be combined with state-of-the-art explanation generation models by suggesting explanation moments. The combined system can generate a text-based explanation at the right moment for different individuals, whereas existing methods continually generate explanations with little to no personalization. Tuning the frequency of explanations (or, more generally, interventions) is key for improving acceptance and trust in autonomous systems [shia2014semiautonomous, driggs2015improved].
In summary, our contributions are as follows:
Our user study shows that the explanation necessity level is affected by driver types and driving scenarios or context.
We find that people tend to agree on the explanation necessity more for near-crash/emergent driving scenarios, and less for ordinary driving situations.
We present a dataset with video clips labeled with an explanation necessity degree, an explanation moment, and first-person explanation content.
We propose a model that makes personalized predictions in real-time about whether or not an explanation is needed.
The paper is organized as follows. In the next section, we present related work on XAI for autonomous vehicles and its relationship to trust and user acceptance. Then, we introduce our user study on the necessity of explanations. Using the insight of how explanation necessity varies between scenarios, we present an explanation necessity dataset, which augments the original Berkeley Deep Drive Attention dataset. Finally, with our explanation necessity data available, we propose a learning-based model that can predict in real time whether an explanation is needed. We conclude with a discussion of the implications, limitations, and findings of our work.
3 Related Work
As has been found for many AI-enabled systems, autonomous cars (and driving assistance systems) are likely better off with self-explaining capabilities, which we hypothesize would facilitate their acceptance among mainstream consumers.
3.1 Benefits of Explainable AI
There has been a considerable number of studies on the benefits of XAI [10.1007/978-3-030-12544-8_21]. Specifically in the area of autonomous vehicles, Choi et al. concluded that trust is a significant determinant of people’s intention to use self-driving cars [related-choi]. With the role of trust established, Koo et al. demonstrated that providing the why information, which specifies the reasoning behind an autonomous vehicle’s action, created the least anxiety and the highest trust [Koo2015]. These propositions about the relationship between humans and AI were soon confirmed by Petersen et al., who investigated trust in vehicles with driver assistance systems by manipulating drivers’ situational awareness. Their study concluded that situational awareness promoted and moderated the impact of human trust in autonomous vehicles [Peterson-related].
3.2 Forms of Explanations
To properly provide the "why" information or the situational awareness mentioned above, one needs to specify how an explanation will be provided. Wiegand et al. conducted a user experiment on the effect of nine visual explanations about the driving scenario [visual_explanation]. In their study of how these explanations should be provided, Ruijten et al. presented an intelligent user interface that mimics human behavior [Ruijten-related]. Their research took place in a fixed-base driving simulator, and the simulator held a spoken conversation with the participants whenever an automated action took place. In a similar experimental setting, Koo et al.’s study, which also took place in a simulator, provided the explanations via voice alerts [Koo2015]. Unlike the above formats, our experiment takes the form of an online survey that does not specify how explanations will be provided, as we are interested in the explanation timing and content.
3.3 Contents of Explanations
Besides the various formats for providing the information, there are also studies about the optimal content of explanations. Studies in user interface design have shown that the amount and type of information conveyed impact user trust and situational awareness [rezvani2016towards, 8317686]. As mentioned above, Koo et al. analyzed the outcomes of providing different explanations. Specifically, providing only how information, which describes the action itself (e.g., "The car is braking"), led to poor driving performance; providing only why information, which describes the reasoning for actions (e.g., "Pedestrians ahead"), led to better driving performance and was preferred by drivers; providing both the how and why information led to the safest driving performance but increased negative feelings in drivers [Koo2015]. While Koo et al.’s study categorized possible explanations and suggested the optimal explanation content, it focused on explanations in the third-person perspective and did not consider how participants react to a first-person narrative.
In our experiment, we investigated the favorability rankings of four explanations: "cause and effect" information in the first-person perspective, "cause and effect" information in the third-person perspective, "effect" information in the first-person perspective, and "effect" information in the third-person perspective. Furthermore, we asked our participants whether they would prefer to add a human-centered component (e.g., "Don’t worry") to their top-ranked explanation content choice.
3.4 Timing of Explanations
Apart from the forms and contents of explanations, we argue that it is also crucial to explore the timing of explanations. Timing for warnings and potential interventions is a key concern for (semi-)autonomous vehicles [mok2015emergency], which makes the identification of key events crucial for trustworthy autonomy. Koo et al. claim that it is critical to provide information to drivers/passengers ahead of an event [Koo2015]. Haspiel et al. designed a user experiment on the importance of explanation timing in promoting trust in AVs [before_action]. Their study found that explanations provided before the AV action promote more trust than explanations provided after.
Although existing studies agree that explanations are more meaningful when given before an autonomous vehicle action, they fail to take into account that the necessity of providing explanations varies across driving scenarios. In our experiment, we introduced a "critical score," a number between 0 and 1 indicating how necessary an explanation is at each timestamp.
3.5 Definition of Critical Score
In previous studies about the relationship between humans and autonomous vehicles, there have been various definitions of a critical score, namely, how critical a driving situation is for the passenger. Notably, Yurtsever et al. asked ten participants to watch driving video clips and give a (subjective) score for the risk of the maneuver seen in the videos [Yurtsever-risk]. After normalizing each annotator’s ratings and taking the mean score as the final risk rating, they defined the top 5% of the riskiest videos as risky. However, it is necessary to distinguish "critical" (subjective) driving scenarios from "accident likely" (objective) situations. Another study showed that the risk perceived by humans is not necessarily proportional to the actual collision or accident probability associated with a specific driving situation [Fuller-related]. Keeping these in mind, we proposed a human-centric, XAI-friendly definition of critical score: the necessity of an explanation for a particular driving maneuver.
3.6 Self-driving Explanation Dataset
The existing explanation dataset for self-driving suffers from various issues. For example, the Berkeley DeepDrive eXplanation (BDD-X) dataset exhaustively labeled 6970 driving clips with explanations in specified video intervals [BDDX]. However, a large portion of its driving clips are uneventful samples (e.g., cruising on the highway at constant speed), where humans have little need for the self-driving system to explain the situation. Furthermore, portions of some video clips are anomalous in that the drivers do not follow the traffic rules (e.g., do not stop at a stop sign), and thus have a poor (or illogical) explanation given the rules of the road. Meanwhile, as the explanations focus on describing the car model’s decisions, the explanation content may not be ideal for promoting a smooth conversation with humans.
4 User Study
To understand people’s need for a self-driving explanation in different scenarios, we conducted an online survey-based experiment (Figure 1). The experiment took 40 minutes, during which we showed participants driving video clips and collected their responses. The goal of this experiment is to understand: (1) how necessary a text-based explanation about self-driving car actions is for different scenarios; (2) what the generally preferred explanation content is and whether this is related to context; and (3) the relationship between user trust and explanations for autonomous vehicles.
In our experiment, we target the following outcomes (dependent variables): explanation necessity, preferred explanation content, and user trust. We manipulate, vary, or estimate the following independent variables and influential factors: attention (how much attention participants pay to the videos); driver types (aggressive or cautious); driving scenarios; explanation content (cause, effect, narrative type, Table 1); and presence of explanations. Based on the above three sets of dependent / independent variables, we derived three hypotheses:
Explanation necessity is correlated with attention, driver types, and driving scenarios.
User’s preferred explanation content is dependent on driving scenarios.
The presence of explanations will increase a user’s trust in the automated vehicle.
In total, we had 18 participants for this user experiment. The majority of our participants were college students, aged between 18 and 51. Participants qualified as long as they could see the video clips through our survey system. Among the participants, 16 out of 18 had a driver’s license, and their driving experience was evenly distributed from 0 to 6 years, with one participant having more than six years of experience.
4.4 Sample Strategy
The driving video clips are sampled from our explanation dataset, described in the Dataset section. To capture a variety of typical scenarios, we conducted a text-based clustering using our annotated explanations for each video clip in our dataset.
Our goal is to capture the different but representative scenarios in the dataset. We used hierarchical clustering with average linkage [johnson1967hierarchical]. Specifically, we pre-processed the input text by lowercasing it and converting it to TF-IDF scores [ramos2003using]. We used cosine similarity as the distance metric for the clustering. In the end, we used the videos at the 38 cluster centers for the user experiment.
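The sampling pipeline described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the study: the smoothed IDF variant, whitespace tokenization, and the choice of the cluster medoid as the "cluster center" are assumptions, and the function names (`tfidf_vectors`, `average_linkage`, `medoid`) are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Lowercase, split on whitespace, return one {term: tf-idf} dict per text."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    # Smoothed idf (assumption: the paper does not specify the variant).
    idf = {term: math.log((1 + n) / (1 + c)) + 1 for term, c in df.items()}
    return [{term: tf * idf[term] for term, tf in d.items()} for d in docs]

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def average_linkage(vectors, k):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest mean pairwise distance until only k clusters remain."""
    n = len(vectors)
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]

    def linkage(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, dist

def medoid(cluster, dist):
    """Member with the smallest total distance to the rest of its cluster
    (assumed stand-in for the paper's 'cluster center')."""
    return min(cluster, key=lambda i: sum(dist[i][j] for j in cluster))

# Toy explanations (made up for illustration only).
texts = [
    "I will slow down because the light is red",
    "I will slow down because the light turned red",
    "A pedestrian is crossing so I stop",
    "I stop because a pedestrian is crossing the road",
]
clusters, dist = average_linkage(tfidf_vectors(texts), k=2)
representatives = [medoid(c, dist) for c in clusters]
```

In the study, the same idea would be applied with 38 clusters over the annotated explanations, and the video attached to each representative explanation would be shown to participants.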
4.5 Study Design & Procedure
We used Google Forms as our platform for this user experiment. The experiment took 40 minutes to finish, including a break every fifteen minutes.
|Content||Perspective||Example explanation|
|action + reason||first-person||I’ll slow down because the traffic light is broken.|
|action + reason||third-person||The car will slow down because the traffic light is broken.|
|action||first-person||I’ll slow down.|
|action||third-person||The car is about to slow down.|
During the experiment, our participants watched 38 independent short driving video clips (as described in the previous subsection). Participants were told to imagine themselves as passengers riding in the vehicles. While the video was playing, the participants were free to be slightly distracted. Note that the video shown to participants was the raw video, without an explanation. Moreover, we randomized the order of the sampled videos to reduce bias, and by the end of the experiment each participant had watched all sampled videos. After each video, participants answered several follow-up questions. They rated how necessary an explanation is for the clip, referred to as a necessity score. Then, they described how attentive they were while watching the clip. Finally, they ranked several explanation candidates that we had prepared for each of the sampled videos. In particular, we prepared four different types of explanation content separately for each of the 38 driving scenarios, as presented in Table 1.
During the post-study phase, we asked several questions about how text-based explanations can affect people’s trust in autonomous vehicles, and about the participants’ driver types. In particular, we asked participants about their seat preferences in an ordinary car, an autonomous vehicle, and an autonomous vehicle with explanations (Figure 2).
4.6 Quantitative Analysis
We summarized our findings in three different aspects:
A correlation analysis between different scenarios and the reported explanation necessity level using Pearson correlation and Point-Biserial Correlation was performed [benesty2009pearson, PBC].
To test whether there is a global preferred explanation content format, we performed a Friedman test to check the distribution of different explanation options [wiki:Friedman_test].
We analyzed the relationship between the presence of explanations and trust through changes in seat preference under different conditions.
For this section, we start by describing general statistics about the user study result, and then we elaborate on the three aspects above.
To identify driver types, we considered a participant aggressive if she satisfies any of the following conditions: (1) her actual driving speed is usually above 35 mph on a road with a speed limit of 30 mph; (2) she explicitly describes her driving type as aggressive; or (3) she reports changing lanes frequently even when unnecessary. Otherwise, we consider the participant cautious.
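The labeling rule above is a simple disjunction and can be written down directly. The following is a minimal sketch; the `DriverProfile` container and its field names are hypothetical and only mirror the three self-reported conditions in the text.

```python
from dataclasses import dataclass

@dataclass
class DriverProfile:
    # Hypothetical field names; all values come from post-study self-reports.
    usual_speed_in_30mph_zone: float   # mph
    self_described_aggressive: bool
    changes_lanes_unnecessarily: bool

def driver_type(profile: DriverProfile) -> str:
    """Return 'aggressive' if any of the three conditions holds, else 'cautious'."""
    if (profile.usual_speed_in_30mph_zone > 35
            or profile.self_described_aggressive
            or profile.changes_lanes_unnecessarily):
        return "aggressive"
    return "cautious"
```

For example, a participant who reports driving at 33 mph, does not describe herself as aggressive, and does not change lanes unnecessarily would be labeled cautious.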
4.6.1 General Statistics
From our post-study questions, we learned some general views from our participants on autonomous vehicles. Even though 84.1% of people are personally excited about autonomous vehicles, giving a rating higher than five on a 1-to-10 Likert scale, participants expressed general doubts about the feasibility of autonomous vehicles: 77.8% of participants expressed a low level of trust in autonomous vehicles. Among our participants, only 10.6% believe that self-driving techniques will be readily available to the public in the next two years. On the other hand, participants expressed overall trust in the reliability of explanations from autonomous vehicles: 72.2% of the participants gave scores higher than five on a 1-to-10 Likert scale. This finding tentatively suggests that introducing the right explanation content has the potential to influence people’s trust in autonomous vehicles.
From the driving clip questions, we learned that the average explanation necessity level for each of the 38 driving scenarios ranges from 2.27 to 8.22 on a 1-to-10 Likert scale. We observed that people generally disagree on how necessary an explanation is for the same scenario, with an average standard deviation of 2.97 across the 38 driving scenarios. Furthermore, we observed that the average explanation necessity score given by aggressive drivers is 18% lower than that given by cautious drivers.
4.6.2 Correlation Analysis on Explanation Necessity
We performed a correlation analysis of the explanation necessity level with respect to the following four aspects: attention level, driver types, the presence of motion sickness, and driving scenarios.
For the attention level, we used the Pearson correlation to calculate the correlation between explanation necessity and attention score, since both are treated as continuous variables [benesty2009pearson]. The result is 0.19, which indicates a weak positive relationship between being attentive to the video clips and the need for scenario explanations. In other words, the more attention a participant pays to the video clips, the more she tends to need an explanation of the autonomous vehicle’s action.
For driver types, we used the Point-Biserial Correlation to calculate the correlation between explanation necessity and driver type [PBC]. The resulting correlation is -0.14, which indicates a weak negative relationship between being an aggressive driver and the need for scenario explanations. In other words, aggressive drivers tend to need an account from the vehicle less than cautious drivers do.
Similarly, for the presence of motion sickness among the participants, we used the Point-Biserial Correlation to calculate its correlation with explanation necessity. The result is 0.046, which indicates a very weak positive relationship between motion sickness and the need for explanations.
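To make the two statistics used above concrete, the sketch below computes the Pearson correlation and the point-biserial correlation from their textbook definitions. The toy ratings are made up for illustration and are not the study data; the function names are hypothetical. The point-biserial coefficient is numerically equal to the Pearson correlation when the binary variable is coded as 0/1, which the last line checks.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def point_biserial(binary, values):
    """r_pb = (M1 - M0) / s * sqrt(p * q), with s the population standard
    deviation of `values`, p the share of 1s, and q = 1 - p."""
    n = len(values)
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    m1 = sum(group1) / len(group1)
    m0 = sum(group0) / len(group0)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    p = len(group1) / n
    return (m1 - m0) / s * math.sqrt(p * (1 - p))

# Toy data (illustrative only): 1 = aggressive driver, 0 = cautious driver.
is_aggressive = [1, 1, 0, 0, 0, 1]
necessity = [3, 4, 7, 8, 6, 5]   # 1-to-10 necessity ratings

r_pb = point_biserial(is_aggressive, necessity)
r_check = pearson(is_aggressive, necessity)  # same value with 0/1 coding
```

A negative `r_pb` in this toy example corresponds to the direction reported in the text: the aggressive group gives lower necessity ratings on average.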
|Scenarios||Correlation with Necessity|
Finally, for driving scenarios, we tagged binary attributes for the 38 driving scenarios in advance (e.g., whether the vehicle in a video slows down, or whether the video is a near-crash scenario). Then, we used the Point-Biserial Correlation to calculate the correlation between explanation necessity and the corresponding scenario attribute, as shown in Table 2 [PBC]. For example, the first row shows that the correlation between explanation necessity and whether the video is a near-crash scenario is 0.18, indicating that explanations tend to be rated as more necessary in near-crash situations.
The correlations of explanation necessity with attention, driver types, and driving scenarios support our hypothesis 1 - Explanation necessity is correlated with attention, driver types, and driving scenarios.
4.6.3 Explanation Content Preference
To investigate whether there is a generally preferred explanation format (Table 1) across all scenarios, we performed Friedman tests separately for the different driving situations to handle the ranking data of explanation contents [wiki:Friedman_test]. Our null hypothesis is that there is no difference in preference among the explanation contents. We set the significance level to 0.05. According to our results, only 16 out of 38 scenarios reject the null hypothesis. However, we did not find that those 16 scenarios share quantitative attributes in our data. Therefore, we concluded that there is no globally preferred explanation format for self-driving scenarios, and thus our hypothesis 2 - Explanation content is correlated with driving scenarios - does not hold.
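As a worked illustration of the per-scenario test, the sketch below computes the Friedman statistic for one scenario from participants' rankings of the four explanation types and compares it against the chi-square distribution with three degrees of freedom. It assumes complete rankings without ties; the function names are hypothetical, and the closed-form survival function is valid only for df = 3 (four ranked options). The toy rankings are made up.

```python
import math

def friedman_statistic(rankings):
    """rankings[i][j] is the rank (1..k) that participant i gave option j.
    Q = 12 / (n k (k+1)) * sum_j R_j^2 - 3 n (k+1), with R_j the rank sums."""
    n = len(rankings)
    k = len(rankings[0])
    rank_sums = [sum(r[j] for r in rankings) for j in range(k)]
    return (12.0 / (n * k * (k + 1)) * sum(R * R for R in rank_sums)
            - 3.0 * n * (k + 1))

def chi2_sf_df3(x):
    """Survival function of the chi-square distribution with 3 degrees of
    freedom (closed form; only valid for df = 3)."""
    if x <= 0:
        return 1.0
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

# Toy case 1: five participants who all rank the four options identically.
unanimous = [[1, 2, 3, 4]] * 5
q_unanimous = friedman_statistic(unanimous)
p_unanimous = chi2_sf_df3(q_unanimous)

# Toy case 2: perfectly balanced rankings, so no option is preferred overall.
balanced = [[1, 2, 3, 4], [2, 3, 4, 1], [3, 4, 1, 2], [4, 1, 2, 3]]
q_balanced = friedman_statistic(balanced)
p_balanced = chi2_sf_df3(q_balanced)

ALPHA = 0.05
reject_unanimous = p_unanimous < ALPHA   # preference differs across options
reject_balanced = p_balanced < ALPHA     # no evidence of a preferred option
```

With small samples the chi-square approximation is rough, so the p-values in this sketch should be read as indicative rather than exact.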
4.6.4 Explanation and Trust
In the post-study questionnaire, we asked our participants about their preferred seats in three types of vehicles: ordinary vehicles, autonomous cars without explanations, and autonomous vehicles with explanations. From the results of the changes in seat preferences, as shown in Figure 4, we derived two types of sentiment changes. One is the relaxation of participants, which is reflected by changing from front seats to back seats because of the comfort back seats could bring. Another one is due to the anxiety of the participants. People either change from other positions to the driver seat so that one can take control of the vehicles when necessary, or change from front seats to back seats because statistically, back seats are safer than front seats.
When the condition changed from an ordinary car to an autonomous vehicle without explanations, among people who changed seats, only 12.5% changed because they felt relieved and relaxed in autonomous cars. At the same time, 87.5% of people changed their positions because they do not fully trust the vehicle and would like to take control quickly if anything unexpected happens. Then, we found that among those participants who changed their seat preferences when the vehicle condition changed from an autonomous vehicle without explanations to an autonomous vehicle with explanations, 83.3% changed their seats because they felt relieved and relaxed in an autonomous vehicle with explanations.
We therefore conclude that providing explanations can, in general, help people be less stressed and worried in an autonomous vehicle, which we believe indicates an elevation of trust. Hence, our hypothesis 3 - Trust in the vehicles is correlated with the presence of explanations - holds.
4.7 Qualitative Analysis
Besides the quantitative analysis above, we performed a qualitative study on the relationship between scenarios and explanation necessity. We noticed that the standard deviation for explanation necessity scores given by the participants varies a lot for different situations. To investigate whether there are common properties among scenarios that people mostly agree or disagree on, we compared the scene that has the highest standard deviation of explanation necessity with the situation that has the lowest standard deviation of explanation necessity, as shown in Figure 5. From this comparison, we observed that people tend to agree on the near-crash/emergent driving situations (e.g., cars cutting-in suddenly). On the other hand, for the ordinary scenarios (e.g., driving smoothly on the highway), people’s opinion on explanation necessity varies a lot.
Through our quantitative and qualitative analysis, we found that factors related to driving scenarios and passenger identities influence the need for explanations, and therefore generalizing the explanation necessity level across driving scenarios is challenging. In other words, the explanation necessity has to be analyzed and predicted on a case-by-case basis. We ask whether a learning-based model can achieve this goal using temporal representations of a driving scene. Given the rich diversity of driving scenarios, training such a model requires a large-scale dataset of explanation necessity. Therefore, we annotated explanation-related metadata on the video clips of the BDD-A dataset [BDDA]. We introduce our explanation necessity dataset in the next section.
Our user study suggests that the explanation necessity is correlated with driving scenarios and has to be analyzed on a case-by-case basis. Using this insight, we aim to build a data-driven model that can learn to predict necessity scores tailored to driver types and contexts.
In order to build this model, we present an explanation necessity dataset for autonomous vehicles. We limit the scope of the explanation format to text-based explanations. The purpose of this dataset is to provide precise, case-by-case, first-person perspective explanations that address the following questions: (1) when people need a reason for a driving decision; (2) how critical the explanation is; and (3) what the first-person perspective explanation content should be.
Rather than the Berkeley DeepDrive Explanation (BDD-X) dataset [BDDX], we selected the driving video clips from the Berkeley DeepDrive Attention (BDD-A) dataset [BDDA], which initially contains 1232 braking-event driving videos captured by a front-mirror dashcam. Even though the BDD-X dataset is six times larger than the BDD-A dataset, our empirical analysis found that a large portion of its videos are ordinary driving scenes (e.g., driving on the highway), which do not contain moments worth explaining explicitly. In addition, the average video duration in BDD-X is around 30 to 60 seconds, much longer than in BDD-A, whose videos usually last approximately 10 seconds. Besides, BDD-A also provides human gaze, car speed, and GPS metadata per frame.
5.2 Dataset Assumption
Our high-level assumption is similar to the data collection assumption proposed by Xia et al. [BDDA], where participants imagine themselves in the car of the driving videos. Specifically, we made the following assumptions:
The actual driver in the video follows the traffic laws.
The car action in the video is safe. In other words, the car action should not put the passengers into a high-risk car accident.
The recipient of the explanation would be a passenger of a fully autonomous vehicle. This assumption means that the perspective and the sense/ability of control are different.
Every driving clip has at least one explanation moment, but the degree of explanation necessity can vary.
If any video clip failed to satisfy any of the assumptions, we removed it from our dataset.
5.3 Dataset Statistics
Our dataset contains 1103 driving video clips in total. From the video clips in the BDD-A dataset, we filtered out driving clips that did not meet our assumption criteria (e.g., drivers did not follow the traffic rules, or poor video quality such as skipped frames) [BDDA]. Five annotators were recruited for this dataset; all were college students aged between 18 and 25 with a valid driving license in the US.
5.4 Data Model
|Field||Type||Description|
|message||string||text explanation in first-person perspective|
|gazemap||mp4||human gazemap at each timestamp|
|necessity score||float||necessity degree for explanation|
|speed||float||car speed at each timestamp|
|course||float||car course (heading) at each timestamp|
|explanation interval||float tuple||time segments for explanation to occur|
Every video clip is annotated with one explanation moment. Each explanation contains a time interval in which the explanation should take place, a first-person perspective explanation, and an explanation necessity score indicating how critical an explanation is at that moment (Table 3).
Instead of a binary response, the explanation necessity score is a floating-point number ranging from 0 to 1. To get a generalized explanation score for each driving clip, we collected responses from 5 different people and used the truncated mean to obtain a general critical score.
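One possible in-memory representation of a record following Table 3 is sketched below. The class name, the field names, the use of a file path for the gazemap video, and the per-timestamp lists for speed and course are assumptions made for illustration; the dataset's actual storage format is not specified here.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ExplanationAnnotation:
    """Hypothetical record mirroring the fields listed in Table 3."""
    message: str                                # first-person text explanation
    gazemap_path: str                           # path to the gazemap mp4 (assumed)
    necessity_score: float                      # aggregated score in [0, 1]
    speed: List[float]                          # car speed at each timestamp
    course: List[float]                         # car course at each timestamp
    explanation_interval: Tuple[float, float]   # (start, end) in seconds

    def __post_init__(self):
        # Enforce the value ranges stated in the text.
        if not 0.0 <= self.necessity_score <= 1.0:
            raise ValueError("necessity_score must lie in [0, 1]")
        start, end = self.explanation_interval
        if start > end:
            raise ValueError("explanation_interval needs start <= end")

# Illustrative record; all values are made up.
example = ExplanationAnnotation(
    message="I'll slow down because the traffic light is broken.",
    gazemap_path="gazemaps/clip_0001.mp4",
    necessity_score=0.7,
    speed=[8.2, 7.9, 6.5],
    course=[181.0, 181.5, 182.0],
    explanation_interval=(2.5, 4.0),
)
```

Validating the score range and interval ordering at construction time keeps malformed annotations from silently entering a training set.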
5.5 Data Collection Platform
We built an iOS application (Figure 6) to annotate the driving clips dataset, with Firebase as our backend. The app randomly distributed video clips from BDD-A to annotators, collected explanation responses, and uploaded the results to the cloud backend. At the end of the data collection process, each annotator had annotated every single video clip in the dataset.
5.6 Data Collection Procedure
For each video clip, the annotators started by playing the driving video (Figure 6). Once they reached a point where they considered that the driving scene needed an explanation, they clicked the Record button. A pop-up window then asked them to give a floating-point score indicating how necessary an explanation was for this moment, based on their judgment. Finally, they fine-tuned the timestamp to reflect the moment that needs an explanation. We removed cases in which either the driver did not follow traffic rules or the driver's action was too risky to be considered desirable self-driving behavior. For the explanation moment, we focused on recording before-action explanations, because previous research indicated that people place more trust in before-action explanations in autonomous vehicle settings [before_action].
5.7 Data Post-Processing Steps
To extract a general explanation necessity score from different people's ratings of each example, we used the truncated mean, a statistical measure of central tendency [wiki:truncated_mean]. Specifically, for each clip, we calculated the average of the explanation necessity scores after discarding the highest and lowest scores. The advantage of the truncated mean is that it reduces the influence of extreme scores.
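The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `truncated_mean`, the `trim` parameter, and the example ratings are assumptions made here.

```python
def truncated_mean(scores, trim=1):
    """Average the scores after dropping the `trim` lowest and `trim` highest values."""
    if len(scores) <= 2 * trim:
        raise ValueError("need more than 2 * trim scores")
    kept = sorted(scores)[trim:len(scores) - trim]
    return sum(kept) / len(kept)


# Hypothetical ratings from five annotators for one clip (values are made up).
ratings = [0.9, 0.2, 0.6, 0.7, 0.5]
score = truncated_mean(ratings)  # averages the three middle ratings: 0.5, 0.6, 0.7
```

With five ratings and `trim=1`, a single unusually high or low rating has no effect on the aggregated score.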
As for the explanation timestamp of each video example, we sorted the timestamps recorded by the different annotators for the corresponding explanation event. In other words, the explanation time is stored as a time interval that captures the moments at which the relevant explanation should occur.
Our user study suggests that explanation necessity has to be analyzed on a case-by-case basis. To predict the need for an explanation in a particular scenario, we propose a classification model that predicts in real time whether a person needs an explanation. The proposed solution can be combined with a state-of-the-art explanation model to generate a text-based explanation at the right moment. The model is trained on our explanation dataset. Its input is the image data at each time frame, and its output is a time series of binary explanation decisions. The architecture of the model is shown in Figure 7.
We treat this task as a classification problem rather than a regression problem. The reason is that, because of human labeling noise, the annotated necessity score reflects only a relative level of explanation necessity rather than an absolute value. Therefore, we transformed the regression task into a classification task by introducing a pre-specified necessity threshold. The threshold is influenced by personal properties such as driver type. In this paper, we assume the threshold is directly available. We leave the question of how to infer the threshold value from personal attributes for future work.
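To make the regression-to-classification step concrete, the sketch below binarizes an aggregated necessity score against a given threshold. The function name, the clip identifiers, the scores, and the two threshold values are hypothetical; the only rule taken from the text is that a score higher than the threshold yields a positive label.

```python
def to_binary_label(necessity_score, threshold):
    """Return 1 if the clip's aggregated necessity score exceeds the threshold, else 0."""
    if not 0.0 <= necessity_score <= 1.0:
        raise ValueError("necessity score must be in [0, 1]")
    return 1 if necessity_score > threshold else 0


# Made-up aggregated scores for three clips.
scores = {"clip_a": 0.82, "clip_b": 0.35, "clip_c": 0.55}

# Two hypothetical passengers: one wants explanations often, one rarely.
low_threshold = {clip: to_binary_label(s, 0.3) for clip, s in scores.items()}
high_threshold = {clip: to_binary_label(s, 0.7) for clip, s in scores.items()}
```

The same clip can receive different labels under different thresholds, which is how a single set of annotated scores can serve passengers with different explanation needs.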
6.2 Task & Notation
The input to the model is a four-second sliding window of video, paired with a binary label indicating explanation necessity at the end of the window. Our task is to train a classifier that predicts whether an explanation is necessary at the end of the input video clip. The predicted explanation necessity is computed from the input video frame sequence (frame index 1, 2, …, N), the explanation necessity threshold, which depends on the passengers' attributes (e.g., driver types), and a sign function.
6.3 Model Detail
First, the model generates a temporal representation of the image for the last frame in the video. To generate visual features from video frames, we use the Foveal Visual Encoder [foveal], proposed by Xia Y. et al. Compared with ImageNet pre-trained AlexNet features, the foveal features improved our test AUC from 9.9% to 24.6%. The proposed image encoder predicts the small regions on which human eyes will focus in each frame, then extracts image features only for the focused area from a high-resolution video. Then, our model passes those frame features (see Figure 7) into a Conv2dGRU module [NIPS2015_5955] and several convolution layers to generate a temporal-spatial representation for each frame i, where i = 1, 2, …, N. We extract the output of the last frame for the next step.
The spatial transform module converts the spatial features from the previous step into a linear representation. Because flattening loses no information, the model flattens the spatial features directly. We also experimented with the weighted-sum approach across spatial locations that Kim J. et al. use for their context vector construction (Equation (1) of their paper [BDDX]). However, performance decreased because of the information lost when features are compressed across spatial locations.
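The difference between the two spatial transforms can be illustrated with a toy feature map. This is a sketch under stated assumptions: the channels x height x width layout, the function names, and the uniform weights are chosen here for illustration and are not taken from the paper or from [BDDX].

```python
def flatten(feature_map):
    """Flatten a channels x height x width feature map into one vector (keeps every value)."""
    return [v for channel in feature_map for row in channel for v in row]


def spatial_weighted_sum(feature_map, weights):
    """Collapse each channel to a single value using weights over the H x W locations."""
    return [
        sum(w * v for w_row, v_row in zip(weights, channel) for w, v in zip(w_row, v_row))
        for channel in feature_map
    ]


# Toy feature map: 2 channels, 2 x 2 spatial grid (values are made up).
feature_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
# Hypothetical weights over spatial locations that sum to 1.
weights = [[0.25, 0.25], [0.25, 0.25]]

flat = flatten(feature_map)                             # 8 values, nothing discarded
pooled = spatial_weighted_sum(feature_map, weights)     # 2 values, one per channel
```

Flattening keeps all C x H x W values at the cost of a larger input to the next layer, while the weighted sum compresses each channel to one number and so cannot preserve where in the frame a feature occurred.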
We concatenate the visual features and the acceleration. We then apply several fully connected layers followed by ReLU nonlinearities, BatchNorm [10.5555/3045118.3045167], and dropout layers. The last step uses a sigmoid function to generate the predicted necessity score, which lies between 0 and 1. We use binary cross-entropy as our loss function.
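The final stage can be sketched numerically as follows. This is a simplified illustration, not the authors' network: a single hypothetical linear layer stands in for the stack of fully connected, BatchNorm, and dropout layers, and all names, weights, and input values are made up.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_cross_entropy(p, label, eps=1e-12):
    """BCE for one example: -[y log p + (1 - y) log(1 - p)], with p clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def predict(visual, acceleration, weights, bias):
    """Concatenate the inputs, apply one hypothetical linear layer, then the sigmoid."""
    features = list(visual) + [acceleration]
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(logit)


# Made-up numbers purely for illustration.
p = predict(visual=[0.2, -0.1, 0.4], acceleration=-1.5,
            weights=[0.5, 0.5, 0.5, -1.0], bias=0.0)
loss = binary_cross_entropy(p, label=1)
```

The loss is small when the predicted score is close to the binary label and grows quickly as the prediction moves toward the wrong class.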
In summary, we build a recurrent model to learn explanation necessity. Through our comparison of different model architectures, we found that the foveal feature encoder achieves a better test AUC than the pre-trained AlexNet network, even though the foveal feature encoder loses information about unattended regions in each frame. We also learned that, for our purpose, flattening works better than the weighted sum proposed by [BDDX] in the spatial transform module.
6.4 Training Details
We split our dataset into a 70% training set, a 10% validation set, and a 20% test set. We sampled the video frames at 10 Hz. Then, we extracted training data from each video with a sliding window of 40 frames (i.e., 4 seconds). For clips whose explanation necessity score is higher than the chosen necessity threshold, we extracted a video window whose end index lies within the critical time interval (described in the Dataset section) and set the explanation label to 1. For the remaining clips, we randomly sliced a 40-frame window from the video and set the label to 0.
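The window extraction described above might look like the following sketch. The helper names, the conversion of the interval from seconds to frame indices, and the handling of clips too short for a full window are assumptions made here; the paper does not specify them.

```python
import random

FPS = 10      # frames per second after sampling
WINDOW = 40   # window length in frames, i.e. 4 seconds at 10 Hz


def positive_window(num_frames, interval_frames, rng):
    """Pick a 40-frame window whose last frame lies inside the explanation interval."""
    start_f, end_f = interval_frames
    lo = max(start_f, WINDOW - 1)      # the window must fit inside the clip
    hi = min(end_f, num_frames - 1)
    if lo > hi:
        return None                    # no full window can end inside the interval
    end = rng.randint(lo, hi)
    return (end - WINDOW + 1, end)     # inclusive frame indices


def negative_window(num_frames, rng):
    """Pick a random 40-frame window anywhere in the clip."""
    if num_frames < WINDOW:
        return None
    start = rng.randint(0, num_frames - WINDOW)
    return (start, start + WINDOW - 1)


def make_example(num_frames, interval_sec, score, threshold, rng):
    """Return ((start, end), label) for one clip, or None if no window fits."""
    if score > threshold:
        interval_frames = (int(interval_sec[0] * FPS), int(interval_sec[1] * FPS))
        window = positive_window(num_frames, interval_frames, rng)
        return (window, 1) if window else None
    window = negative_window(num_frames, rng)
    return (window, 0) if window else None


rng = random.Random(0)
# Hypothetical 10-second clip with an explanation interval from 5.0 s to 6.0 s.
positive = make_example(100, (5.0, 6.0), score=0.8, threshold=0.5, rng=rng)
negative = make_example(100, (5.0, 6.0), score=0.2, threshold=0.5, rng=rng)
```

Seeding the random generator, as done here, keeps the extracted windows reproducible across runs.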
We use the Adam optimizer with a learning rate of 1e-2 and alpha 0.999. The dropout rate is 0.7. To reduce the data imbalance, we used weighted sampling based on the explanation necessity labels. We set the batch size equal to the size of our training dataset to help the model converge faster. Training took around 1 hour for 300 epochs on an Nvidia Tesla V100 GPU with 16 GB of memory.
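One common way to implement label-based weighted sampling is to draw each example with probability inversely proportional to its class frequency. The sketch below assumes that weighting scheme, which the paper does not spell out; the function name and the label counts are hypothetical.

```python
import random
from collections import Counter


def balanced_sample(labels, k, rng):
    """Draw k indices with replacement, weighting each example by 1 / (its class count)."""
    counts = Counter(labels)
    weights = [1.0 / counts[y] for y in labels]
    return rng.choices(range(len(labels)), weights=weights, k=k)


# Made-up imbalanced label set: 10 positive and 90 negative windows.
labels = [1] * 10 + [0] * 90
indices = balanced_sample(labels, k=10000, rng=random.Random(0))
positive_fraction = sum(labels[i] for i in indices) / len(indices)
```

Under this scheme each class contributes roughly half of the drawn examples, regardless of how rare the positive windows are in the raw data.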
To evaluate classification performance, we calculated the Area Under the Receiver Operating Characteristic Curve (ROC AUC) on the test set with a scikit-learn library function. We trained our model for different explanation necessity thresholds, based on the general explanation necessity level of each video. We report the results in Table 4. As shown in the table, model performance decreases with a lower threshold.
One possible reason is that we use general explanation scores calculated from different people's ratings. From our user study results (Figure 5), we learned that for dull video scenes without any near-crash event, explanation necessity ratings tend to vary more among people, which makes it harder for the model to distinguish scenarios on which people do not agree.
|random guess||our model|
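For readers unfamiliar with the metric, ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The dependency-free sketch below computes it by direct pairwise comparison. It is an illustration only: the paper used a scikit-learn function (not named in the text; `sklearn.metrics.roc_auc_score` is one such function), and the labels and scores below are made up.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up test labels and predicted necessity scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
```

A model that scores every example identically obtains 0.5 under this definition, which is the value expected from random guessing.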
7 Discussion & Future Work
This paper investigates in depth when explanations are necessary for fully automated vehicles. Two main aspects of the results merit discussion.
We initially hypothesized that people would reach a certain level of agreement on the explanation necessity for different scenarios. However, our user study results indicate that people's opinions on explanation necessity can be opposite for a certain number of driving scenarios. Our qualitative study shows that people tend to agree more on explanation necessity for near-crash or emergency driving scenarios and less for ordinary driving situations. Our quantitative analysis found that both context and individual attributes had a significant influence on the desire for explanation. An open question is how passengers will respond if we provide too many explanations. One possibility is that overused explanations annoy passengers so much that they turn off the explanation feature completely. Another is that passengers become desensitized to explanations and then ignore the critical ones. They might also be completely comfortable with additional explanations and experience no adverse effect. The first two cases would undermine the effect of explanations. We will investigate this question in future work.
We previously assumed that explanation necessity would be highly correlated with speed changes. In other words, passengers would be more likely to ask for an explanation when the car slows down. However, our correlation analysis between explanation necessity level and scenario type shows that this is not necessarily true. In Table 2, scenarios related to stop signs have the lowest correlation with necessity, even though the car is expected to slow down whenever it approaches a stop sign. By contrast, the scenarios with a higher correlation with explanation necessity are, more or less, events that do not match the passenger's original expectation. In other words, how much a scenario deviates from the passenger's expectation might be positively related to the explanation necessity level. We plan to test this hypothesis in future work.
Together with this paper, we presented a model that can identify explanation moments in real time. To provide personalized predictions, we proposed using pre-specified explanation thresholds to capture personal attributes such as driver type. However, in this paper, we did not explicitly explore how to build the mapping from individual characteristics to the threshold. We leave this exploration for future study.
Finally, we quantified the explanation necessity level for different scenarios by collecting people's ratings directly. However, because individual criteria for explanation necessity differ, the recorded necessity scores have relatively high standard deviations, which makes it hard to argue for a general explanation necessity level for a given situation. Moreover, a linear scale for explanation necessity might be problematic, since the necessity level might lie in a high-dimensional space in which each dimension corresponds to a different factor of necessity, such as risk. In future work, we plan to explore using reinforcement learning to systematically represent explanation scores for different states based on the rewards from future events.
In this paper, we investigated in depth the necessity of explanations for autonomous vehicles. Our user experiment showed that the need for explanation depends on the specific driving scenario and on passenger identity. Along with this paper, we presented a self-driving explanation necessity dataset with first-person explanations and an associated measure of necessity for 1103 video clips, augmenting the Berkeley Deep Drive Attention dataset [BDDA]. Finally, we proposed a learning-based model that offers a personalized, real-time prediction of how necessary an explanation is for a given situation, using camera data. Our work highlights the importance of context and human elements in explainable AI for autonomous vehicles.