
A Case Study of Trust on Autonomous Driving

As autonomous vehicles increasingly benefit society, understanding the dynamic change of human trust during human-autonomous vehicle interaction can help improve the safety and performance of autonomous driving. We designed and conducted a human subjects study involving 19 participants. Each participant was asked to enter their trust level on a Likert scale in real time during experiments on a driving simulator. We also collected physiological data (e.g., heart rate, pupil size) from participants as complementary indicators of trust. We used analysis of variance (ANOVA) and Signal Temporal Logic (STL) techniques to analyze the experimental data. Our results show the influence of different factors (e.g., automation alarms, weather conditions) on trust, and the individual variability in human reaction time and trust change.


I Introduction

Autonomous vehicles have achieved a high level of autonomy with the development of various sensors and advanced driver assistance systems. Although autonomous driving requires less human involvement in vehicle operation, the trust level can affect the human’s interaction with the vehicle and determine the human’s reliance on it [Korber2018, Bailey2007Automation-inducedTrust]. Human operators tend to use the automation they trust and reject it when they do not [Pop2015IndividualAutomation]. Undertrust can lead to the neglect or under-utilization of automation, while overtrust may cause misuse of automation (e.g., delaying take-over control when human intervention is necessary) [Parasuraman1997, Lee2004TrustReliance, Lahijanian2016, Miller2016]. Therefore, it is important to understand the role of trust during human-autonomous vehicle interaction, which can help improve the safety and performance of autonomous driving.

Nevertheless, there are many challenges in gaining insights into the role of trust during autonomous driving. Trust in automation can be influenced by many factors. Intrinsically, a trustworthy autonomous driving system relies on the appropriate integrated implementation of various system components, such as whether to allow manual take-over and how the alarm is delivered. Extrinsically, the ambient environment (e.g., weather condition) and hazardous incidents (e.g., pedestrian crossing) also introduce uncertainties into the change of trust level. Furthermore, as a subjective state of mind, human trust is difficult to observe and measure. Existing work mostly uses post-experiment surveys or questionnaires to evaluate trust level [Desai2012, Rezvani2016, Koo2015]. However, such methods cannot capture the dynamic change of trust level in real time [Setter2016Trust-basedAgents, Xu2015OPTIMo:], which is important for autonomous driving (e.g., to decide timely driver intervention actions). Several recent works also use physiological data such as electroencephalogram (EEG), galvanic skin response (GSR), gaze tracking, and heart rate variability (HRV) to infer human trust and emotional state [Hu2016, RASTGOO2018], but they mostly focus on trust analysis for a group of participants. While group-level trust analysis can provide a generalized understanding of human trust, trust is influenced by disposition and past experience and therefore varies from person to person. There is a need for trust analysis of individuals.

In this paper, we present a case study that evaluates human trust on a Likert scale in real time during experiments on a driving simulator. We used analysis of variance (ANOVA) to examine the potential influence of different factors on trust, including driving mode, alarm types, and weather conditions. We also examined the corresponding influence on physiological data (e.g., heart rate, pupil size), since these can be complementary indicators of trust. Furthermore, we used Signal Temporal Logic (STL) techniques to check patterns of trust evolution over time, for example, whether human trust decreases when the vehicle automation gives false alarms, or whether trust increases when the vehicle performs well. To obtain individualized information on how trust is affected, we also used STL learning techniques to optimize the corresponding reaction time constraints for each individual.

The remainder of the paper is organized as follows: Section II summarizes the related work, Section III describes our human subjects study design and ANOVA analysis results, Section IV presents STL analysis results, and Section V draws the conclusions.

II Related Work

Factors that affect trust: Hoff et al. [Hoff2015] presented a survey of different factors (e.g., system reliability, timing of error, difficulty of error, and type of error) that can influence human operators’ trust. Studies have shown that system reliability can affect trust as well as the frequency and timing of autonomy mode switches [Desai2012]. Errors in an early stage of automation or on an easy task have a greater negative impact [Manzey2012HumanAids, Madhavan2006AutomationAids]. Rezvani et al. [Rezvani2016] demonstrated that user interfaces with internal and external awareness have different impacts on the driver’s trust. Koo et al. [Koo2015WhyPerformance] studied messages that provide different explanations of autonomous driving actions and showed that describing the reason for an action was preferred by drivers and led to better driving performance. In addition, false alarms (i.e., alarms presented when there is no event) and missing alarms (i.e., no alarm when there is an event) can also affect trust [Hoff2015, Davenport2010EffectsTask].

Biometric indicators: Hu et al. [Hu2016] built an empirical trust sensor model based on machine learning classification results with EEG and GSR signals. Costa et al. [Costa2001TrustEffectiveness] showed that human trust is correlated with stress levels, which can be measured by multiple biometric indicators. For example, heart rate (HR) and HRV metrics (i.e., the time fluctuation of heart beats) are widely used indicators of stress level [RASTGOO2018, Pereira2017HeartAssessment]. Photoplethysmogram (PPG) signals, which are driven by the heart’s pumping action, are widely used to extract HRV parameters [Elgendi2012OnSignals.]. In addition, Pedrotti et al. [Pedrotti2014] found that the pupillary response signal has good discriminating power for stress detection. The pupil diameter increases with sympathetic nervous system activity when a human is under stress [RASTGOO2018].

III Human Subjects Study

Fig. 1: Participant engaging with simulated driving environment, with GSR, PPG, and eye movement being recorded. The buttons embedded in the steering wheel are used to adjust trust level and switch mode. The GUI provides the current trust level, the vehicle speed, the alarm detected, and the driving mode.

III-A Driving Testbed Setup

We conducted human subject experiments in a high-fidelity driving simulator (Force Dynamics 401CR, shown in Figure 1), which is a four-axis motion platform that tilts and rotates to simulate the experience of being in a vehicle. The human interacts with the driving simulator through the PreScan software, which can be programmed to simulate autonomous driving scenarios (Figure 2). While driving, the participants’ physiological data (GSR, PPG, eye-tracking) are collected through the Shimmer3 GSR+ sensor and Tobii Pro-Glasses 2. All experimental data are recorded and synchronized via iMotions Biometric Platform [imotions].

Fig. 2: Scenarios; Top: Driver’s view in rainy weather; Center: Driver’s view in sunny weather; Bottom-left: Top view of the ego car; Bottom-right: Top view of the scenario, grid spacing: 100 meters.

III-B Experimental Design

We view trust as the delegation of responsibility for actions to the automation and the willingness to accept risk and uncertainty, following the definition of trust in [Lee2004TrustReliance]. Our experimental design has one primary dependent variable (i.e., trust) and three independent variables (i.e., alarm type, weather condition, and driving mode). We designed 16 driving scenarios considering the different combinations of these variables. Each driving scenario contains four hazardous events: (1) a pedestrian crossing the road, (2) an obstacle in front of the lane, (3) a slow-moving cyclist in the same lane, and (4) an oncoming truck from the opposite direction in a nearby lane. At the time of hazard detection, a high-frequency (750 Hz) auditory alarm went off to alert the driver to the upcoming hazard. We designed four types of alarm (each is explained below): AAAA, MMMM, FAAAA, and AAAFA. All conditions were counterbalanced so that participants would come across all four types of alarm and all hazardous events. Trust was evaluated on a 5-point Likert scale (5 being the most trust and 1 the least) in each condition with respect to the following independent variables:

  • Alarm type: Each driver randomly received one of the following four types of alarms in a driving scenario:

    • all four alarms were Activated (AAAA),

    • all alarms were Missing (MMMM),

    • an early False alarm (alarm activated for no incident) was triggered (FAAAA),

    • a False alarm between the third and the fourth incident was triggered (AAAFA).

  • Weather: Two weather conditions (sunny and rainy) were used in this study (see Figure 2). In sunny weather, visibility was clear, whereas in rainy weather, visibility was set at 240 meters (we assigned the precipitation density as particles per cell).

  • Driving mode: Participants were allowed to drive in either fully-autonomous or manual mode. In fully-autonomous mode, the driving system would not respond to any input from the wheel, brake, or throttle. In addition, the advanced driver assistance functions (lane keeping, forward collision avoidance, and speed maintenance) were applied as soon as the fully-autonomous mode was engaged. Since higher automation reliability results in a higher level of trust [meyer2013trust], we designed the system so that shifting the driving mode from manual to autonomous set trust to 3 (out of 5), and participants could adjust it as long as the vehicle stayed in autonomous mode. On the other hand, switching to manual driving set the trust level to 0, where it remained.

Two buttons were embedded in the steering wheel (see Figure 1), both to switch between the two driving modes and to increase or decrease the trust level. Pressing both buttons at the same time switched the driving mode, while the left and right buttons were assigned to decrease and increase trust, respectively.
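The button and trust-level rules above can be sketched as a small state machine. This is only an illustrative sketch, not the simulator's implementation: the class and constant names are hypothetical, and the starting mode, the step size of 1, and the clamping of trust to the range 1 to 5 in autonomous mode are assumptions.

```python
from dataclasses import dataclass

AUTONOMOUS, MANUAL = "autonomous", "manual"  # hypothetical mode labels


@dataclass
class TrustInterface:
    """Toy model of the two steering-wheel buttons (assumed behaviour)."""
    mode: str = MANUAL  # assumption: a trial starts in manual mode
    trust: int = 0      # trust stays at 0 while driving manually

    def press(self, left: bool, right: bool) -> None:
        if left and right:
            # Both buttons together switch the driving mode.
            if self.mode == MANUAL:
                self.mode, self.trust = AUTONOMOUS, 3  # trust is set to 3 of 5
            else:
                self.mode, self.trust = MANUAL, 0
        elif self.mode == AUTONOMOUS:
            # Trust is adjustable only in autonomous mode.
            if left:
                self.trust = max(1, self.trust - 1)  # assumption: clamp at 1
            elif right:
                self.trust = min(5, self.trust + 1)  # assumption: clamp at 5


ui = TrustInterface()
ui.press(left=True, right=True)   # switch to autonomous, trust becomes 3
ui.press(left=False, right=True)  # increase trust to 4
```

Keeping the mode switch and the trust adjustment in one handler makes it explicit that a trust press in manual mode has no effect.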

III-C Hypothesis

  • H1: Different alarm types correspond to different levels of reliability, and humans tend to trust systems with high reliability. In addition, missing alarms (which might cause a collision) have a greater negative effect on trust than false alarms. Moreover, false alarms that occur in the early stage leave a negative impression on humans and lead to lower trust levels than those in the late stage.

  • H2: Rainy weather would give humans the impression that it is harder to fulfill autonomous driving tasks and lead to lower trust.

  • H3: When humans do not trust the autonomous driving system, they tend to switch to manual driving.

III-D Experiment Procedure

Nineteen participants (mean age: 22.57 years; 63% female) were recruited from the University of Virginia. The study was approved by the Internal Review Board at the University of Virginia (IRB #20606: Cognitive Trust in Human-Autonomous Vehicle Interaction). All recruited participants were students aged 18 to 35 years. All participants were required to hold a valid driving license and have at least one year of driving experience. All participants had normal or corrected-to-normal vision.

Upon arrival at the lab, participants were instructed to read and sign an informed consent. Participants were informed that they could quit the experiment at any time without any penalty. Participants filled out a pre-experiment demographic questionnaire. Participants were instructed to sit in the driving simulator. Overhead lights were turned off. A three-minute baseline experiment was conducted to record GSR, PPG, and pupil diameter. A three-minute training trial was conducted to allow participants to get familiar with the driving system. 16 trials were conducted, with a break after 8 trials. Each participant received a $20 gift card after the experiment.

III-E Data Pre-processing

GSR is a physiological signal captured from the surface of the skin. It reflects the electrical conductivity of the skin and the arousal of the nervous response [Dawson2011TheDecision-making.]. The average GSR value and the average of the GSR peaks are significantly affected by trust and cognitive load [Khawaji2015UsingEnvironment]. We computed GSR peaks from phasic data extracted from the GSR signals using a mean filter [dawson2007electrodermal]. For each sample point, the mean GSR over the time interval [-4s; +4s] centered on the current sample was computed and subtracted from the current sample. The result is the phasic data. A lowpass filter with a cut-off frequency of 5 Hz was applied to the phasic data to reduce line noise. GSR peaks were then found in the phasic data between peak onsets and offsets.
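The mean-filter step described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the sampling rate `fs`, and the handling of the first and last 4 seconds (the window is simply truncated at the trace boundaries) are not given in the text, and the subsequent 5 Hz lowpass filter and peak detection are omitted.

```python
import numpy as np


def phasic_component(gsr, fs, half_window_s=4.0):
    """Subtract the centered moving mean ([-4 s, +4 s]) from each sample.

    gsr: 1-D sequence of raw GSR samples.
    fs:  sampling rate in Hz (hypothetical; not stated in the text).
    """
    gsr = np.asarray(gsr, dtype=float)
    n = len(gsr)
    half = int(round(half_window_s * fs))
    # A cumulative sum gives every window sum without an explicit loop.
    csum = np.concatenate(([0.0], np.cumsum(gsr)))
    idx = np.arange(n)
    lo = np.maximum(idx - half, 0)      # window start (truncated at the start)
    hi = np.minimum(idx + half + 1, n)  # window end, exclusive (truncated at the end)
    local_mean = (csum[hi] - csum[lo]) / (hi - lo)
    return gsr - local_mean


# Example on a synthetic signal: a slow drift is removed, a short bump remains.
t = np.arange(0, 60, 0.1)  # 60 s sampled at an assumed 10 Hz
raw = 2.0 + 0.01 * t
raw[300:305] += 0.5
phasic = phasic_component(raw, fs=10.0)
```

Away from the trace boundaries, a constant or linearly drifting signal maps to zero, which is the property that makes the bump stand out for later peak detection.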

Heart rate (HR) and heart rate variability (HRV) are two measures that can vary with increasing cognitive load [mehler2011comparison]. The following time-domain measures of HRV were calculated from the normal-to-normal (NN) beat-to-beat (R-R interval) variations of consecutive heartbeats [Pereira2017HeartAssessment]. Increasing HR as a result of cognitive load causes a decrease in HRV measures such as: the mean of RR intervals (RRMean), the root mean square of successive differences between consecutive NN intervals (RMSSD), the standard deviation of NN intervals (SDNN), and the ratio of the number of adjacent NN intervals differing by at least 50 ms (NN50) to the number of all NN intervals (percentage of NN50, or pNN50). From the eye tracker, we relied only on pupil size as a metric of cognitive load. Pupil size in millimeters was calculated as the average of the left and right pupil sizes.
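As an illustration of the four time-domain HRV measures named above, the following sketch computes them from a list of NN intervals in milliseconds. The function name is hypothetical, and two details are assumptions: SDNN uses the sample standard deviation, and pNN50 is expressed as a fraction of the successive differences, counting differences of at least 50 ms as in the text (many definitions use a strict "greater than 50 ms").

```python
import numpy as np


def hrv_time_domain(nn_ms):
    """Return RRMean, SDNN, RMSSD, and pNN50 for NN intervals given in ms."""
    nn = np.asarray(nn_ms, dtype=float)
    diffs = np.diff(nn)  # successive differences between adjacent NN intervals
    return {
        "RRMean": nn.mean(),
        "SDNN": nn.std(ddof=1),                    # assumption: sample SD
        "RMSSD": np.sqrt(np.mean(diffs ** 2)),
        "pNN50": np.mean(np.abs(diffs) >= 50.0),   # fraction, e.g. 0.25
    }


# Made-up intervals for illustration only (not study data).
example = hrv_time_domain([800, 810, 790, 850, 850])
```

For the made-up intervals, the successive differences are 10, -20, 60, and 0 ms, so one of four differences reaches 50 ms and pNN50 is 0.25.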

III-F ANOVA Analysis

| Variable | Mode: Fully-Auto | Mode: Semi-auto | Weather: Rainy | Weather: Sunny | Alarms: AAAA | Alarms: AAAFA | Alarms: FAAAA | Alarms: MMMM | Gender: Female | Gender: Male |
|---|---|---|---|---|---|---|---|---|---|---|
| Heart Rate: pNN50 | 0.26(0.15) | 0.26(0.16) | 0.26(0.15) | 0.26(0.15) | 0.25(0.15) | 0.27(0.16) | 0.24(0.18) | 0.24(0.15) | 0.30(0.16) | 0.22(0.13) |
| Heart Rate: RRMean | 812.98(109.86) | 803.70(115.88) | 809.12(113.9) | 807.92(112.08) | 802.13(113.7) | 815.18(110.0) | 809.29(112.33) | 806.75(116.95) | 821.74(108.28) | 789.91(116.71) |
| Heart Rate: RMSSD | 52.9(21.02) | 51.26(21.30) | 53.38(20.07) | 51.79(22.31) | 53.34(24.73) | 53.40(22) | 50.8(18.48) | 50.7(18.99) | 55.08(19.74) | 47.95(22.35) |
| Heart Rate: SDNN | 57.0(19.59) | 57.19(20.7) | 57.31(20.38) | 56.88(19.97) | 57.58(21.8) | 56.64(18.97) | 55.76(16.75) | 58.50(22.71) | 59.27(19.02) | 54.11(21.30) |
| Pupil Size | 3.87(0.39) | 4.0(0.39) | 4.05(0.39) | 3.82(0.37) | 3.9(0.41) | 3.96(0.38) | 3.9(0.37) | 3.1(0.42) | 3.5(0.41) | 3.8(0.37) |
| Peaks | 13.5(8.8) | 18.8(11.16) | 15.2(10.1) | 16.5(10.7) | 16.0(10.9) | 15.78(10.27) | 16.4(10.57) | 16.7(6.9) | 19.7(9.33) | 11.33(9.8) |
| Trust | 3.34(0.81) | 3.39(0.69) | 3.13(0.81) | 3.47(0.7) | 3.51(0.73) | 3.44(0.68) | 3.29(0.72) | 3.18(0.84) | 3.24(0.76) | 3.64(0.79) |

TABLE I: Means and standard deviations, shown as mean(SD), of dependent variables

Table I shows statistics (i.e., sample means and standard deviations) of the dependent variables to describe the effectiveness of our experiments. A 2 × 2 × 4 × 2 (mode [fully- and semi-autonomous], weather [rainy and sunny], alarm type [AAAA, MMMM, FAAAA, and AAAFA], gender [female and male]) ANOVA with user ID as a random factor was undertaken. Tukey HSD tests were used for post-hoc contrasts. A fixed significance level was used for all statistical tests, unless stated otherwise.

Results of the ANOVA showed significant main effects of gender and alarm type on average trust. The post-hoc test revealed that both the MMMM and AAAA conditions were significantly different from FAAAA and AAAFA. This supports our hypothesis H1 that the type of alarm can lead to different average trust. However, we found no statistically significant interaction between alarm type and gender. The weather condition and the driving mode did not have a significant effect on trust. We also found that participants did not switch to manual driving solely because of the weather condition. Thus, our hypotheses H2 and H3 are rejected.

Results of the ANOVA also revealed that pupil size and the number of peaks in GSR were sensitive to the effect of weather. In addition, the results indicated that HRV is significantly affected by gender, but not by weather condition or type of alarm. The latter results did not support our hypothesis H2 that rainy weather impacts trust.

Figure 3 plots the average trust under both weather conditions for all four alarm types. Average trust in sunny weather was higher, except when all alarms were missing. When the system failed to provide an adequate alarm to the driver in rainy weather, the participants' average trust dropped below that in sunny weather, which supports our hypothesis H1.

Fig. 3: The grouped box plot displays the comparison of the average trust between four alarm types under two weather conditions.

IV STL-based Analysis

IV-A Signal Temporal Logic

Signal Temporal Logic (STL) [Maler2004MonitoringSignals] is a formal specification language for expressing temporal properties over real-valued trajectories with dense-time intervals. STL is commonly used to describe desired behaviors of cyber-physical systems (e.g., automotive systems, medical devices) [bartocci2018specification]. In this paper, we use STL techniques to analyze the trust signals measured in our human subjects study, in order to identify patterns of trust evolution over time.

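The syntax of an STL formula $\varphi$ over a trace $\omega$ is defined as:

$$\varphi ::= \mu \mid \neg\varphi \mid \varphi_1 \wedge \varphi_2 \mid \Box_{[a,b]}\,\varphi \mid \Diamond_{[a,b]}\,\varphi \mid \varphi_1\,\mathcal{U}_{[a,b]}\,\varphi_2$$

where $[a,b]$ is a closed time interval and $0 \le a \le b$. The signal predicate $\mu$ is a formula of the form $f(x) > 0$, where $x$ is a signal variable and $f$ is a function from the signal domain to $\mathbb{R}$. The Boolean satisfaction of a given STL formula $\varphi$ is True if and only if the trace satisfies it (i.e., $\omega \models \varphi$). The $\Box$, $\Diamond$, and $\mathcal{U}$ operators stand for "always", "eventually", and "until", respectively. $\Box_{[a,b]}\,\varphi$ specifies that $\varphi$ holds at every time step between $a$ and $b$. Similarly, $\Diamond_{[a,b]}\,\varphi$ specifies that $\varphi$ holds at some time step between $a$ and $b$. Finally, $\varphi_1\,\mathcal{U}_{[a,b]}\,\varphi_2$ specifies that $\varphi_1$ holds at every time step before $\varphi_2$ holds, and $\varphi_2$ holds at some time step between $a$ and $b$.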
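To make the meaning of "always", "eventually", and "until" concrete, here is a minimal sketch of their Boolean semantics over a uniformly sampled signal. It is not how Breach or any other STL tool is implemented: the function names are hypothetical, interval bounds are given in sample steps rather than seconds, and windows that run past the end of the trace are simply truncated (an assumption).

```python
from typing import Callable, Sequence

Predicate = Callable[[float], bool]


def _window(n: int, t: int, a: int, b: int) -> range:
    """Sample indices in [t + a, t + b], truncated to the trace length n."""
    return range(min(t + a, n), min(t + b, n - 1) + 1)


def always(signal: Sequence[float], pred: Predicate, a: int, b: int, t: int = 0) -> bool:
    # pred holds at every step between a and b (relative to t).
    return all(pred(signal[k]) for k in _window(len(signal), t, a, b))


def eventually(signal: Sequence[float], pred: Predicate, a: int, b: int, t: int = 0) -> bool:
    # pred holds at some step between a and b (relative to t).
    return any(pred(signal[k]) for k in _window(len(signal), t, a, b))


def until(signal: Sequence[float], p1: Predicate, p2: Predicate,
          a: int, b: int, t: int = 0) -> bool:
    # p2 holds at some step k between a and b, and p1 holds at every step before k.
    for k in _window(len(signal), t, a, b):
        if p2(signal[k]) and all(p1(signal[j]) for j in range(t, k)):
            return True
    return False


# Made-up trust-change samples: trust drops once, at step 3.
trust_change = [0, 0, 0, -1, 0, 0]
dropped = eventually(trust_change, lambda v: v < 0, 0, 5)
never_rose = always(trust_change, lambda v: v <= 0, 0, 5)
```

On the made-up samples, both `dropped` and `never_rose` evaluate to True, while restricting the "eventually" window to steps 0 through 2 would give False.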

IV-B Checking STL Formulae

Our data set consists of 19 participants. Each participant has 16 trials, and each trial lasts 180 seconds. We aggregated each trial into 1800 rows with a time step of 0.1 second. To investigate the dynamics of trust, we calculated the Trust Change by subtracting the trust level of the previous second from that of the current second, for the periods in autonomous driving mode.
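The Trust Change signal described above can be sketched as a lagged difference on the 0.1-second grid. The function and argument names are hypothetical, and two choices are assumptions: the change is defined only when both the current sample and the sample one second earlier are in autonomous mode, and it is set to 0 elsewhere.

```python
import numpy as np


def trust_change(trust, autonomous, step_s=0.1, lag_s=1.0):
    """Trust at time t minus trust one second earlier, in autonomous mode only."""
    trust = np.asarray(trust, dtype=float)
    autonomous = np.asarray(autonomous, dtype=bool)
    lag = int(round(lag_s / step_s))  # 10 rows for a 1 s lag at 0.1 s steps
    if lag < 1:
        raise ValueError("lag_s must be at least one time step")
    change = np.zeros_like(trust)
    change[lag:] = trust[lag:] - trust[:-lag]
    # Assumption: require autonomous mode at both ends of the one-second lag.
    valid = np.zeros_like(autonomous)
    valid[lag:] = autonomous[lag:] & autonomous[:-lag]
    change[~valid] = 0.0
    return change


rows_per_trial = int(round(180 / 0.1))  # 1800 rows, matching the text

# Made-up trace: trust goes from 3 to 4 at row 15 while driving autonomously.
trust = [3] * 15 + [4] * 15
change = trust_change(trust, [True] * 30)
```

With a one-second lag, a single button press shows up as a non-zero Trust Change for ten consecutive rows, which is worth keeping in mind when counting trust-change events.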

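| Group | Pattern | #Trial | #Participant |
|---|---|---|---|
| No Event | Autonomous driving with no event detected for 30 seconds | 275 | 19 |
| No Event | Trust decreases at least 15 seconds after the last event and at least 14 seconds before the next event | 51 | 19 |
| No Event | Trust increases at least 15 seconds after the last event and at least 14 seconds before the next event | 86 | 19 |
| False Alarm | Early false alarm occurs | 76 | 19 |
| False Alarm | Late false alarm occurs | 75 | 19 |
| False Alarm | Trust decreases within 10 seconds after an early false alarm | 17 | 9 |
| False Alarm | Trust decreases within 10 seconds after a late false alarm | 21 | 13 |
| False Alarm | Trust increases within 10 seconds after an early false alarm | 11 | 6 |
| False Alarm | Trust increases within 10 seconds after a late false alarm | 17 | 9 |
| Missing Alarm | Missing alarm event occurs | 76 | 19 |
| Missing Alarm | Trust decreases within 10 seconds of a missing alarm event | 51 | 17 |
| Missing Alarm | Trust increases within 10 seconds of a missing alarm event | 39 | 16 |
| Missing Alarm | Activated alarm event occurs | 228 | 19 |
| Missing Alarm | Trust decreases within 10 seconds of an activated alarm event | 117 | 17 |
| Missing Alarm | Trust increases within 10 seconds of an activated alarm event | 135 | 19 |

TABLE II: STL Formulae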

Breach [Donze2010BreachSystems] is a framework designed for formal analysis and system monitoring. Given a system property expressed as an STL formula, it can detect violations of that property [Watanabe2018RuntimeVehicles]. We used Breach to check the satisfaction of the formulae. Table II lists all the STL formulae considered, the number of trials that satisfied each formula, and the number of participants who had trials that satisfied the formula. Note that one trial can satisfy multiple STL formulae. We analyze the results of checking the STL formulae as follows:

No Event. We used the first formula in the No Event group of Table II to extract the trials in which participants stayed in autonomous driving and no event was detected for 30 seconds. The second and third formulae then extract the trials in which participants decreased or increased their trust levels, respectively, at least 15 seconds after the last event (if any) and at least 14 seconds before the next event (if any). We assume that such a change of trust was not influenced by events occurring 15 seconds before or 14 seconds after. The second formula therefore captures the trials in which participants decreased trust while the vehicle was only lane keeping. The results of the second and third formulae (51 and 86 trials, respectively) demonstrate that trust is more likely to increase after a period of driving without dealing with any events.

False Alarm. The first two formulae in the False Alarm group of Table II give the number of trials in which participants encountered an early false alarm and a late false alarm, respectively. According to the experimental design in Section III-B, there were 76 trials with an early false alarm and 76 trials with a late false alarm. However, in one trial, the participant accidentally drove off the road and avoided the area where the late false alarm was designed to occur. The next two formulae capture the trials where trust decreased within 10 seconds after a false alarm, and the last two capture the trials where trust increased within 10 seconds after a false alarm. Of the early false alarms, 22.4% caused trust to decrease, while 28.0% of the late false alarms caused trust to decrease. Of the early false alarms, 14.5% caused trust to increase, while 22.7% of the late false alarms caused trust to increase. In total, 25.2% of false alarms caused trust to decrease, while 18.5% caused it to increase.
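The percentages above can be reproduced from the trial counts in the False Alarm group of Table II. The sketch below assumes the row order implied by those percentages (early and late false alarm counts, then decrease after early, decrease after late, increase after early, increase after late); the dictionary layout and names are only illustrative.

```python
# Trial counts taken from the False Alarm rows of Table II.
counts = {
    "early": {"alarms": 76, "decrease": 17, "increase": 11},
    "late": {"alarms": 75, "decrease": 21, "increase": 17},
}


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Per-stage rates: share of false-alarm trials followed by a trust change.
rates = {
    stage: {kind: pct(c[kind], c["alarms"]) for kind in ("decrease", "increase")}
    for stage, c in counts.items()
}

# Overall rates pool the early and late false alarms (76 + 75 = 151 trials).
total_alarms = sum(c["alarms"] for c in counts.values())
overall = {
    kind: pct(sum(c[kind] for c in counts.values()), total_alarms)
    for kind in ("decrease", "increase")
}

print(rates)
print(overall)
```

The overall figures are pooled over 151 trials rather than averaged across the two stages, which is why 25.2% lies between 22.4% and 28.0% without being their midpoint.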

Missing Alarm. We extracted the trials in which participants decreased and increased their trust level within 10 seconds of a missing alarm event, using the corresponding formulae in the Missing Alarm group of Table II. The results show that the probability of trust decreasing after a missing alarm event is 67.1%, while the probability of trust decreasing after an activated alarm event is 51.3%. Conversely, the probability of trust increasing after a missing alarm event is 51.3%, while the probability of trust increasing after an activated alarm event is 59.2%. In other words, events for which the alarm was not activated have a greater negative impact on trust level than those with activated alarms.
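As a worked check, the four probabilities follow from the trial counts in the Missing Alarm group of Table II, assuming the row order implied by the reported values (event count, then decrease, then increase, first for missing and then for activated alarm events):

```latex
\begin{align*}
P(\text{decrease} \mid \text{missing alarm})   &= \frac{51}{76}   \approx 67.1\% \\
P(\text{decrease} \mid \text{activated alarm}) &= \frac{117}{228} \approx 51.3\% \\
P(\text{increase} \mid \text{missing alarm})   &= \frac{39}{76}   \approx 51.3\% \\
P(\text{increase} \mid \text{activated alarm}) &= \frac{135}{228} \approx 59.2\%
\end{align*}
```

The decrease and increase proportions for the same event type sum to more than 100% because one trial can satisfy multiple STL formulae, so they are not complementary probabilities.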

IV-C Learning Individualized Parameters

The upper time bound used in the four STL formulae that check for a trust change after a false alarm is 10 seconds. We assume that, in general, participants can react and change their trust level within 10 seconds. In fact, some participants reacted faster than others. To obtain a tighter reaction timing bound for each participant, we used the Temporal Logic Extractor (TeLEx) tool [jha2017telex] to learn the optimal STL parameter from each participant’s data. TeLEx takes parametric STL formulae and each participant’s trial data as input, and outputs the learned value of the reaction timing bound for each participant.
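The idea of tightening the 10-second bound per participant can be illustrated with a deliberately simple stand-in. This is not the TeLEx algorithm or its API: the sketch just takes, for each trial, the delay from the event to the first trust change within the 10-second window, and returns the largest such delay as the smallest bound that all of those trials still satisfy. The function name and the input layout are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Trial = Tuple[float, Sequence[float]]  # (event time, times of trust changes), in seconds


def tightest_bound(trials: List[Trial], max_bound: float = 10.0) -> Optional[float]:
    """Smallest bound tau such that every reacting trial changes trust within tau."""
    delays = []
    for event_time, change_times in trials:
        # Delays of trust changes that fall inside the original window.
        in_window = [t - event_time for t in change_times
                     if 0.0 <= t - event_time <= max_bound]
        if in_window:
            delays.append(min(in_window))  # first reaction in this trial
    # Trials with no reaction inside the window do not constrain the bound.
    return max(delays) if delays else None


# Made-up trials for one participant (not study data).
trials = [(50.0, [52.5, 70.0]), (100.0, [104.2]), (30.0, [45.0])]
bound = tightest_bound(trials)
```

For the made-up trials, the first reactions come 2.5 and 4.2 seconds after the event and the third trial has no reaction within 10 seconds, so the learned bound is about 4.2 seconds.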

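| Participant | Decrease, early false alarm | Decrease, late false alarm | Increase, early false alarm | Increase, late false alarm | Mean | SD |
|---|---|---|---|---|---|---|
| A | 4.2 | 3.5 | 1.0 | 1.7 | 2.6 | 1.49 |
| B | 4.5 | 5.5 | 3.9 | 5.1 | 4.75 | 0.7 |
| C | 2.7 | 6.5 | 7.2 | 9.5 | 6.475 | 2.82 |
| D | 3.5 | 3.7 | 8.9 | 9.1 | 6.3 | 3.12 |

TABLE III: Reaction time (seconds)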

Table III shows the learned parameters for four participants. Participant A took less time than participants C and D to increase trust after both early and late false alarms. Participant B had the smallest standard deviation of reaction time, potentially due to more focused engagement in the driving. The results in Table III demonstrate that there is individual variability in human reaction time and trust change. However, the learned human reaction times may not be accurate due to the small data sample (i.e., only 25 unique trials were found to satisfy these four formulae for these four participants). Further experiments would be needed to estimate individual reaction times more accurately.
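The Mean and SD columns of Table III can be recomputed from the four learned reaction times per participant. The short check below uses the values as printed in the table and assumes the SD is the sample standard deviation, which is the convention that matches the reported numbers.

```python
import statistics

# Learned reaction times (seconds) as printed in Table III.
reaction_times = {
    "A": [4.2, 3.5, 1.0, 1.7],
    "B": [4.5, 5.5, 3.9, 5.1],
    "C": [2.7, 6.5, 7.2, 9.5],
    "D": [3.5, 3.7, 8.9, 9.1],
}

# statistics.stdev is the sample standard deviation (n - 1 in the denominator).
summary = {
    participant: (statistics.mean(values), statistics.stdev(values))
    for participant, values in reaction_times.items()
}

for participant, (mean, sd) in summary.items():
    print(f"{participant}: mean={mean:.3f} s, SD={sd:.2f} s")
```

The recomputed means match the table, and the SDs agree to within rounding (participant A comes out at about 1.50, against 1.49 in the table). Participant B's SD of 0.7 seconds is indeed the smallest of the four.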

V Discussion and Conclusion

This paper used ANOVA and STL techniques to analyze the results from a universal and a detailed perspective, respectively. ANOVA does not provide details such as whether a false alarm will cause a trust change. In addition, trust tendency varies from person to person. The ANOVA results consider the participants as a group but lack personalized information.

The STL framework, on the other hand, allows us to study the user-specific details of a driving session and to output individualized patterns. STL formulae have a mathematically succinct form and can be defined to detect satisfying or violating behavior of the driving system. With STL parameter synthesis to determine the optimal parameters, STL formulae can be generated to fit individual patterns. The structure of the formulae relies on the domain knowledge of the system designer, and more research is required to learn new knowledge about the system directly from observed data. Furthermore, the analysis of physiological data typically requires pre-processing; direct STL monitoring of physiological signals could be explored as an extension. We demonstrated that the STL learning approach can be used to infer individual reaction times. However, we would need to conduct further experiments in order to learn more accurate parameter values.

The participants in this study are mostly college students with engineering backgrounds. A population with more diversity in age and knowledge background can contribute to more generalized analysis.

In conclusion, this paper presents a case study of trust in autonomous driving. We used ANOVA and STL techniques to examine how possible factors affect changes in human trust. Our ANOVA results show that missing alarms have a significant impact on human trust, while driving mode and weather condition do not. We also performed ANOVA on physiological data as a complementary indicator of trust. It shows that pupil size and the number of peaks in GSR were sensitive to the effect of weather. The results of the STL analysis show the variability in human reaction time and trust change.