Advanced driver assistance systems (ADAS) are fundamentally changing the role of drivers by superseding them in moment-to-moment driving tasks, including steering, braking, and accelerating [merat2012preface, tsugawa2011automated]. Increasingly, automation allows drivers to take their hands off the wheel and feet off the pedals for longer periods while driving. This shifts the role of the driver to a fallback-ready user or a passive monitor rather than an active controller [seppelt2019keeping, korber2015prediction]. However, this advantage does not completely free the driver from monitoring the dynamics of the traffic environment. On the contrary, prolonged disengagement from the primary driving tasks creates great risk because the driver loses the situational awareness [banks2016keep, parasuraman2010complacency] needed to react if the automated system unexpectedly deviates from its intended functionality or fails completely [seppelt2016potential, samuel2016minimum]. Because we are likely in a several-decade-long transition toward fully autonomous vehicles on roadways, ensuring traffic safety by predisposing human drivers to sustain their visual attention is critical [li2019learning].
At present, machine recognition of human visual attention allocation while driving is an active area of research with the potential to increase context-aware safety [golestan2016situation]. However, the substantial variability in an individual's visual attention allocation has been overlooked [rasouli2018joint] or incorrectly predicted under complex environmental situations by existing machine learning algorithms [tawari2017computational, parsa2019real]. Specifically, these approaches fail to account for the complexity of situational contexts inherent in urban mobility [klingelschmitt2015managing]. From an engineering standpoint, the many unknowns of an internal system, together with the complexity of an external environment, make it inappropriate to pursue a single model to evaluate situation awareness [der2009aleatory]. Therefore, some of the models presented in the literature (e.g. bottom-up models) may not be accurate predictors of human visual attention, especially for eye gaze allocation in dynamic driving environments that involve complex tasks [rothkopf2007task]. Reinforcement learning (RL) and imitation learning offer a practical way to gain insight into the dynamics of an individual's visual decision-making process.
2 Related Work
Visual attention is the perceptual process of selecting important and salient objects in view, by which we are able to make sense of the environment during natural activities [tsotsos1995toward]. In a driving environment, a human or even an agent (i.e. an autonomous driving system) is constantly looking for hazardous elements. Once they are observed, visual attention is shifted towards them to remain aware of the new situation and hopefully avoid collisions [johnson2017closed]. More than 90% of driving-related safety-critical information is acquired through visual attention [parkes1995potential]. More notably, eye fixation on random and irrelevant objects in the environment has caused 7% of the crashes and 13% of near-crashes involving autonomous driving systems [klauer2006impact]. For those reasons, studying visual attention in driving is crucial.
A large amount of research has been dedicated to visual attention and eye behavior in psychology, neurobiology, engineering, and computer science [frintrop2010computational]. Psychologists and neurobiologists mainly focus on the mechanism in the brain that directs the eye to certain salient areas of a scene [umilta2001know]. Computer scientists then apply the findings from other disciplines to improve system performance through computational models. Computational implementations of visual attention models have been studied extensively and are classified into two categories: bottom-up and top-down [bylinskii2015towards].
Recently, several driver attention prediction models using multimodal datasets, captured from high-resolution dash-cams, have been proposed [ye2018predicting, alletto2016dr, palazzi2018predicting]. Despite this progress, because of small sample sizes and limited settings, these models are restricted to certain conditions, and how they would perform on low-resolution videos has not been explored. A visual attention allocation model should instead be closer to the nature of drivers' visual decision-making in dynamic tasks. Considering that visual attention allocation is key to our learning process [leong2017dynamic] and to how we collect rewards in dynamic environments, an RL framework can be used to explain human visual behavior [zhang2018agil]. RL has been used for modeling visual attention and has shown its superiority to bottom-up and top-down models. The authors of [mnih2014recurrent] made one of the earliest attempts to combine deep RL and visual attention to identify optimal task-related policies for selecting the next eye gaze location. In their image classification study, attention is treated as a sequential decision problem in which the next eye movement is based on the previous fixation allocation. In this study, we consider the sequence of eye movements of passive drivers as the behavioral aspect of visual attention, and their eye fixation on the hazardous object as an indication of the visual decision-making process. We identify the optimal task-relevant policies for detecting a hazardous object in rear-end collisions.
3 Framework Overview
For this study, we recruited 20 participants () from a university community. All participants were required to have a valid driver's license and at least one year of driving experience (). Eleven were categorized as novice drivers, who had been driving for less than 36 months. The remaining nine participants were categorized as experienced drivers, having held their driving license for more than 10 years. Participants were also required to have normal or corrected-to-normal vision. None of the participants had experience with eye-tracking technology. The experiment received ethical approval from our University’s Institutional Review Board.
Experimental setup: Experiments were conducted in an experiment booth with controlled lighting. The experiment was designed to maximize the accuracy of the eye tracker. The driving scenes were displayed on a -inch monitor with a pixel resolution of 2560 by 1440. Participants were seated approximately 60 cm away from the screen. Due to the sensitivity of the eye tracker, the vertical placement of the screen was adjusted such that the center of the screen was at eye-level for each participant. To control the lighting and minimize possible shadows, a Litepanels LED-daylight was used.
|Category|Feature|Type|Range|Frequency|
|---|---|---|---|---|
|Eye information|Gaze position|float|(1920x1080)|per frame|
|Eye information|Pupil size|float|(0-7)|per frame|
|Eye information|Fixation duration|categorical|(FA, MA, IA)¹|per frame|
|Road environment|Traffic density|categorical|(C/NC)²|per frame|
|Road environment|Weather condition|binary|0/1|whole video|
|Road environment|Road type|binary|0/1|whole video|
|Road environment|Day time|binary|0/1|whole video|
|Vehicle trajectory|Speed|binary|(H, L)³|per frame|

¹ FA: fully attentive; MA: moderately attentive; IA: inattentive drivers. ² C: crowded (more than 5 vehicles in the scene); NC: not crowded (fewer than 5 vehicles in the scene). ³ H: high (more than 15 mph); L: low (less than 15 mph).
Eye-tracking: Data was collected using the screen-mounted Tobii X3-120 system with a sampling rate of 120 Hz. The eye-tracker was mounted under the screen of the monitor placed in front of the participants. The system had to be calibrated for each participant using the Tobii Pro Studio animating nine calibration points. Calibration accuracy was then recorded to be within of visual angle for both axes of all participants. Following this, the sample data was sent to iMotions (a biometric research platform, 2016) which allowed collection of the synchronized physiological data. Data related to the eye tracking can be found in Table 1.
Since the future role of drivers in semi-autonomous vehicles will be shifted to a fallback-ready role, this study was designed to replicate and evaluate these conditions. Therefore, participants were first familiarized with the protocol and asked to imagine themselves to be the driver of a semi-autonomous vehicle who is required to visually perceive the hazard cues and be aware of the surroundings in order to respond as required. They were then given the supervisory role as they watched real-world rear-end collision videos while we captured their visual attention allocation.
A set of 21 real-world dash-cam forward collision videos, on average thirty seconds long, was chosen from the second Strategic Highway Research Program (SHRP2) Naturalistic Driving Study (NDS) [dingus2015naturalistic]. The videos were captured using a camera mounted on the dashboard behind the windshield and consisted of crash or near-crash incidents. We paid special attention to video quality in order to show collisions in various environmental contexts (e.g. sunny and rainy), traffic conditions (e.g. traffic density), and road types (e.g. highway vs. town). Details of the dash-cam videos can be found in Table 1. Before data collection, participants practiced their tasks on two videos to become acquainted with the experimental procedures and purposes. The study was conducted in two sessions; the first session had eleven videos and the second session had ten videos. Each participant took a five- to ten-minute break between the sessions. The data collection process took approximately half an hour from beginning to end.
4 Rear End Collision Perception
A rear-end collision can happen if the driver is not fully aware of the situation and misses important hazards. In a direct, unoccluded driver's view, the brake lights of the lead car are among the brightest objects in the scene and are red in color, so in our study they are considered the main hazardous stimuli of a forward collision [wolfe2019detection]. When observing the actions of the lead car (indicated by brake lights), drivers are supposed to first perceive the current state of driving by relying on their sensory inputs and then take a set of appropriate actions [o2008vehicle]. This is a dynamic visual decision-making process whereby drivers with proper situation awareness allocate their attention to hazardous objects. For modeling visual attention allocation, therefore, we consider the brake lights of the target car as the main hazard stimuli that the participants must detect and pay proper attention to before the crash occurs. The time window from the first appearance of the brake lights to one second before the accident is the time-to-collision (TTC; because the videos are the core of our modeling, TTC is measured in frames to collision instead) [vogelpohl2019asleep]. TTC is crucial for any strategy that the automated system needs to take (e.g. sending take-over alarms). The proposed models ascertain the important environmental features (extracted from the videos) and visual features (extracted from the eyes) needed for properly detecting the location of the brake lights and allocating enough visual attention during TTC. The following section explains the procedure for extracting these features.
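The frame-based TTC window described above can be illustrated with a minimal sketch. The function names, the frame indices in the usage note, and the default rate of 15 frames per second (taken from the observation rate reported for the extracted data) are assumptions made for illustration, not the authors' implementation.

```python
def ttc_window(brake_onset_frame, collision_frame, fps=15, margin_s=1.0):
    """Return (start, end) frame indices of the TTC window, end exclusive.

    The window runs from the first frame in which the brake lights appear
    to `margin_s` seconds before the collision frame.
    fps=15 is an assumption based on 15 observations per second.
    """
    end = collision_frame - int(round(margin_s * fps))
    if end <= brake_onset_frame:
        raise ValueError("brake lights appear too late for a TTC window")
    return brake_onset_frame, end


def frames_to_collision(frame, collision_frame):
    """Number of frames remaining until the collision (never negative)."""
    return max(collision_frame - frame, 0)
```

For example, with hypothetical brake-light onset at frame 120 and a collision at frame 300, `ttc_window(120, 300)` covers frames 120 up to (but not including) 285.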
5 Problem Formulation
We cast the problem of hazardous object detection as a Markov Decision Process (MDP), since this setting provides a formal framework to model visual attention allocation as part of a control process in which a human, or agent, actively chooses task-relevant information from the environment to make a sequence of decisions and achieve its goals. In our formulation, the environment is a frame of the videos, in which the agent (a participant) detects a hazardous object using a set of actions during TTC, as shown in Fig. 1. The goal of the agent is to accurately and effectively detect the lead car brake light during TTC.
5.0.1 Markov Decision Process
A Markov Decision Process (MDP) is a mathematical framework for solving stochastic decision problems that models the interaction between an agent and its environment [minkoff1993markov]. The agent is a learner (decision maker) that interacts with the environment, receives a reward at each time step, and exerts an action on the environment that may change its future states [sutton1998introduction, sutton2018reinforcement]. A typical MDP is represented by a 5-tuple $(S, A, P, \gamma, R)$, where $S$ is a (finite) set of possible states that represents a dynamic environment, $A$ is a (finite) set of available actions that the agent can select at a given state, $P$ is the state transition probability matrix that gives the probability of the system transitioning between every pair of states, $\gamma$ is the discount rate that guarantees the convergence of total returns, and $R$ is the reward function that specifies the reward gained at a specific state by taking a certain action [sutton2018reinforcement].
In general, the goal of an MDP is to find an optimal policy $\pi^{*}$ for the agent, where a policy $\pi$ specifies the action $a = \pi(s)$ to take at the current state $s$. The optimal policy maximizes the expected return, which is the cumulative discounted reward over an infinite horizon: $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, \pi(s_t))\right]$, where $\gamma$ is the discount rate and the term $R(s_t, \pi(s_t))$ represents the reward that the agent receives by taking the action determined by policy $\pi$ at the present state $s_t$.
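As a small worked illustration of the cumulative discounted reward, the sketch below evaluates the return of a finite reward sequence. It truncates the infinite horizon to the rewards provided; the function name and the example values are illustrative assumptions.

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative discounted reward: sum over t of gamma**t * rewards[t].

    Computed backwards so each step is g = r + gamma * g.
    The infinite horizon is truncated to the given reward sequence.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With three unit rewards and a discount rate of 0.5, the return is 1 + 0.5 + 0.25 = 1.75, which shows how rewards further in the future contribute less.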
We extract a total of 450 observations per participant, per video (15 observations per second). Each observation contains the environmental and visual features (see section LABEL:sec:pipeline), the size and location of the hazardous object if it appears in that frame, the TTC, and additional features related to the dynamics of the environment, such as the light (day/night), road type (town/highway), and weather (sunny/rainy) (see Table 1). It is worth mentioning that the location and size of the hazardous object change from frame to frame, as shown in Fig. 2.
Eye gaze position is represented as a cell number in a grid-based space over each frame. The agent can use the post-decision state approach to move to the next state by taking any action from the action set [schmid2012solving]. Each action changes the location of the gaze for the next step. It should be noted that the environment (a dash-cam video scene) changes dynamically; therefore, each cell can be visited more than once. In this study, the agent does not follow an inhibition-of-return (IOR) mechanism, in which the currently attended region is prevented from being attended again [itti2001computational].
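The grid-based gaze representation can be sketched as follows. The 1920x1080 frame size follows the gaze position range in Table 1; the 16x9 grid and the function name are hypothetical choices for illustration, since the grid resolution is not stated here.

```python
def gaze_to_cell(x, y, width=1920, height=1080, n_cols=16, n_rows=9):
    """Map a gaze position in pixels to a row-major grid cell number.

    The 16x9 grid is an assumed resolution, not a value from the study.
    Gaze samples outside the frame are clamped to the nearest border cell.
    """
    x = min(max(x, 0), width - 1)
    y = min(max(y, 0), height - 1)
    col = min(int(x / width * n_cols), n_cols - 1)
    row = min(int(y / height * n_rows), n_rows - 1)
    return row * n_cols + col
```

Because no inhibition-of-return mechanism is applied, consecutive gaze samples may map to the same cell number, and a cell may reappear later in the sequence.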
5.0.4 Reward function
For this work, we use a linear combination of the features to represent the reward function: $R(s, a) = w^{\top} \phi(s, a)$, where $w$ is the weight vector and $\phi(s, a)$ is the feature vector, in which each element is a single feature point in the state-action space. Feature points are binary (a certain argument exists or not) or categorical.
|Feature|Type|Selection method|
|---|---|---|
|Hazardous state|0/1|RF, SVM,|
|Traffic density|categorical|RF, SVM,|
|Time-to-first-fixation (TTFF)|number|RF, SVM,|
|Time-to-collision (TTC)|number|RF, SVM,|
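A linear reward over the features listed above can be sketched in a few lines. The feature ordering, the function name, and the weights in the usage note are hypothetical and chosen only to show the computation; they are not the learned values.

```python
# Assumed feature order, following the feature table above.
FEATURES = ("hazardous_state", "traffic_density", "ttff", "ttc")


def linear_reward(weights, features):
    """Reward as a weighted sum of feature values: R = w . phi."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * f for w, f in zip(weights, features))
```

For instance, with illustrative weights `[1.0, 0.0, -0.5, 0.0]`, a state-action pair with features `[1, 1, 2, 3]` receives a reward of 1.0 - 1.0 = 0.0: gazing at the hazard is rewarded, while a longer time to first fixation is penalized.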
Here, we introduce the RL and IRL frameworks, in which the agent interacts with a stochastic environment modeled as an MDP. The goal in an RL problem is to find a policy that maximizes the expected cumulative reward.
6.1 Reinforcement Learning
Reinforcement learning is about learning what actions to take in order to maximize a numerical reward. The agent explores the set of actions to find those that lead to the highest rewards. By assigning small negative rewards for every action and large positive rewards for successful task completion, the agent can be trained to act in a task-oriented and efficient manner without being provided with explicit examples of ideal behavior or instructions on how a given task needs to be completed [sutton1998introduction, sutton2018reinforcement]. The emphasis of RL is on learning by an agent that directly interacts with its surrounding environment, without requiring perfect supervision or complete models of the environment [sutton2018reinforcement, ng2000algorithms, littman2015reinforcement, mathe2016reinforcement]. Therefore, RL approaches lighten the strong dependency on environment models and dynamics [makantasis2019deep, wang2014exploration].
Driving is a multi-agent interaction problem; in our case, we focus on a rear-end collision, which is an interaction with other cars on the road. This type of interaction depends on behaviors of other drivers so it is full of uncertainty. It has even been argued that drivers employ some sort of online reinforcement learning to understand the behavior of other drivers [sallab2017deep].
In this study, we capture the pattern of visual attention allocation of FA participants and use it to quantify the probability of missing the hazardous object in any frame of the videos. The goal is to replicate the eye fixation behaviors of the FA participants using RL. The MDP model of visual attention allocation is used to obtain such a pattern. We use Q-learning, which learns the optimal policy from the agent’s experiences.
6.1.1 Q-learning in visual attention allocation sequence
Q-learning is a simple way for the agent to learn how to act optimally in an MDP formulation. It works by successively improving its evaluations of the quality of particular actions at particular states [watkins1992q]. The state-action value $Q^{\pi}(s, a)$ emphasizes the value of the first action chosen at the current state: it is the expected discounted cumulative reward obtained by starting at state $s$, taking action $a$, and following policy $\pi$ afterwards. The values of two sequential states of the MDP are related and satisfy the Bellman equation: $Q^{\pi}(s, a) = \mathbb{E}\left[R(s, a) + \gamma\, Q^{\pi}(s', \pi(s'))\right]$, where $s'$ is the state that follows $s$ after taking action $a$.
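The Q-learning update can be sketched in tabular form. This is a deliberately simplified stand-in that uses a lookup table rather than a neural network; the function name, the learning rate `alpha`, and the default values are assumptions for illustration.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """One tabular Q-learning step on a dict keyed by (state, action).

    Moves Q(s, a) towards the target r + gamma * max_a' Q(s', a').
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Table of state-action values; unseen pairs default to 0.0.
q_table = defaultdict(float)
```

Here a state could be a gaze cell number and `actions` the indices of the available gaze moves; repeated updates over recorded transitions gradually propagate the reward at the hazardous object back to earlier gaze positions.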
We use a Deep Q-Network (DQN), which employs a deep neural network to approximate the action-value function of the agent within Q-learning. As shown in Fig. LABEL:fig:IRL-RL-a, the DQN takes the state representation as input and outputs the values of seven actions. A DQN incorporates a replay memory to collect varied experiences and learn from them in the long run; the tuples of observed experiences are therefore stored in a replay memory. The policy followed in DQN is $\epsilon$-greedy, which gradually shifts from exploration to exploitation according to the value of $\epsilon$. During exploration, the agent is allowed to choose one random action from the set of positive actions (or any action if there are no positive actions). During exploitation, the agent selects actions greedily based on the learned policy and learns from its successes and failures. Exploration therefore does not proceed with purely random actions. In our case, a guided-exploration strategy [abbeel2004apprenticeship] is used, which is based on demonstrations by experts (all of the FA participants) to the agent, since the environment knows the ground truth of the detected hazardous object and the reward function is computed as the linear function.
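The guided exploration rule can be illustrated with a short sketch. The function names, the linear decay schedule, and its default parameters are assumptions; the text only states that the policy shifts gradually from exploration to exploitation.

```python
import random


def select_action(q_values, positive_actions, epsilon, rng=random):
    """Epsilon-greedy choice with guided exploration.

    With probability epsilon, pick a random action from positive_actions
    (or from all actions if that set is empty); otherwise pick the action
    with the highest estimated value.
    """
    n_actions = len(q_values)
    if rng.random() < epsilon:
        candidates = list(positive_actions) or list(range(n_actions))
        return rng.choice(candidates)
    return max(range(n_actions), key=lambda a: q_values[a])


def linear_epsilon(step, start=1.0, end=0.1, decay_steps=1000):
    """Assumed linear schedule that decays epsilon from start to end."""
    frac = min(step / decay_steps, 1.0)
    return end + (1.0 - frac) * (start - end)
```

In this sketch, `positive_actions` would hold the indices of actions that yield a positive reward in the current frame, which the environment can determine because it knows the ground-truth location of the hazardous object.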
6.2 Inverse Reinforcement Learning
So far, we have learned the visual attention reward function in RL by extracting important features with RF and SVM algorithms for a desired eye fixation policy (obtained from FA participants; see section 5.0.4 for further details). The approach in section 6.1 is not convenient to use if the prior knowledge of the reward function is insufficient. In that case, one way to approach the problem is to learn the optimal visual attention allocation policy from demonstrations performed by FA participants. This approach is called the Inverse Reinforcement Learning (IRL) framework [zhifei2012survey, ziebart2008maximum].
Abbeel and Ng motivated the use of IRL for learning driving styles [abbeel2004apprenticeship]. They pointed out that drivers typically trade-off many different factors such as avoiding collisions, monitoring traffic rules, and minimizing the risk of unexpected events, all of which we would have to weigh when specifying a reward function [sun2018probabilistic]. Driver behavior is determined by a large number of parameters, and manually defining them involves tedious tuning by motion planning experts [rosbach2019driving, kuderer2015learning, sharifzadeh2016learning].
We apply a “learning from demonstration” approach that allows the participants to simply demonstrate their preferred visual attention allocation sequence. We explain the maximum entropy principle, which can be used to recover the unknown reward function. The recovered reward function can then be used to learn the eye fixation policy from the given demonstrations, which are the histories of the agent’s behaviors consisting of past states and actions.
6.2.1 Maximum Entropy
In the basic maximum entropy formulation, one is given a set of samples from a target distribution as well as a set of constraints on this distribution, and one then estimates the distribution using maximum entropy with respect to the constraints [ziebart2008maximum, wulfmeier2015maximum]. The maximum entropy principle finds the distribution that satisfies the constraints with the largest remaining uncertainty. In inverse reinforcement learning problems, however, one is given a number of time histories of the agent’s behaviors consisting of past states and actions, called demonstrations [sun2018probabilistic]. The maximum entropy principle was applied to inverse reinforcement learning problems by Ziebart [ziebart2008maximum] for the case where the reward function depends only on the current state and is formulated as a linear combination of feature functions, $R(s) = w^{\top} \phi(s)$, where $w$ and $\phi(s)$ are the weight and feature vectors, respectively. Note that in Ziebart's work, the feature vector is only a function of the state $s$, and the actions are not considered.
IRL, when used with maximum entropy, generally needs knowledge of the state transition model $P$. Thus, we employ Q-learning to obtain insight into the visitation count of each state-action-state sequence (e.g. to calculate the probability of each possible outcome of a state transition). However, because we use single-step maximum entropy, we do not need to know the state transition model explicitly. You’s work provides further information about this process [you2019advanced].
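To make the single-step maximum entropy idea concrete, the following is a minimal sketch under simplifying assumptions: each demonstration is one decision among a small set of actions, the reward is linear in state-action features, and the weights are fitted by plain gradient ascent. All names, the data layout, and the hyperparameters are hypothetical, and this is not the authors' implementation.

```python
import math


def softmax_policy(w, feats_per_action):
    """Action probabilities proportional to exp(w . phi(s, a))."""
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in feats_per_action]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def maxent_gradient(w, demos):
    """Average of (demonstrated features - expected features under the policy).

    demos: list of (feats_per_action, chosen_action_index) pairs.
    """
    grad = [0.0] * len(w)
    for feats, chosen in demos:
        probs = softmax_policy(w, feats)
        for k in range(len(w)):
            expected = sum(p * f[k] for p, f in zip(probs, feats))
            grad[k] += feats[chosen][k] - expected
    return [g / len(demos) for g in grad]


def fit_weights(demos, n_features, lr=0.5, steps=200):
    """Gradient ascent on the demonstration log-likelihood."""
    w = [0.0] * n_features
    for _ in range(steps):
        grad = maxent_gradient(w, demos)
        w = [wi + lr * gi for wi, gi in zip(w, grad)]
    return w
```

The gradient vanishes when the feature expectations of the learned policy match those of the demonstrations, which is the matching condition that the maximum entropy principle imposes.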
We formulate the visual attention allocation sequence of passive drivers with both the RL and IRL frameworks and present the results of our analysis below. All implementation code can be found on GitHub (https://github.com/soniabaee/eyeCar).
7.1 Visual Attention Allocation
Since our RL and IRL frameworks were built upon the performance of FA participants, we focus only on them and name the eye fixation pattern of the experienced FA participants “FA: experienced” and the pattern of the novice FA participants “FA: novice”. These patterns are highly correlated with the driving experience of FA participants (see Fig. 4, left plot). We ran a factorial repeated-measures ANOVA on two visual features of participants, the fixation duration and TTFF, using driving experience (“experienced” vs “novice”) as a factor. As a manipulation check, the experienced participants had a significantly longer fixation duration on the brake lights than novice participants (). The analysis also shows that experienced drivers had a significantly lower TTFF on the brake lights than novice drivers ().
The overall characteristics of each eye fixation pattern are visualized in Fig. 4, right plot (to enhance the understanding of these patterns, the inattentive drivers’ eye fixation pattern is also visualized). The novice drivers have fewer eye fixation points with short durations during TTC, but they frequently detect the hazardous object. In contrast, the experienced drivers have more eye fixation points with longer durations during TTC, but they did not detect the hazardous object as frequently as the novice drivers. These results are supported by the findings in [underwood2007visual] and [rasouli2018joint]. Moreover, the experienced drivers found the hazardous object during TTC much faster than the novice drivers, as indicated by their shorter TTFF.
Based on these distinguishable patterns, we derive an eye fixation policy for each of them using the RL framework. These policies were found using the learned features of the reward function in section 5.0.4. Table 3 shows the weight values of the corresponding features for each of the policies. For the FA:novice pattern, the desired visual attention allocation to be learned can be described as follows: 1) the drivers try to find the target vehicle brake light as soon as they notice the existence of the hazardous object, 2) the drivers actively look for the target vehicle brake light, 3) the detected hazardous object enters the spectrum of the drivers’ awareness during TTC.
For the FA:experienced pattern, the desired visual attention allocation to be learned can be described as follows: 1) the drivers detect the lead car brake light at the earliest opportunity, mainly when the hazardous object appears, 2) the drivers direct their attention to other objects in the scene, 3) the drivers switch their eyes over to the hazardous object infrequently during TTC.
The desired eye fixation pattern depends on the initial state of the drivers and their awareness of the environmental dynamics (understanding of the context). As shown in the FA:novice and FA:experienced patterns, early detection of the hazardous object can often help avoid a collision, whereas later detection could increase the likelihood of a crash. Therefore, TTFF could be a good indicator of a potential collision and an important feature in a reward function, acting as a punishment. Additionally, the target vehicle brake light location is the ideal location for a driver to look at during TTC to better understand the driving situation. Thus, the location of the hazardous object has a greater reward than other locations in the environment, which encourages correct responses.
|Time to First Fixation||-0.576||-0.754|
Due to the growing number of vehicles and promising advancements in assistance systems, the cooperation of human drivers and automated systems is considered a key emerging technology for enhancing road safety and the driving experience. The purpose of this study was to design a framework whereby not only can the active engagement of drivers in the perception and assessment of hazards be measured, but their level of attentiveness can also be compared with that of the most attentive drivers. The proposed frameworks identify how the visual attention allocation process of a passive driver who does an excellent job of performing a supervisory task can be learned. This is the first study, to the best of our knowledge, to examine the association between the eye fixation patterns of passive drivers and a hazardous object by using RL and IRL. To that end, we replicated the situation where a passive driver has a fallback-ready role by presenting real-world rear-end collision videos to participants. The sequence of eye movements was used as the behavioral aspect of visual attention, and the eye fixation duration on an area of interest (the brake lights of the lead car) was used as an indication of visual decision making.
Furthermore, environmental and visual features allowed us to identify the distinct patterns of eye fixation in rear-end collisions. The results suggested that the order of attention, the average fixation duration, and the amount of time that it takes a participant to look at the hazardous object are crucial features for properly detecting and paying attention to a hazardous object. Among attentive participants, we found two different patterns that are highly correlated with driving experience. Experienced drivers have more frequent eye fixations during TTC and longer fixation durations on the hazardous object, whereas novice drivers have fewer and shorter fixations on the hazardous object during TTC. Experienced drivers were more attentive to the situational dynamics and were able to identify potentially hazardous objects before any collisions occurred. These results are consistent with the findings of [underwood2007visual] and [rasouli2018joint]. To jointly recover novice and experienced driver policies and reward functions, we implemented a model-free IRL method based on the maximum entropy principle. The results showed that the learned features in the RL framework are accurately mapped to the reward function from IRL (FA:experienced, 96%; FA:novice, 99%).
The primary purpose of this study was to develop an initial framework for the visual attention allocation of drivers in semi-autonomous vehicles. We used a stochastic Markov Decision Process to model driver eye fixation behavior and achieve the desired visual attention allocation behaviors using both the RL and IRL frameworks. By designing the driver’s reward function, we can reproduce typical eye behavior using the DQN algorithm to learn the corresponding optimal policies. To recover the policy and the reward function from data, we use model-free IRL based on the maximum entropy principle. Our research reveals several cues that attentive drivers seem to rely upon to anticipate hazardous objects, and the analytical tools developed could lead to more extensive studies under more realistic driving conditions and to further work in this area.