A Description of a Subtask Dataset with Glances

02/07/2019 ∙ by B. D. Sawyer, et al. ∙ MIT

This paper describes a set of data made available that contains detailed subtask coding of interactions with several production vehicle human machine interfaces (HMIs) on open roadways, along with accompanying eyeglance data.





B. D. Sawyer1, Sean Seaman2, Linda Angell2, Jon Dobres1, Bruce Mehler1, Bryan Reimer1

1 AgeLab, Massachusetts Institute of Technology, 2 Touchstone Evaluations, Inc.

Author Note

Address correspondence to Dr. Ben D. Sawyer; bsawyer@mit.edu or sawyer@inhumanfactors.com




The purpose of this paper is to describe a dataset prepared specifically for understanding the relationship between the modality demands of subtasks and subtask- and task-level glance behavior. These data are drawn from several studies conducted by MIT’s AgeLab (see Method, below) on open roads in production vehicles. The data were originally manually dual-coded and mediated by video analysts for glance behavior; they were then coded a second time by video analysts for subtask behavior (and additional glance behavior, when needed).


The studies included in this dataset were conducted on Boston-area highways in six different production vehicles that represented modern HMIs with combinations of visual-manual and voice-based infotainment tasks. These studies are described in depth in Mehler et al. (2016), Reimer et al. (2015), Mehler et al. (2014a), Mehler et al. (2014b), Mehler et al. (2015a), and Mehler et al. (2015b), although in the shared dataset the correspondence between specific study publication and study code has been made arbitrary to protect the privacy of participants. Across all studies, participants ranged in age from 20 to 69 years and were evenly split between male and female (49.4% of participants were female). Also, across all studies, participants possessed valid driver’s licenses and were naïve as to the purpose of the studies for which they were recruited. Each study’s primary purpose was to evaluate the visual demand associated with HMI interaction. To assess visual demand, videos were taken of participants’ faces as they performed tasks, as well as of each vehicle’s center stack. Audio recordings of HMI interactions were also made.

Glance analysts used the recordings of participants’ faces to manually code the location at which each participant was looking at any given time. Glances and transitions were coded following ISO 15007-1 (2014), with each glance subtending the duration from the first identified frame of the leading transition to the new glance location, through the fixation at the new location, and ending immediately before the first frame indicating the transition to the next location. The locations included in this dataset are described below in Table 2. Tasks with video that could not be classified were excluded from this dataset.

Subtask coding began with the review of 10 participants’ tasks, using a combination of face video, HMI video, and audio recording, in order to create a descriptive script of the typical interaction between system and operator over the course of the task. This script was then subdivided into subtasks, according to the antiphony framework (Sawyer, Mehler, & Reimer, 2018). For each participant, the onset and offset of each subtask were coded, spanning the first frame indicating movement toward an interaction through the completion of that interaction. Notably, this period often extended beyond the commonly used time window (e.g., NHTSA, 2013) between a clear cue to begin and a clear cue that the participant is done. Because of this dataset’s focus on capturing all glance behavior associated with the subtasks used to complete a task, glance coding was done on the entire period subtended between the first subtask and the last subtask, rather than the more common recording period between a clear begin cue and a done cue. Thus, subtask and glance coding covered entirely overlapping periods, and after aligning the two datasets, each subtask had associated glance behavior. Tasks where entirely overlapping data were not available were excluded (< 1% of available data).
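As an illustration of how glance intervals might be associated with a subtask window once the two codings share a timeline, the following is a minimal sketch. The function name, the tuple layout, and the choice to clip each glance to the subtask boundaries are assumptions made for illustration; they are not taken from the authors' coding procedure.

```python
from typing import List, Tuple

# Hypothetical layout: (start_seconds, end_seconds) on a shared timeline.
Interval = Tuple[float, float]


def glances_for_subtask(subtask: Interval, glances: List[Interval]) -> List[Interval]:
    """Return the portion of each glance that overlaps the subtask window."""
    s_start, s_end = subtask
    clipped = []
    for g_start, g_end in glances:
        start = max(s_start, g_start)
        end = min(s_end, g_end)
        if start < end:  # keep only glances with a non-empty overlap
            clipped.append((start, end))
    return clipped


# Invented example values: one subtask and four consecutive glances.
subtask = (2.0, 5.0)
glances = [(0.0, 2.5), (2.5, 4.0), (4.0, 6.0), (6.0, 7.0)]
print(glances_for_subtask(subtask, glances))
```

In this sketch a glance already in progress when the subtask begins is kept and clipped to the subtask onset, so it becomes the first glance associated with that subtask.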

Each participant had their own unique path through the interface, in terms of subtasks. This was in part due to errors and subsequent necessary corrections, but non-erroneous unique paths were found especially within voice interfaces, where multiple paths often existed to achieve a given task. We also encountered challenges with manual tasks. An individual button press, especially one performed within a sequence of other button presses, may be so brief that it falls within a single glance, rather than spanning one or more glances. To simplify this issue, we grouped multiple sequential button presses into ‘operation groups’, signified by the same metadata as any other touch operation. This approach was initially developed to handle phone numbers of varying lengths, but soon proved a useful tool for handling a variety of similar situations.
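The idea of an operation group can be sketched as merging button presses that follow one another closely in time. The text does not state the grouping criterion the analysts used, so the `max_gap` threshold, the function name, and the interval layout below are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical layout: (onset_seconds, offset_seconds) of one button press.
Press = Tuple[float, float]


def group_presses(presses: List[Press], max_gap: float = 0.5) -> List[Press]:
    """Merge consecutive presses separated by at most max_gap seconds.

    max_gap is an assumed illustrative threshold, not a value from the paper.
    """
    groups: List[Press] = []
    for onset, offset in sorted(presses):
        if groups and onset - groups[-1][1] <= max_gap:
            # Extend the current group to cover this press.
            groups[-1] = (groups[-1][0], max(groups[-1][1], offset))
        else:
            groups.append((onset, offset))
    return groups


# Invented example: three quick presses followed by one later press.
print(group_presses([(0.0, 0.2), (0.4, 0.6), (0.8, 1.0), (3.0, 3.2)]))
```

Each resulting group would then be treated as a single touch subtask, long enough for glances to fall inside it.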

Each subtask (or step) of a task was characterized by the modalities of attentional resources that it required of the driver. These included touch (for manual interactions), hear (for auditory cues and messages provided by the HMI), speak (for voice-based interactions), and vision. However, the interface demand placed on the human’s visual resources was handled differently in order to prevent circularity and overlap with the measures that the model was seeking to predict. The interface demand on the human’s visual resources was conceptualized as display monitoring resources, and was coded in terms of what the vehicle system made available for display (that is, visually presented task-relevant information). Thus, touch, hear, speak, and display monitoring (T,H,S,D) resources were coded for each subtask. Touch, hear, and speak were coded in a binary fashion: a 0 indicated the absence of the demand for a particular subtask, while a 1 indicated its presence. The display demand modality was different; this score represented the number of discrete displays containing task-relevant visual information, and ranged from 0 (no relevant displays) to 2 (2 relevant displays). As such, a vector representation was available for subtasks subtending classical auditory-vocal interactions (0010), visual-manual interactions (1001), and mixed-mode interactions (1010), as well as other combinations of interface requirement. Any time before the infotainment system or the driver engaged in a subtask was coded as latency, per the antiphony cycle, and was represented with both hear and display monitoring requirements (0101). This represented an assumption that many interfaces are interruptible, or may enter confirmation modes, or even self-terminate, and that the driver must monitor for cues to such changes in the structure of the task. These canonical demands for each subtask of a task were only included if latency was observed in the video.
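The four-digit codes quoted above follow the T, H, S, D order. A minimal sketch of building such a code from the individual demand values is given below; the function name and the validation are illustrative assumptions, while the value ranges (0 or 1 for touch, hear, and speak; 0 to 2 for display) come from the description above.

```python
def modality_vector(touch: int, hear: int, speak: int, display: int) -> str:
    """Concatenate the four demand codes in T, H, S, D order."""
    for name, value in (("touch", touch), ("hear", hear), ("speak", speak)):
        if value not in (0, 1):
            raise ValueError(f"{name} must be 0 or 1, got {value}")
    if display not in (0, 1, 2):
        raise ValueError(f"display must be 0, 1, or 2, got {display}")
    return f"{touch}{hear}{speak}{display}"


print(modality_vector(touch=1, hear=0, speak=0, display=1))  # visual-manual
print(modality_vector(touch=0, hear=1, speak=0, display=1))  # latency
```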


The dataset is made available in JSON format (see Introducing JSON, (n.d.) at https://www.json.org/ for a description). Each row contains data about one subtask for one trial of a task, organized by participant, study, and, when more than one vehicle is present in a study, by vehicle. Fields associated with each row are described in Table 1. The first glance indicated for each subtask is the location to which the participant was looking when the subtask was coded as beginning.

Code | Type | Definition
Study | String | Uniquely (but arbitrarily) identifies each study
Participant | Integer | Uniquely (but arbitrarily) identifies each participant within the dataset
TaskCode | String | Uniquely identifies each task within a study
Trial | Integer | Ordinal identifier of each repetition of each task (this value is typically 1, but occasionally more than 1 trial is present)
StartTime | Decimal | Start time of subtask, in seconds; useful for verifying temporal order of subtasks within a task
Vehicle | String | Uniquely (but arbitrarily) identifies each vehicle
Gender | String | Gender of participant (M or F)
Age | Integer | Age of participant, in years
Display | Integer | Display metadata code
Touch | Integer | Touch metadata code
Hear | Integer | Hear metadata code
Speak | Integer | Speak metadata code
Latency | Integer | Latency metadata code
SubtaskDuration | Decimal | Length of trial of subtask, in seconds
GlanceStartSeconds | Array of decimals | Start time of each glance away from the road
GlanceEndSeconds | Array of decimals | End time of each glance away from the road
GlanceLocation | Array of strings | Locations of each glance (see Table 2)
Table 1: Fields associated with each row of the dataset.
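To make the record layout concrete, the sketch below parses two records and restores their temporal order using StartTime. The field names follow Table 1, but every value in the sample is invented for illustration, and the flat list-of-records layout is an assumption about how the JSON is organized; the distributed file may nest records by participant, study, or vehicle.

```python
import json
from typing import Any, Dict, List

# Hypothetical sample: values are invented, and the flat list layout is assumed.
SAMPLE = """
[
  {"Study": "A", "Participant": 1, "TaskCode": "T1", "Trial": 1,
   "StartTime": 15.7, "Vehicle": "V1", "Gender": "F", "Age": 34,
   "Display": 0, "Touch": 0, "Hear": 0, "Speak": 1, "Latency": 0,
   "SubtaskDuration": 2.1,
   "GlanceStartSeconds": [15.7], "GlanceEndSeconds": [17.8],
   "GlanceLocation": ["road"]},
  {"Study": "A", "Participant": 1, "TaskCode": "T1", "Trial": 1,
   "StartTime": 12.5, "Vehicle": "V1", "Gender": "F", "Age": 34,
   "Display": 1, "Touch": 1, "Hear": 0, "Speak": 0, "Latency": 0,
   "SubtaskDuration": 3.2,
   "GlanceStartSeconds": [12.5, 13.4], "GlanceEndSeconds": [13.4, 15.7],
   "GlanceLocation": ["road", "center stack"]}
]
"""


def ordered_subtasks(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sort subtasks by study, vehicle, participant, task, trial, then start time."""
    return sorted(
        records,
        key=lambda r: (r["Study"], r["Vehicle"], r["Participant"],
                       r["TaskCode"], r["Trial"], r["StartTime"]),
    )


records = ordered_subtasks(json.loads(SAMPLE))
for r in records:
    print(r["TaskCode"], r["StartTime"], r["SubtaskDuration"], r["GlanceLocation"])
```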

Glance locations for glances included in this dataset are described in Table 2.

Glance location | Description
road | The forward windshield
center stack | The center stack of each vehicle
left | Left window and/or left side mirror
right | Right window and/or right side mirror
rearview mirror | Rearview mirror mounted on or near windshield
instrument cluster | Instrument cluster (with speedometer, etc.)
right blind spot | Over-the-shoulder, to the right
left blind spot | Over-the-shoulder, to the left
passenger | The passenger in the front passenger seat
other | All other non-road locations
Table 2: Glance locations and descriptions.
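Given the three glance arrays of a subtask, simple summary measures can be derived. The sketch below counts and sums glances to any location other than "road"; the function name is hypothetical, and treating every non-road location (including mirrors) as off-road is an assumption that a given analysis may wish to change.

```python
from typing import Dict, List


def off_road_summary(starts: List[float], ends: List[float],
                     locations: List[str]) -> Dict[str, float]:
    """Summarize glances whose location is anything other than 'road'."""
    durations = [
        end - start
        for start, end, location in zip(starts, ends, locations)
        if location != "road"
    ]
    return {
        "count": len(durations),
        "total_seconds": sum(durations),
        "longest_seconds": max(durations, default=0.0),
    }


# Invented example values for one subtask.
summary = off_road_summary(
    starts=[0.0, 1.0, 2.5],
    ends=[1.0, 2.5, 3.0],
    locations=["road", "center stack", "instrument cluster"],
)
print(summary)
```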


It is the authors’ hope that this dataset will make a substantial contribution to computational modelling of roadway safety.


International Organization for Standardization (2014). ISO 15007-1 Road vehicles - Measurement of driver visual behaviour with respect to transport information and control systems - Part 1: Definitions and parameters. ISO, Geneva, Switzerland.

Introducing JSON. (n.d.). Retrieved August 17, 2018, from http://www.json.org/

Mehler, B., Kidd, D., Reimer, B., Reagan, I., Dobres, J., & McCartt, A. (2016). Multi-modal assessment of on-road demand of voice and manual phone calling and voice navigation entry across two embedded vehicle systems. Ergonomics, 59, 344–367.

Mehler, B., Reimer, B., Dobres, J., & Coughlin, J.F. (2015a). Assessing the Demands of Voice Based In-Vehicle Interfaces - Phase II Experiment 3 - 2015 Toyota Corolla. Massachusetts Institute of Technology, 2015.

Mehler, B., Reimer, B., Dobres, J., McAnulty, H., Mehler, A., Munger, D., & Coughlin, J.F. (2014a). Further Evaluation of the Effects of a Production Level Voice-Command Interface on Driver Behavior: Replication and a Consideration of the Significance of Training Method. MIT AgeLab, Cambridge, MA.

Mehler, B., Reimer, B., Dobres, J., McAnulty, H., & Coughlin, J.F. (2015b). Assessing the Demands of Voice Based In-Vehicle Interfaces - Phase II Experiment 1 - 2014 Chevrolet Impala. Massachusetts Institute of Technology, Cambridge, MA.

Mehler, B., Reimer, B., McAnulty, H., Dobres, J., Lee, J., & Coughlin, J.F. (2014b). Assessing the Demands of Voice Based In-Vehicle Interfaces - Phase II Experiment 2 - 2014 Mercedes CLA. Massachusetts Institute of Technology, Cambridge, MA, 2015.

National Highway Traffic Safety Administration. (2012). Visual-Manual NHTSA Driver Distraction Guidelines for In-Vehicle Electronic Devices. National Highway Traffic Safety Administration, Washington, D.C.

Reimer, B., Mehler, B., Dobres, J., & Coughlin, J. F. (2015). Assessing the Demands of Voice Based In-Vehicle Interfaces - Phase II Experiment 4 - An Exploratory Study of Driver Behavior With and Without Assistive Cruise Control (ACC)(D). MIT AgeLab Technical Report 2015-15.

Sawyer, B. D., Mehler, B., & Reimer, B. (2017). Toward an Antiphony Framework for Dividing Tasks into Subtasks.