“The eyes which are the windows of the soul.”
— Plato (427 BC - 347 BC)
Eye movements are crucial but implicit cues for determining people’s attention. Gaze estimation enables the study of visual perception mechanisms in humans, and has been used in many fields, such as action recognition[Fathaliyan-Frontiers2018], situation awareness estimation[Dini-IROS2017], and driver attention analysis[Maekawa-ICCVW2019]. It is also a non-verbal communication method, and thus, it can be applied to shared autonomy [Admoni-AAAIS2016] or teleoperation [Webb-ACC2016] in the context of Human-Robot Interaction (HRI).
Recent studies have enabled human attention mapping in 3D environments using mobile eye-tracking glasses [Munn-ETRA2008][Paletta-IRCV2013]. Most approaches compute a 3D gaze by extending a measured 2D gaze vector from a camera pose estimated by visual localization or motion capture systems in a pre-built static 3D map. These approaches assume static environments; however, the real world is a place of constant change, with objects appearing and disappearing from the scenes. Human attention analysis in both the spatial and temporal domains is still an open problem, whose solution will help determine human behavior in the real world.
To address this issue, we propose a comprehensive framework for 4D attention mapping (see Fig.1). The main contributions of this study are three-fold:
A new framework, 4D Attention, is proposed for capturing human attention to static and dynamic objects by assembling 6-DoF camera localization, rapid gaze projection, and instant dynamic object reconstruction. Human attention is accumulated on each 3D mesh model, which makes gaze mapping much more meaningful, for example, the semantic analysis of perceptual activities rather than generating cluttered 3D gaze point clouds.
The framework is designed so that scene rendering plays a central role. This makes the entire system simple and does not require additional map or object model representations for localization and attention mapping. Additionally, it facilitates a unified attention-mapping procedure regardless of the target objects.
We examined the accuracy and precision of our method using a moving target board whose ground truth position was measured by a total station. Additional experiments for monitoring human attention in the real world demonstrated the capability of analyzing human attention in static and dynamic targets including maps, household items, and people, during the free movement of the subject.
II Related Work
Eye movement patterns: Eye movements imply visual perception activities. Several approaches have inferred or determined perceptual activities based on observations from electrooculography (EOG). Bulling et al. [Bulling-TPAMI2011] and Ishimaru et al. [Ishimaru-UbiComp2014] determined daily activities including typing, reading, eating, and talking, using signals from EOG glasses. This approach allows us to identify the current activity of a subject without complex settings, and can be applied to HCI to provide relevant services.
2D contextual analysis: However, human beings live in a context. Visual perception activities are not independent of the surrounding environment; in fact, they are induced by “attractive” objects in the scene. Eye-tracking and gaze overlay on 2D camera views make it possible to determine the focus of the subject, as in [Pelz-EI2011]. For semantic human attention analysis in natural environments, Fritz and Paletta [Fritz-ICIP2010] introduced object recognition in mobile eye tracking using local image descriptors. A similar approach can be observed in [Toyama-ETRA2012], which identifies objects fixated by the subject for a museum guide. [Harmening-SAGA2013] further progressed toward online object-of-interest recognition using a hierarchical visual feature representation.
3D gaze mapping: For the holistic estimation of human attention, recent techniques have attempted to obtain fixations in the real 3D world, beyond the image plane. [Pfeiffer-ETRA2012] and [Dini-IROS2017] extended 2D gaze mapping by combining it with a motion capture system to track the pose of gaze glasses, which enables the measurement of the 3D point of interest. [Pfeiffer-ETRA2016a] built a similar system relying on visual markers for monocular camera tracking and 3D gaze analysis. However, these systems require a complex setup of multiple sensors, making the measurement area small and unscalable to large environments. Thus, several approaches compute the 3D gaze by localizing an agile monocular camera using visual localization or structure-from-motion. [Munn-ETRA2008] was the pioneering work, followed by promising techniques such as [Paletta-IRCV2013, Hagihara-AH2018], which estimated camera poses using visual features and projected 3D gaze information onto the pre-built 3D environment map.
Toward attention analysis in the real world: 3D gaze mapping facilitates the analysis of human attention regardless of the scale of the environment; however, these methods still operate only in static environments. Attention analysis in dynamic situations remains an open problem; spatio-temporal attention analysis must be addressed to truly comprehend perceptual activities in the real world.
III Proposed Method
III-A System overview
In this study, we propose a comprehensive framework to capture 4D human attention, which is attention in the spatial and temporal domains in dynamic environments. A schematic overview of the proposed system is depicted in Fig.2. Obtaining 4D human attention from eye-tracking glasses with a scene camera has three main problems that need to be solved: robust camera localization, rapid 3D gaze mapping, and instant processing of dynamic objects.
Principally, 4D attention mapping is performed by projecting a first-person 2D human gaze onto a 3D environment map (static) and moving objects (dynamic). It first requires accurate and stable 6-DoF camera localization even in dynamic environments, where the appearance of the current view can differ significantly from the pre-built 3D map. Additionally, given the camera pose, the system has to compute the intersection of the gaze ray and the target object surface in real-time to record the 3D distribution of the subject’s interest. Furthermore, dynamic objects such as humans or household items do not stay in the same position but change their poses. Therefore, they cannot be captured in the 3D map in advance; instead, they must be processed on the fly.
In this section, we describe the major components of the framework shown in Fig.2 that are assembled to address these issues and capture 4D attention in the real world.
III-B1 Monocular camera localization
Visual localization is used to infer the pose of an agile monocular camera in a given 3D map. It can be categorized into indirect methods via feature point matching and direct methods via appearance comparison. Although major 3D gaze mapping methods[Paletta-IRCV2013][Hagihara-AH2018] rely on indirect methods to estimate the camera pose, they require the construction and maintenance of an extra feature-point 3D map for localization. As will be explained in Section III-C, the subject’s gaze is projected and accumulated on the dense 3D environment map (or dynamic object models); thus, this requirement doubles the map-building cost. It also incurs other problems such as a 7-DoF exact alignment (including scale) between the environment and feature-point maps.
Therefore, for a simple and straightforward system, we employ a direct localization method, specifically C*[Oishi-RAL2020], which facilitates the localization of the agile monocular camera with only the colored 3D environment map. It leverages an information-theoretic cost, the Normalized Information Distance (NID), to directly evaluate the appearance similarity between the current camera view and the 3D map. It achieves high robustness to large appearance changes owing to lighting conditions, dynamic obstacles, or different sensor properties[Oishi-RAL2020], and thus requires minimal effort in map management.
Given the current view, C* estimates the camera pose in the world coordinate system via local tracking against a synthetic keyframe rendered at a known pose. C* reduces the localization problem to alternating local tracking and occasional keyframe rendering for efficiency, which leads to 6-DoF real-time localization regardless of the 3D map scale.
The NID metric between the current frame I_c and keyframe I_k is given by NID(I_c, I_k) = (H(I_c, I_k) − MI(I_c; I_k)) / H(I_c, I_k), where H(I_c, I_k) and MI(I_c; I_k) denote the joint entropy and mutual information, respectively, calculated based on the color co-occurrence in I_c and I_k. To determine the most likely relative pose, gradient-based optimization is performed. Specifically, starting from the given initial guess or the previously estimated pose, BFGS is employed to iteratively minimize the NID according to its Jacobian.
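As a concrete illustration, the NID cost can be sketched as follows. This is a minimal CPU-side sketch using a joint color histogram over two grayscale frames; the function name and bin count are illustrative assumptions, whereas C* evaluates the cost over rendered views with GPU acceleration:

```python
import numpy as np

def nid(frame_a, frame_b, bins=16):
    """Normalized Information Distance between two intensity images.

    Returns a value in [0, 1]: 0 for identical appearance,
    approaching 1 for statistically independent frames.
    """
    # Joint histogram of intensity co-occurrence between the frames
    hist, _, _ = np.histogram2d(frame_a.ravel(), frame_b.ravel(),
                                bins=bins, range=[[0, 256], [0, 256]])
    p_joint = hist / hist.sum()
    p_a = p_joint.sum(axis=1)  # marginal of frame_a
    p_b = p_joint.sum(axis=0)  # marginal of frame_b

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    h_joint = entropy(p_joint.ravel())          # H(A, B)
    mi = entropy(p_a) + entropy(p_b) - h_joint  # MI(A; B)
    return (h_joint - mi) / h_joint
```

Because NID is a metric normalized by the joint entropy, it stays comparable across scenes with different texture richness, which is one reason for its robustness to appearance changes.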
III-B2 Visual-inertial integration for rapid head and eye movement tracking
C* is capable of providing reliable camera poses at several tens of hertz. To track the rapid head movements of the subjects, we further fuse the localization results with measurements from an Inertial Measurement Unit (IMU) calibrated to the camera in a loosely coupled manner[Lynen-IROS2013]. This allows us to achieve estimation rates of several hundred hertz, according to the IMU rate. Simultaneously, it significantly stabilizes visual localization by forming a closed loop that feeds the output pose back into the localizer as the next initial guess of the optimization. Localization boosting and stabilization benefit real-time gaze projection, as described in the following section.
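A toy sketch of the loosely coupled scheme is shown below. It is illustrative only: the actual framework [Lynen-IROS2013] runs an EKF on the full 6-DoF state, whereas this sketch propagates position with world-frame IMU accelerations between visual updates and blends in the localizer output with a constant gain (all names are hypothetical):

```python
import numpy as np

def imu_propagate(p, v, acc_world, dt):
    # High-rate prediction between localization updates
    # (constant-acceleration integration over one IMU step)
    p_new = p + v * dt + 0.5 * acc_world * dt * dt
    v_new = v + acc_world * dt
    return p_new, v_new

def localization_update(p_pred, p_meas, gain=0.3):
    # Loosely coupled correction: blend the IMU prediction toward the
    # localizer output. The fused pose is fed back to the localizer as
    # the next initial guess of the NID optimization (closed loop).
    return p_pred + gain * (p_meas - p_pred)
```

Between two visual fixes the filter can emit poses at the IMU rate, which is what enables gaze projection to keep up with rapid head motion.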
III-C 3D gaze projection onto the environment map
Given the camera pose (subject’s head pose) and the gaze position on the 2D image, the 3D human gaze can be recovered by generating a 3D ray originating at the camera center and passing through the gaze point. To determine the fixation point, the intersection of the gaze ray and the target object must be calculated.
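The gaze-ray construction can be sketched with a standard pinhole camera model (function and variable names are hypothetical; lens distortion is ignored):

```python
import numpy as np

def gaze_ray(pose_wc, K, gaze_px):
    """Back-project a 2D gaze point into a world-frame 3D ray.

    pose_wc: 4x4 camera-to-world transform (estimated head pose)
    K:       3x3 camera intrinsics
    gaze_px: (u, v) gaze position on the image
    Returns (origin, direction), with direction normalized.
    """
    u, v = gaze_px
    # Direction in the camera frame (pinhole model)
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    d_cam /= np.linalg.norm(d_cam)
    # The ray starts at the camera center and is rotated into the world frame
    origin = pose_wc[:3, 3]
    direction = pose_wc[:3, :3] @ d_cam
    return origin, direction
```

Intersecting this ray with the scene is the expensive step that the ID texture mapping described next replaces with a rendering pass.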
Ray casting can be computationally expensive for real-time operation. Therefore, Paletta et al. [Paletta-IRCV2013] pre-computed a hierarchical map representation, specifically an Oriented Bounding Box Tree (OBB-Tree), and traversed the tree to rapidly find the intersection. In [Takemura-ToHMS2014] and [Matsumoto-MobileHCI2019], the authors estimated the 3D gaze point by first applying Delaunay triangulation to the feature point map, detecting the triangular plane that includes the 2D gaze, and finally recovering the 3D gaze point in the world coordinate system from the triangle vertices. Although these methods work efficiently, they require pre-computation to build certain data structures for 3D gaze mapping, and their resolutions significantly affect the balance between the runtime computation cost and mapping accuracy. Furthermore, when dealing with dynamic objects that are not included in the pre-built 3D environment map, a more flexible scheme that does not require rebuilding the data structure each time is preferable.
Thus, for a unified framework of human gaze projection, we propose ID texture mapping as depicted in Fig.3. Texture mapping is a very popular method for attaching a highly detailed appearance to a geometric model to produce realistic rendered images. Given a 3D mesh model, its texture image, and per-vertex UV coordinates, we can generate a textured 3D model with GPU acceleration. Any texture image can be used in texture mapping; therefore, we attach a 32-bit integer texture in which each texel stores a unique ID determined by its position, and use it for gaze projection. Specifically, we determine the pixels that are currently observable by rendering the 3D map from the camera pose with the ID texture, and directly find the 3D gaze point by accessing the pixel corresponding to the 2D gaze point.
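A minimal CPU-side sketch of the ID texture lookup follows. In the actual system the ID image is rendered on the GPU from the estimated camera pose; here the rendered ID image is assumed given, with -1 marking background pixels, and the ID encoding (linear texel index) is an illustrative choice:

```python
import numpy as np

def encode_id_texture(tex_h, tex_w):
    # 32-bit integer texture: each texel stores its own linear index
    return np.arange(tex_h * tex_w, dtype=np.int32).reshape(tex_h, tex_w)

def project_gaze(id_render, gaze_px, tex_w):
    """Map a 2D gaze point to the model surface via the rendered ID image.

    id_render: ID texture values as seen from the current camera pose.
    Returns the (row, column) texel hit on the model's texture,
    or None if the gaze falls on the background.
    """
    u, v = gaze_px
    texel_id = int(id_render[v, u])
    if texel_id < 0:
        return None
    return divmod(texel_id, tex_w)  # decode linear index -> (row, col)
```

The lookup is a single array access, so the per-frame cost is dominated by the rendering pass, independent of map resolution.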
In addition to the simple setup and direct 2D-3D gaze association, the framework offers other benefits through the use of different types of textures. For example, by preparing another texture filled with zeros and counting gaze hits, attention accumulation can be easily managed on a 2D image, similar to the attention texture proposed in [Pfeiffer-ETRA2016b]. Additionally, overlaying a texture with object classes or semantics on the ID texture enables semantic understanding of the subject’s perceptual activities [Hagihara-AH2018] in a unified pipeline.
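Combining these two ideas, a hit-count texture plus a semantic label texture yields per-class attention statistics; a sketch under assumed array layouts (both textures share the model's UV parameterization):

```python
import numpy as np

def attention_by_class(hit_texture, label_texture):
    """Aggregate accumulated gaze hits per semantic label.

    hit_texture:   integer array counting gaze hits per texel
    label_texture: integer array assigning a class label to each texel
                   (hypothetical layout, aligned with hit_texture)
    Returns {label: total gaze hits} for semantic gaze analysis.
    """
    labels = np.unique(label_texture)
    return {int(l): int(hit_texture[label_texture == l].sum())
            for l in labels}
```

This keeps gaze statistics on the model surface rather than as cluttered 3D point clouds, which is what makes the later semantic analyses (for example, face vs. hands in Case 1) straightforward.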
ID texture mapping provides a simple yet efficient way of projecting the human gaze onto any geometric model, which is not limited to the map data. In the next section, we extend this framework to dynamic objects for 4D attention mapping.
(a) Evaluation 1: Static target, walking around; (b) Evaluation 2: Dynamic target, standing still; (c) Evaluation 3: Dynamic target, following
III-D Dynamic object handling for 4D attention mapping
Objects that do not exist during the map-building phase cannot be stored in the 3D environment map, which means that the map data record only static objects. However, many dynamic objects such as humans or household items are observed in daily life, and they appear to have “illegally” entered the static 3D map. The temporal gap between the mapping and runtime phases causes inconsistency in the absence or presence of dynamic objects, which leads to incorrect gaze projection.
Most conventional works focus only on static scenes and have no choice but to ignore dynamic objects. To analyze the human gaze in dynamic scenes, Fathaliyan et al. [Fathaliyan-Frontiers2018] proposed a 3D gaze tracking method that relies on a marker-based motion capture system installed in a small space. It queries the motion capture system for the tabletop objects’ poses at each moment and computes the intersections between the object models and the gaze vector; however, the setup is costly and does not scale to larger environments. For wearable 3D gaze acquisition outside the laboratory, Qodseya et al. [Qodseya-ECCVW2016] and Hausamann et al. [Hausamann-ETRA2020] developed eye-trackers equipped with depth sensors. They overlay 2D gaze points on the depth image and directly reconstruct the 3D human gaze. However, this scheme is highly sensitive to depth noise and limited by the maximum measurement range. Moreover, the 3D gaze information is represented as cluttered 3D point clouds, which makes gaze analysis less meaningful than accumulation on model surfaces.
To address this, our framework can incorporate additional object-reconstruction components that instantiate dynamic objects not captured in the 3D environment map. Recent developments in object recognition and tracking have made it possible to determine the full 3D shapes of target objects from monocular images on the fly. Here, we exploit two methods to handle rigid and non-rigid objects, specifically household items and human models, respectively, for 4D attention mapping. Notably, any desired component that estimates the poses and 3D shapes of specific objects can be incorporated, as explained below.
III-D1 Household item models (Rigid objects)
We introduce a pose detection and tracking method[Pauwels-TCSVT2016] into our system. Given the mesh models and textures of the target objects, it facilitates the recovery of the 6-DoF poses of hundreds of objects in real-time through scene simulation with SIFT features. The acquired information is sent to the same process as for the 3D environment maps described in Section III-C; by attaching an ID texture to each model (Fig.4) and rendering it at the estimated 6-DoF pose, we can easily associate the 2D human gaze with the object model surface. Notably, Multiple Render Targets (MRT) on OpenGL are used to create an integer mask image that helps distinguish the categories and individuals captured in the rendered view (see the bottom right of Fig.1). In the following experiments, an 8-bit integer mask was rendered in addition to the ID image in the MRT manner to distinguish up to 256 objects belonging to three categories: map, object, and human.
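Reading the MRT outputs at the gaze pixel can be sketched as follows; the category codes are an illustrative encoding, not the paper's exact one, and the rendered images are assumed given:

```python
import numpy as np

# Illustrative category codes for the 8-bit mask (assumed encoding):
# 0 background, 1 map, 2 object, 3 human
CATEGORIES = {0: "background", 1: "map", 2: "object", 3: "human"}

def classify_gaze_target(mask_render, id_render, gaze_px):
    """Identify what the subject is looking at from the MRT outputs.

    mask_render: 8-bit category mask rendered alongside the ID image
    id_render:   32-bit ID image (texel index on the hit model)
    Returns (category name, texel ID at the gaze pixel).
    """
    u, v = gaze_px
    category = CATEGORIES.get(int(mask_render[v, u]), "unknown")
    return category, int(id_render[v, u])
```

Rendering both buffers in one MRT pass keeps the per-frame cost of distinguishing map, objects, and humans negligible.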
III-D2 Human models (Non-rigid objects)
The human model is a representative example of non-rigid objects that are important for analyzing perceptual activity in the real world. Humans change their postures, unlike rigid objects; therefore, the reconstruction includes non-rigid deformation, making it more complicated than just detecting 6-DoF poses. In this research, we use a state-of-the-art method, FrankMocap[Rong-arXiv2020], to instantiate humans in the 3D environment map. It fits a statistical body model, SMPL-X[Pavlakos-CVPR2019], to each person captured in the input image and provides their shape and pose parameters. The renderer in our framework subscribes to these parameters to reconstruct the human models on demand and examines whether the 3D human gaze hits their surfaces, as with the rigid objects.
In this section, we verify the capability of the proposed framework to recover 4D human attention in dynamic environments. We first quantitatively evaluated the accuracy and precision of the recovered gaze points using a dynamic target marker, followed by demonstrations in real situations.
To build 3D environment maps, we used a LiDAR scanner, Focus3D (FARO Technologies, Inc.), which enabled us to capture dense, colored 3D point clouds. A panoramic spherical image can be generated by arranging the vertex colors; we used it as the texture of the 3D map while thinning out some vertices to save GPU memory. Notably, our method only assumes that colored or textured 3D models are available for localization and gaze mapping; thus, it also operates on 3D geometric models reconstructed with different sensors, for example, RGB-D SLAM [Lee-CVPR2020], similar to [Paletta-IRCV2013].
The rendering and localization components rely on GPU parallelization; a GeForce GTX2080 performed the computations in all the experiments. We also used a wearable eye tracker, Tobii Pro Glasses 3 (Tobii Technology, Inc.) to capture first-person views with the subject’s 2D gaze information and IMU data.
IV-B Performance evaluation
To evaluate the proposed attention mapping, AprilTag [Olson_ICRA2011], which provides reliable 6-DoF marker poses, was employed as shown in Fig.5, while the subject changed the relative position and the board’s state. We asked the subject to stare at the center of the target board ( [m]) while wearing the eye-tracker, and our method generated the corresponding 3D gaze points. In Evaluation 1, the board was embedded in the 3D map; thus, we calculated the Absolute Position Error (APE) between the generated 3D gaze points and the center of the board. In Evaluations 2 and 3, the ground truth trajectories of the agile target board were obtained by tracking a total station prism attached to the board, with a known relative transformation, using a Trimble S7 (Trimble Navigation, Limited). Subsequently, we synchronized the pairs of trajectories based on system timestamps to evaluate the Absolute Trajectory Error (ATE)[Zhang-IROS2018] with a least-squares transformation estimation[Umeyama-TPAMI1991], in addition to the APE. Notably, the 3D trajectory comparison computes a rigid transformation that minimizes the positional errors between the two point clouds. This minimization cancels the systematic bias underlying the framework, which is caused by factors such as eye-camera miscalibration. Therefore, the ATE approximately corresponds to the precision of our framework, whereas the APE corresponds to its accuracy.
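The APE/ATE computation can be sketched as follows: APE compares the synchronized trajectories directly, whereas ATE first applies the least-squares rigid alignment [Umeyama-TPAMI1991] (with scale fixed to 1), which cancels systematic bias. Function names are illustrative:

```python
import numpy as np

def umeyama_align(src, dst):
    """Least-squares rigid alignment (rotation R, translation t) that
    maps the Nx3 trajectory `src` onto `dst`, following Umeyama."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    cov = (dst - mu_d).T @ (src - mu_s) / len(src)
    U, _, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1  # guard against reflections
    R = U @ S @ Vt
    t = mu_d - R @ mu_s
    return R, t

def ape_rmse(est, gt):
    # Absolute Position Error, no alignment (~accuracy)
    return np.sqrt(np.mean(np.sum((est - gt) ** 2, axis=1)))

def ate_rmse(est, gt):
    # Absolute Trajectory Error after rigid alignment (~precision)
    R, t = umeyama_align(est, gt)
    return ape_rmse(est @ R.T + t, gt)
```

A trajectory offset by a constant rigid transform thus yields a nonzero APE but a near-zero ATE, which is exactly the accuracy/precision distinction used in Table I.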
Evaluation 1: We demonstrated the performance of our framework in a static scene and compared it with the most relevant work [Paletta-IRCV2013] as a baseline. Specifically, we implemented [Paletta-IRCV2013] with its localizer replaced by state-of-the-art indirect visual localization[Campos-TRO2021] for a comparison on the same 3D map while retaining the concept of the method. Compared with [Paletta-IRCV2013], 4D Attention achieved higher accuracy of 3D gaze mapping, benefiting from the rendering-centered framework, namely direct localization and ID texture mapping, which suppress systematic errors.
Evaluation 2: The subject, standing at four different positions, watched the center of the moving target to evaluate the influence of proximity, following the evaluations in previous studies[Paletta-IRCV2013][Hagihara-AH2018]. Overall, although the APE (inaccuracy) increased proportionally with the distance from the target board, the framework successfully suppressed the increase in the ATE (imprecision).
Evaluation 3: The subject walked around a [m] space, following the moving target board approximately 1.5 [m] behind while watching its center. Notably, the subject and the person to follow held an assistance rope to maintain their distance. Although the APE and ATE increased slightly owing to the difficulty of instant 6-DoF object reconstruction in this complicated situation, the proposed framework successfully facilitated valid attention mapping even in highly dynamic environments.
|No.||object state||subject state||distance from board [m]||APE [m] (inaccuracy)||ATE [m] (imprecision)|
|1||static||walking around||1.0 - 2.5||†||-|
|3||dynamic||following||approx. 1.5||||
†: Errors of 3D gaze points generated by [Paletta-IRCV2013] (our implementation) as a baseline.
(a) Case 1: Observe physical actions of a person
(b) Case 2: Take a coffee break
(c) Case 3: Pass by a person and buy a drink from a vending machine
To further evaluate our method, we performed attention mapping in three realistic situations, as shown in Fig. 7. Figure 8 highlights “attractive” models in each case, on which the accumulated human gaze is visualized. 4D Attention robustly estimated the subject’s poses and 3D gaze directions, and simultaneously projected the human gaze onto static and dynamic targets. This facilitates the analysis of human intention and semantic understanding of the subject’s perceptual activities in the real world.
Case 1: As described in Sec.III-C, attaching different types of textures to the models makes it possible to access various properties of the models, for example, semantics (see Fig. 8(a)). We can easily understand which body parts the subject was focusing on (the face and hands, in this case).
Case 2: Instant object reconstruction allows us to observe human attention in highly dynamic situations, for example, object manipulation. In Case 2, after pouring hot water into the mug, the subject picked up the freebies and took one. By accumulating gaze information on the models, we may acquire cues to determine the reason for the subject’s choice (Fig. 8(b)).
Case 3: We simulated a more realistic situation: the subject walked to a vending machine, passing by a person, and bought a drink from it. Our method successfully provided the trajectory of the subject and the attention to static and dynamic objects (Fig. 8(c)), which helps in determining human behavior in the spatio-temporal domain.
In this section, we discuss the contributions, limitations, and practicality of the proposed method. Table II comprehensively compares the characteristics of different works; our framework is distinguished from other competitive methods in several aspects, for example, coverage of various targets, real-time operation, and easy setup on a simple 3D map. In particular, the rendering-centered framework provides significant benefits through direct localization and gaze projection via ID texture mapping, which leads to highly accurate attention mapping, as demonstrated in the evaluations.
Map-based methods, however, require a denser 3D map for accurate localization and attention mapping, which can also be a limitation of 4D Attention. Large 3D map reconstruction and rendering can restrict the application of the method to certain scenes. Fortunately, 3D reconstruction technologies, such as SLAM with LiDAR[Yokozuka-ICRA2021] or RGB-D cameras[Lee-CVPR2020], have evolved and are widely available. Techniques such as view frustum culling[Assarsson-JGT2000] also help in rendering large 3D maps for real-time processing for further applications in indoor and outdoor environments.
Moreover, as demonstrated in Section IV-C, learning-based shape inference, for example, [Rong-arXiv2020][Manhardt-arXiv2020], enables attention mapping to unknown dynamic objects by reconstructing target shapes on the fly. This also allows an easier setup, freeing us from 3D modeling of specific objects, and extends our framework to various use cases.
We developed a novel gaze-mapping framework to capture human attention in the real world. The experiments demonstrated that the combination of robust camera localization, unified attention mapping, and instant object reconstruction enables access to 4D human attention.
The proposed system is capable of providing a series of human head poses (trajectory) and simultaneous gaze targets; thus, it would be applicable in action recognition, for example, skill-level evaluation in humanitude tender-care [Nakazawa-JIRS2019]. It also allows us to incorporate any desired components of instance object reconstruction into the framework, which facilitates attention analysis to specific objects and is helpful for gaze-based target selection in dynamic scenes [Chacn-IROSW2018]. Additionally, gaze accumulation on 3D models with multiple textures enables semantic analysis of human behavior.
|method||target||scalable||Real-time||sensors except||map||localization||attention mapping|
|2-3[0.5pt/1pt]||static map||dynamic objects||eye tracker|
|[Fathaliyan-Frontiers2018]||✓†||✓||Motion capture||-||Motion capture||Ray casting|
|[Dini-IROS2017]||✓†||✓||Motion capture||-||Motion capture||Ray casting (Sphere approx.)|
|[Maekawa-ICCVW2019]||✓||✓||LiDAR & Motion capture & IMU||3D point cloud||AMCL & Motion capture||Exhaustive examination|
|[Paletta-IRCV2013]||✓||✓||✓||RGB camera||Color meshes & feature points‡||Indirect visual localization||OBB-Tree|
|[Pfeiffer-ETRA2016a]||✓†||✓||RGB camera (& Kinect)||-||Visual markers||Ray casting (Box approx.)|
|[Hagihara-AH2018]||✓||✓||RGB camera||Color meshes & feature points‡||Structure-from-Motion||Ray casting|
|[Matsumoto-MobileHCI2019]||✓||✓||RGB camera||[Simultaneously built]||Multi-View Stereo & Geometry||Projection onto Delaunay triangles|
|[Qodseya-ECCVW2016]||✓||✓||✓||Stereo camera||[Simultaneously built]||RGB-D SLAM||3D cluttered points from the depth|
|[0.5pt/1pt] Proposed||✓||✓(rigid&non-rigid)||✓||✓||RGB camera (& IMU)||Color meshes||Direct visual localization (C*)||ID texture mapping|
†: Optical or visual marker(s) should be associated with each object for pose tracking.
‡: Construction of an extra feature-point map that is strictly aligned to the 3D map is required for localization.