First presented by Vannevar Bush and later restated by Jim Gray, the Personal Memex envisions the capability to capture and record personal experiences in daily life (bush1945we). Rapid improvement in mobile camera and signal processing technology has put us on the verge of realizing this vision. In particular, cameras in wearable forms, such as GoPro, Google Clips, and Spectacles from Snapchat, enable people to take photos and record videos anywhere, anytime.
However, challenges remain. Personalization is the core value of Personal Memex. Videos and photos captured by devices must match people’s personal interests to remain relevant and valuable. A fundamental limitation of existing wearable cameras is the inability to automatically detect and selectively record the visual content of personal interest. Most existing wearable cameras, such as GoPro and Spectacles, require manual intervention for content recording. Empowered by machine vision technologies, the recently released Google Clips performs auto-capture if interesting scenes are detected. However, scene interest is estimated solely by the built-in machine vision algorithm: instead of being personalized to the user, it provides a one-size-fits-all solution. As one third-party reviewer noted, “Clips’ standout feature – artificially intelligent auto-capture – is too unpredictable, keeping it from being truly satisfying” (Low18review).
Recent advances in eye-tracking technologies offer the capability to measure human visual attention (9157393; anton2013attentional). According to physiology and psychology research (Broadbent1957A), human attention serves as an information filter and prioritization strategy (mancas2016human; Broadbent1957A), which selectively determines the allocation of cognitive resources. It operates as a gate that connects human inner consciousness with the outer world (2006The). Via the attention gate, humans select the prioritized information regarding the most interesting areas or objects from a huge amount of unstructured perceived information (mancas2016human). Physiologically, humans direct their gaze to objects of interest (nelson2018). Human eye movement consists of several dynamic patterns, including saccade, smooth pursuit, vergence, and fixation (wang2019neuro); saccade and smooth pursuit account for most eye movements. Saccades are conjugate eye motions, where gaze is rapidly shifted from one location or object to another. They can be voluntarily directed or involuntarily triggered by visual stimuli. Smooth pursuit is smooth eye motion allowing the gaze to continuously follow an object of interest. Saccades and smooth pursuit may alternate (behera2005recurrent). They can be measured and used to produce visual attention time series, as shown in Fig. 1. Eyewear systems may include, or be retrofitted with, eye-tracking technologies, e.g., inward-facing eye cameras, to measure human regional gaze fixation within the field of view and infer the focused region of human visual attention.
Despite its promising potential, eye tracking suffers from limitations that complicate practical use. First, eye tracking does not directly detect the object of human visual interest. Instead, it estimates human visual regional focus by measuring the angular direction of eye gaze relative to the field of view of the eyewear. Measurement noise, e.g., angular offset and head motion, can easily produce incorrect visual attention measurements. In addition, eye movement is noisy. Human gaze frequently experiences transient jitter: gaze may drift away from the object of interest due to other visual stimuli or noise (as shown in Fig. 1). Therefore, eye tracking alone is inadequate for reliable attention tracking. A potential remedy to this problem is to augment eye tracking with visual scene analysis using video analytics techniques, e.g., identifying the salient object of human visual focus using instance segmentation (yang2019video) and analyzing the spatial visual context using object detection and recognition (937604). This motivates us to combine eye-tracking with video analytics to enable accurate human visual attention detection and tracking. However, machine vision techniques are computationally demanding and energy-intensive, limiting adoption in resource-constrained wearable settings. These challenges must be addressed in order to accomplish our goal of accurate and efficient human visual attention tracking.
This paper presents MemX, a biologically-inspired attention-aware eyewear system to enable automated, highly personalized capture of interesting visual content, which is recorded in compact video snippets. We retrofit off-the-shelf glasses with an inward-facing eye camera and a forward-facing world camera. The eye camera performs eye tracking to determine human gaze direction. The world camera samples and analyzes the field of view. We propose a new temporal visual attention (TVA) network, which unifies eye tracking and video analysis, to enable accurate and computationally efficient human visual attention tracking and salient visual content analysis. Specifically, the eye camera performs eye tracking to determine human visual regional focus, which confines computation-intensive video analytics tasks solely to regions of personal interest, thereby minimizing computing time and energy costs. Working in tandem, the video analytics component analyzes the visual scene content and identifies the salient object of human visual focus as well as the spatial visual context, thereby improving the accuracy and stability of human visual attention tracking. In addition, the high-resolution scene frame is used as supplementary input to the TVA network for more accurate attention detection. Such high-resolution scene frames are captured only when likely attention is detected, which significantly reduces energy consumption by decreasing the average sampling frequency. Furthermore, eye movement types, i.e., saccades or smooth pursuit, act as an “intermediate” supervision to be incorporated into the learning process.
This work makes the following contributions.
We propose a new temporal visual attention (TVA) network that unifies eye-tracking and video analytics, to enable accurate and computation-efficient human visual attention tracking.
We develop MemX, a new biologically inspired, attention-aware eyewear system. Equipped with the proposed TVA network, MemX automatically detects human visual attention and uses it to analyze and record visual content of personal interest.
MemX is evaluated using the YouTube-VIS dataset (yang2019video) and 30 participants. Experimental results show that compared to the eye-tracking-alone method, MemX significantly improves the attention tracking accuracy, while maintaining high system energy efficiency.
MemX can be applied in several personal visual content capture scenarios, such as sightseeing, lifelogging, academic events (e.g., meetings), shopping, and sports logging. We conduct 11 in-field pilot studies with different potential usage scenarios to evaluate MemX and illustrate some of its potential uses and benefits. According to the questionnaire survey, 96.05% of the moments of interest are successfully captured by MemX. In addition, the pilot studies demonstrate significant energy savings as compared with the record-everything case, with 86.36% energy savings on average.
The rest of the paper is organized as follows. Section 2 surveys related work. Section 3 presents the proposed MemX system and algorithm. Section 4 details MemX system design and implementation. Section 5 presents and discusses the experimental results. Section 6 presents the in-field pilot studies. We conclude the work in Section 7.
2. Related Work
This section summarizes the most relevant work in the areas of eye tracking, eye movement classification, visual attention, human interest, and computation-efficient video analytics.
2.1. Eye Tracking
There has been extensive work on gaze tracking and estimation. Two comprehensive reviews of existing approaches are provided by Hansen et al. (hansen2009eye) and Cazzato et al. (cazzato2020look). These existing methods can be categorized into two classes: model-based approaches (wood2014eyetab; wang2015atypical) and appearance-based approaches (zhang2017mpiigaze; zhang18_etra). For model-based methods, high-quality images and appropriate lighting conditions are critical, while appearance-based methods leverage more facial features, such as head orientation, from large-scale training data. Compared with model-based methods, appearance-based methods are less sensitive to image quality. Recent work has shown that appearance-based methods can offer good generalization capability even without user-specific data (zhang2019evaluation). Thanks to emerging large-scale data sets and deep learning techniques, the performance of learning-based eye-tracking models has been steadily improving (krafka2016eye; wong2019gaze; wang2019generalizing). Our system unifies eye tracking and video analytics to track human visual attention, allowing video analytics to focus on salient visual content, thereby reducing computation and energy costs.
2.2. Eye Movement Classification
As a fundamental step towards eye-tracking (komogortsev2010qualitative), eye movement classification has been extensively studied and a wide range of methods have been proposed, including classical algorithms (salvucci2000identifying), Bayesian Mixed Models (BMM) (Bayesian2012eyemovement), and deep learning (Raimondas2018gazeNet). Most classical approaches identify saccade and fixation movements using velocity-threshold identification, dispersion-threshold identification, and hidden Markov model identification (djanian2019eye). Later, Tafaj et al. (Tafaj2013) and Santini et al. (Santini2016Bayesian) added classification of smooth pursuit to BMM. Recently, deep learning has been used to classify eye movements. For example, Zemblys et al. proposed a CNN-based architecture for sequence-to-sequence eye movement classification (Raimondas2018gazeNet). Startsev et al. combined a 1D-CNN with a BLSTM to classify eye movements as fixation, saccade, and smooth pursuit in windows of up to 1 s (startsev20191d). Our system uses eye movement types as a supplementary “intermediate” supervision to drive system-level learning in MemX, and also to identify the instants of visual attention.
2.3. Visual Attention and Eye Tracking
Due to the limited processing capabilities of the biological vision system, attention (and high spatial resolution capture by the fovea) is directed to a small portion of a scene (carrasco2011visual), thereby reducing the computational demands on the visual cortex. Leveraging attention, humans can flexibly control limited computational resources (lindsay2020attention) to process the most relevant and preferential sensory information (anton2013attentional). Recently, extensive studies have combined visual attention and eye-tracking technologies. For instance, Hwang et al. proposed using eye-tracking to quantify human visual attention in an online shopping environment (hwang2018using). Katz et al. used eye-tracking to identify regional gaze fixation on a respiratory function monitor for medical diagnosis (katz2019visual). Wang et al. applied visual attention in unsupervised video object segmentation (wang2019learning). These works support the idea of combining eye-tracking and video analytics methods to enable accurate and efficient human attention detection and tracking.
2.4. Visual Attention and Human Interest
Previous work in physiology and psychology demonstrates that eye movements reveal human attention and that gaze positions indicate interesting and important viewed content. Jain et al. pointed out that eye movements can reveal changes or shifts in attention (jain2015gaze). For example, a saccade indicates a quick shift of visual attention, a fixation indicates that visual attention rests at the same location, and smooth pursuit suggests that visual attention follows a moving target smoothly (jain2015gaze). Tag et al. (10.1145/3027063.3053243) described a smart eyewear platform capable of attention tracking by measuring the cognitive engagement of people in different situations. Later, Abdelrahman et al. (abdelrahman2019classifying) pointed out that eye data can be indicative of human cognition (abdelrahman2019classifying; 2018Watch). For instance, gaze positions can reveal the attention locus (abdelrahman2019classifying). Similarly, Kranthi et al. (2018Watch) suggested that gaze can be used as an indicator of importance when humans watch a video. That is, gaze positions indicate important or interesting viewed content (2018Watch).
2.5. Computation-Efficient Machine Vision
Deep learning has achieved great success in machine vision applications, such as object detection, object classification, and semantic segmentation. Deep models, such as ResNet (he2016deep), DenseNet (huang2017densely), MobileNetV2 (sandler2018mobilenetv2), and MobileNetV3 (howard2019searching), are widely used in machine vision applications. However, deep models incur high computation costs, limiting adoption in resource-constrained mobile environments. The design of lightweight, computation-efficient network architectures is therefore an important research branch of the deep learning community. For example, Google’s MobileNetV2 (sandler2018mobilenetv2) is a representative lightweight network designed for mobile devices. MobileNetV2 uses linear bottlenecks and inverted residual blocks as its basic structure, aiming to solve the issue of feature degradation during training. Also, as pointed out by Lubana et al., energy consumption in imaging systems depends significantly on the transferred resolution (lubanaDigitalFoveationEnergyAware2018). They therefore proposed that energy consumption can be dramatically reduced if only the task-related information is input to deep models. Inspired by the above work, we propose a new lightweight temporal visual attention network, which fuses two visual input sources, i.e., temporal eye movements and field-of-view scene video, to offer higher computation efficiency while still guaranteeing task performance.
3. Temporal Visual Attention Network
3.1. Design Motivations
The vision of the personalized Memex builds on the key observation that content capture must match a user’s personal interest in order to be relevant and valuable. As visual attention directly reflects human attentive interest, the goal of this work is to design an accurate and efficient method for human visual attention tracking and salient content analysis. In this section, we first describe human visual attention, and then discuss the capabilities and limitations of eye-tracking and video analysis techniques, which motivate the proposed temporal visual attention (TVA) network, a unification of these two techniques.
Humans direct their foveas, the high-resolution regions of their retinas, toward locations attracting their attention: foveal orientations reveal locations of visual attention. Since real-world scenes often contain multiple stationary and moving objects, attention is a temporally and spatially selective process, driven by top-down influences and bottom-up sensing mechanisms. It starts with steering the eye gaze towards an object attracting attention, then continuously follows the object of interest, until new visual stimuli or top-down goals cause the fovea to drift away.
From the eye-tracking perspective, as stated in Section 1 and illustrated in Fig. 1, human eye movement patterns can be mainly classified into two distinct and alternating phases, namely saccade and smooth pursuit (leigh2015neurology). Saccade, the rapid conjugate eye shift, is triggered either voluntarily or by an external visual stimulus, indicating a change or a new moment of human visual attention. Smooth pursuit, on the other hand, refers to slow and smooth eye movements following the same object of interest. Together, a human visual attention epoch starts with a saccade phase, followed by a smooth pursuit phase. Therefore, eye movement analysis, e.g., saccade-smooth pursuit transition detection, can be used to detect and classify the moments of human visual attention. In addition, as described in Section 2, recent work on gaze tracking provides relatively accurate estimation of human visual regional focus by measuring the angular direction of eye gaze relative to the eyewear’s field of view. Furthermore, gaze tracking is computationally efficient, as it requires only low-resolution eye images and lightweight algorithms. However, as stated in Section 1, gaze tracking does not directly detect the object of interest, and cannot independently provide accurate and stable indications of human visual interest, due to limited tracking resolution and inherently noisy eye movement patterns (as shown in Fig. 1).
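The saccade-to-smooth-pursuit transition described above can be illustrated with a simple velocity threshold, in the spirit of the classical velocity-threshold identification mentioned in Section 2.2. This is a minimal sketch and not the classifier used in MemX: the function names, the 100 deg/s threshold, and the choice to lump fixation and smooth pursuit into one slow-movement label are all assumptions made for illustration.

```python
import math

def angular_speeds(gaze_deg, fps):
    """Per-sample gaze speed in deg/s from (x, y) gaze angles in degrees."""
    speeds = [0.0]  # the first sample has no predecessor
    for (x0, y0), (x1, y1) in zip(gaze_deg, gaze_deg[1:]):
        speeds.append(math.hypot(x1 - x0, y1 - y0) * fps)
    return speeds

def label_movements(gaze_deg, fps, saccade_thresh=100.0):
    """Label each sample 'saccade' (fast) or 'pursuit' (slow).

    The threshold is a hypothetical value; 'pursuit' here also covers fixation.
    """
    return ["saccade" if v > saccade_thresh else "pursuit"
            for v in angular_speeds(gaze_deg, fps)]

def attention_onsets(labels):
    """Indices where a saccade sample is immediately followed by a slow sample."""
    return [i for i in range(1, len(labels))
            if labels[i - 1] == "saccade" and labels[i] == "pursuit"]

# A gaze trace that dwells, jumps about 10 degrees, then dwells again.
trace = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (10.0, 0.0), (10.1, 0.0), (10.2, 0.0)]
labels = label_movements(trace, fps=30)
onsets = attention_onsets(labels)
```

A fixed threshold is sensitive to the measurement noise discussed in Section 1, which is one reason a learned classifier and additional scene evidence are attractive.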
A key motivation of this work is that video analysis techniques, e.g., object detection and classification, offer object- or semantic-level information, which is complementary to regional-level gaze tracking. It is thus possible to augment gaze tracking with video analysis techniques to perform accurate and stable human attention detection. However, object- and semantic-level video analysis tasks are data- and computation-intensive. Consider recent work on video instance segmentation (VIS) (yang2019video); VIS is one of the most broadly used video analysis tasks, consisting of simultaneous detection, segmentation, and tracking of object instances. The latest work on VIS adopts ResNet-50-FPN or ResNet-101-FPN (johnson2018adapting) as a feature extractor and uses a region proposal network (ren2015faster) to identify possible regions of objects. However, the high energy demand of the VIS task limits its adoption in computation- and energy-constrained wearable settings.
We observe that gaze tracking can help improve the efficiency of video analysis tasks. First, gaze tracking efficiently captures human visual regional focus, focusing video analysis on regions of interest and eliminating analysis of other, superfluous regions, thereby improving computational efficiency. Second, gaze tracking can detect human visual attention, enabling video analysis to be triggered only when human attention is likely. Furthermore, eye movement types, i.e., saccades or smooth pursuit, can serve as an “intermediate” supervision that can be incorporated into the learning process to regularize the learning problem, leading to more accurate decisions.
In summary, working in tandem, gaze tracking and video analysis can potentially deliver accurate and efficient human visual attention tracking and serve as the foundation to support personalized moment auto-capture.
3.2. Pipeline of the TVA Network
Here, we describe the overall pipeline of the proposed temporal visual attention (TVA) network that we deploy in the MemX system to detect the user’s visual attention.
Fig. 2 shows the proposed TVA framework that unifies eye-tracking and video analysis to detect the user’s visual attention. Two video streams serve as the inputs of this framework: (1) the scene stream $\{I^{s}_{t}\}_{t=1}^{T}$ captured by the forward-facing world camera; and (2) the video sequence $\{I^{e}_{t}\}_{t=1}^{T}$ captured by the inward-facing eye camera, where $T$ is the total number of time steps. For a given time step $t$, the TVA network predicts whether the user is attentive to a newly appeared instance or is tracking a previously attended instance in the current frame, and outputs a binary result $y_t \in \{0, 1\}$. If true ($y_t = 1$), MemX invokes high-resolution video recording. Otherwise ($y_t = 0$), the current frame is discarded to save energy.
This work develops MemX to enable automated, personalized capture of interesting visual content within an energy-constrained wearable form factor. To this end, the proposed TVA network aims to deliver accurate and energy-efficient visual attention detection by unifying eye-tracking and video analysis. To generate temporally consistent predictions, the TVA network continuously senses the inward-facing low-resolution eye video and extracts two essential gaze representations: the likelihood of historical eye movement types $\mathbf{M}_t$ and the historical gaze positions $\mathbf{G}_t$, as shown in Fig. 2. These two gaze representations are fused to make an initial prediction of whether a potential attention epoch is occurring. In particular, the likelihood of historical eye movement types $\mathbf{M}_t$ is used to capture the saccade-to-smooth-pursuit transition, which is a clear indicator of the beginning of an attention epoch. However, the gaze transition likelihood alone may not be reliable since eye movements are subtle and difficult to identify. Therefore, we also use the historical gaze positions $\mathbf{G}_t$ to improve detection robustness. Intuitively, if the majority of the recent gaze positions fall into a small circular region, we can predict with higher confidence that the user’s eye movement has indeed entered a smooth pursuit phase. Moreover, as shown by prior work on pupil detection and tracking (kassner2014pupil), eye tracking is computationally efficient and thus suitable for energy-constrained wearable scenarios. Eye tracking may therefore be used for every frame, which can effectively minimize the invocation of computation-intensive video analysis processing and energy-intensive high-resolution video recording.
As mentioned earlier, eye tracking alone is not sufficiently accurate for attention detection. To boost the accuracy of attention detection, the forward-facing high-resolution world camera is also used, but only when human attention is likely. The TVA network then extracts low-level scene features from the high-resolution frame $I^{s}_{t}$ and fuses them with the gaze features extracted from the historical gaze positions $\mathbf{G}_t$. Compared with the size of the original scene image ($224 \times 224$), the spatial resolution of the gaze feature is more compact ($56 \times 56$), and it can also guide our TVA network to focus on analyzing the attentive local region, which further reduces the computational cost.
The obtained scene features are then fused with the likelihood of historical eye movement types and the historical gaze positions in the TVA network to better predict the temporal attention of the user.
Instead of treating eye-tracking and video analysis as two separate tasks, we design a lightweight network to perform feature fusion. Specifically, we first apply a shallow network $f_s$ to obtain scene features $\mathbf{F}_t$ for frame $I^{s}_{t}$, i.e., $\mathbf{F}_t = f_s(I^{s}_{t})$. We then map the historical gaze tracking results $\mathbf{G}_t$ to gaze-position-related features $\mathbf{H}_t$ and the likelihood of eye movement phases $\mathbf{M}_t$. The mapping function is denoted as $f_g$, i.e., $(\mathbf{H}_t, \mathbf{M}_t) = f_g(\mathbf{G}_t)$. After that, we fuse $\mathbf{F}_t$ and $\mathbf{H}_t$ using a predefined operation that makes those features complementary to each other so as to augment the detectability of salient region focus. In addition, $\mathbf{M}_t$ is leveraged as an “intermediate” supervision to drive the learning process. The intuition is that if attention is drawn to an object, historical eye movements may follow a detectable pattern, which can be used as prior knowledge to supervise the learning network. For example, an eye-movement sequence of saccade, smooth pursuit, $\dots$, smooth pursuit can suggest the occurrence of attention with high confidence. Finally, the above output goes through a classification model to generate the final decision $y_t$ (attention or not). Next, we provide the detailed implementation of the proposed TVA network.
Given an incoming eye frame $I^{e}_{t}$, we want to detect user attention, i.e., whether the user is potentially gazing at an object within the field of view. We experimentally define the occurrence of potential attention as the majority of the gaze positions falling within a region of area $A_{th}$ or smaller during a time period $T_d$. The impacts of $A_{th}$ and $T_d$ on attention detection performance are evaluated in Section 5.
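The potential-attention criterion above can be sketched as a check over a window of recent gaze positions. This is an illustrative reading only: the text does not specify how the region is located, so the sketch assumes a circle centered on the per-axis median of the window, and the function and parameter names are hypothetical.

```python
import math
from statistics import median

def potential_attention(gaze_window, max_area):
    """Return True if most gaze points lie in one circle of area <= max_area.

    gaze_window: (x, y) gaze positions sampled during the time period.
    max_area: area threshold, in the squared units of the gaze positions.
    Assumption: the circle is centered on the per-axis median gaze position.
    """
    if not gaze_window:
        return False
    radius = math.sqrt(max_area / math.pi)
    cx = median(x for x, _ in gaze_window)
    cy = median(y for _, y in gaze_window)
    inside = sum(1 for x, y in gaze_window
                 if math.hypot(x - cx, y - cy) <= radius)
    return inside > len(gaze_window) / 2

# A tight cluster with one outlier versus a widely scattered window.
clustered = [(100, 100), (101, 99), (99, 101), (100, 102), (300, 40)]
scattered = [(0, 0), (50, 50), (100, 0), (0, 100), (100, 100)]
small_region = math.pi * 5 ** 2  # circle of radius 5
focused = potential_attention(clustered, small_region)
unfocused = potential_attention(scattered, small_region)
```

The median center keeps a single jitter sample from pulling the region away from the cluster, although it does not search for the best-fitting region in general.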
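If possible attention is detected at time $t$, we then trigger the world camera and sense the scene frame $I^{s}_{t}$ to further boost the confidence level that the user is indeed paying attention to an object. Specifically, let $\mathbf{G}_t = \{g_{t-N+1}, \dots, g_t\}$ be the predicted historical gaze positions during time steps $t-N+1, \dots, t$, where $g_i = (x_i, y_i)$, and let $\mathbf{M}_t = \{m_{t-N+1}, \dots, m_t\}$ be the predicted likelihoods of the historical eye movement types, where each $m_i \in [0, 1]$. The goal of the TVA network is to determine from $I^{s}_{t}$, $\mathbf{G}_t$, and $\mathbf{M}_t$ whether the user is actually gazing at an object within the scene frame, i.e., $y_t = \Phi(I^{s}_{t}, \mathbf{G}_t, \mathbf{M}_t)$, where $\Phi$ represents the proposed TVA network.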
After that, a shallow network $f_s$ is applied to the scene image $I^{s}_{t}$ to generate the scene features $\mathbf{F}_t$, i.e., $\mathbf{F}_t = f_s(I^{s}_{t})$. Since the historical gaze positions are discrete and not differentiable, we relax each historical gaze position into a continuous heatmap of size $56 \times 56$ based on Gaussian distributions. Specifically, for the $i$-th gaze position $g_i$, the value for a location $p$ in the heatmap can be computed as

$$H_i(p) = w_i \exp\left(-\frac{d(p, g_i)^2}{2\sigma^2}\right), \tag{1}$$

where $d(p, g_i)$ represents the Euclidean distance between $p$ and $g_i$, $\sigma$ is the standard deviation of the Gaussian, and $w_i$ is a pre-defined weight related to the time step $i$. Applying the Gaussian relaxation in Eq. 1 to each gaze position in $\mathbf{G}_t$ generates $N$ heatmaps of resolution $56 \times 56$. We denote those heatmaps as $\mathbf{H}_t$.
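The Gaussian relaxation of gaze positions can be sketched in a few lines of plain Python. The sketch assumes an isotropic Gaussian of the Euclidean distance scaled by a per-time-step weight; the width `sigma`, the exponential `decay` used for the weights, and the function names are illustrative assumptions, not values taken from the paper.

```python
import math

def gaze_heatmap(gaze_xy, weight, size=56, sigma=3.0):
    """Heatmap (indexed [y][x]) that decays with distance from one gaze position."""
    gx, gy = gaze_xy
    return [[weight * math.exp(-((x - gx) ** 2 + (y - gy) ** 2) / (2 * sigma ** 2))
             for x in range(size)]
            for y in range(size)]

def gaze_heatmaps(gaze_positions, decay=0.9, size=56, sigma=3.0):
    """One heatmap per historical gaze position.

    Assumption: the time-step weight decays geometrically, so the most
    recent gaze position gets weight 1 and older ones get smaller weights.
    """
    n = len(gaze_positions)
    return [gaze_heatmap(g, decay ** (n - 1 - i), size, sigma)
            for i, g in enumerate(gaze_positions)]

# Two historical gaze positions in heatmap coordinates, oldest first.
maps = gaze_heatmaps([(10, 10), (20, 30)])
peak_recent = maps[1][30][20]  # value at the most recent gaze position
```

Because every heatmap value is a smooth function of the gaze coordinates, the relaxed representation can be stacked channel-wise with scene features of the same spatial size.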
Finally, the gaze heatmaps $\mathbf{H}_t$, the scene features $\mathbf{F}_t$, and their element-wise product are concatenated channel-wise to fuse information from gaze analysis and scene images. The resulting features are then fed into one convolution layer Conv0 with 56 output channels. A subsequent fully-connected layer FC0 reduces the channels to two, representing the probability of the binary action. In addition, inspired by the work of Wang et al. (wang2020dynamic), we append the likelihood $\mathbf{M}_t$ of historical eye movement types to the input tensor of the FC0 layer to supervise learning during training and to help inference during testing.
4. MemX: An Attention-Aware Eyewear System
This section describes MemX’s hardware design and TVA network integration to support the operation of personalized moment auto-capture. Since energy efficiency is a key focus of MemX, this section also describes the energy model of MemX.
4.1. MemX Hardware Design
MemX is prototyped using smart eyewear with an inward-facing eye camera and a forward-facing world camera. The eye camera performs gaze tracking. The world camera samples and analyzes the field of view. Fig. 3 illustrates the hardware prototype of MemX. The first generation of MemX uses a Logitech B525 with 1,280 × 720 resolution as the front-view camera, installed on the upper side of the eyewear frame, and a Sony IMX 291 with 320 × 240 resolution as the eye camera, secured by a standalone arm. We are in the process of developing the second-generation hardware prototype. A key modification is to remove the standalone arm to improve usability and user-friendliness.
The first-generation prototype uses the Jetson Xavier NX mobile computing platform. We are in the process of migrating the MemX design to the Ambarella CV28 low-power vision platform directly integrated into one of the eyewear legs, with a battery in the other leg, similar to the hardware design of Spectacles from Snapchat. The CV28 platform is equipped with built-in advanced image processing, high-resolution video encoding, and CVflow computer vision processing capabilities (ambarellacv28). The typical power consumption of CV28 is in the range of 500 mW, which is suitable for battery-powered wearable design.
4.2. MemX Software Operation
MemX is equipped with the proposed TVA network, which uses attention tracking to achieve personalized moment auto-capture. The TVA network continuously tracks eye gaze to detect potential attention events. If a possible attention event is detected, the TVA network will turn on the world camera, and eye gaze and scene sequences are both fed into the TVA network to make the final attention decision.
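The two-stage operation described above amounts to a gated control loop: the inexpensive eye-tracking check runs on every eye frame, and the world camera and fusion stage run only when that check fires. The sketch below shows this control flow with the stages passed in as plain callables; all names are hypothetical stand-ins, not the MemX implementation.

```python
def memx_step(eye_frame, detect_potential, capture_scene, confirm_attention):
    """Run one time step; return (record, world_camera_used)."""
    if not detect_potential(eye_frame):        # stage 1: eye tracking only
        return False, False                    # world camera stays off
    scene_frame = capture_scene()              # stage 2: wake the world camera
    return bool(confirm_attention(eye_frame, scene_frame)), True

def run(eye_frames, detect_potential, capture_scene, confirm_attention):
    """Process a sequence and count recordings and world-camera wake-ups."""
    recorded = camera_wakeups = 0
    for frame in eye_frames:
        record, used = memx_step(frame, detect_potential,
                                 capture_scene, confirm_attention)
        recorded += record
        camera_wakeups += used
    return recorded, camera_wakeups

# Toy stand-ins: stage 1 fires on even frames, stage 2 confirms frames >= 4.
recorded, wakeups = run(
    range(10),
    detect_potential=lambda f: f % 2 == 0,
    capture_scene=lambda: "scene",
    confirm_attention=lambda eye, scene: eye >= 4,
)
```

Counting wake-ups separately from recordings makes the energy trade-off visible: a more selective first stage lowers the wake-up count even when the number of recorded moments is unchanged.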
As described in Section 3, attention is a direct indicator of salient visual content. Furthermore, the attention level, or the level of interest, can be quantified by $T_d$, the duration of the smooth pursuit phase; a longer duration implies stronger personal interest. Therefore, $T_d$ controls the selectivity of moment recording. In addition, the proposed TVA network uses computation-efficient eye-tracking to detect potential attention events. A larger $T_d$ makes eye-tracking more selective. Therefore, the fusion stage of the TVA network, which is more computation- and energy-intensive, will be triggered less frequently, thereby improving system energy efficiency. The details of how $T_d$ affects the performance and energy consumption of MemX are discussed in Section 5. One other design parameter of MemX is the duration of each video snippet. In the current implementation, MemX continuously records a detected moment until a pre-defined duration threshold is reached.
4.3. Energy Model
The energy consumption of MemX is mainly contributed by the world camera, eye camera, the TVA network, and also high-resolution video recording. Next, we characterize the energy consumption of these key components.
4.3.1. Energy Consumption of Imaging Pipelines
The operation of an imaging pipeline starts with sensing incoming light and converting it into electrical signals. Then, an image signal processor (ISP) receives the electrical signals and encodes them into a compressed format. The energy consumption of the imaging pipeline is contributed by the following three components (lubanaDigitalFoveationEnergyAware2018): $E_{\mathrm{sensor}}$ (image sensor), $E_{\mathrm{ISP}}$ (ISP), and $E_{\mathrm{comm}}$ (communication), as follows:

$$E_{\mathrm{imaging}} = E_{\mathrm{sensor}} + E_{\mathrm{ISP}} + E_{\mathrm{comm}}. \tag{2}$$
Next, we discuss the energy consumption of the aforementioned components.
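(1) Energy consumption of image sensing. The operation of an image sensor consists of three states, i.e., idle, active, and standby. The power consumption of the standby state is negligible (typically in the range of 0.5-1.5 mW) and is thus omitted from the energy model. Eq. 3 defines sensor energy, as follows:

$$E_{\mathrm{sensor}} = P_{\mathrm{sensor,idle}} \, T_{\mathrm{exp}} + P_{\mathrm{sensor,active}} \, T_{\mathrm{active}}, \tag{3}$$

where $P_{\mathrm{sensor,idle}}$ and $P_{\mathrm{sensor,active}}$ are the average power consumption when the sensor is in the idle state and active state, respectively. $T_{\mathrm{exp}}$ is the exposure time, and the image sensor is idle during the exposure phase. $T_{\mathrm{active}}$ is the time duration when the image sensor is active, which is determined by the ratio of the transferred frame resolution $R_{\mathrm{frame}}$ to the external clock frequency $f$, i.e., $T_{\mathrm{active}} = R_{\mathrm{frame}} / f$. Here, $P_{\mathrm{sensor,idle}}$ and $P_{\mathrm{sensor,active}}$ can be viewed as sensor-specific constants, and $T_{\mathrm{active}}$ is a linear function of sensor resolution ($R_{\mathrm{frame}}$).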
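(2) Energy consumption of ISP. The ISP operates in two states: idle and active. It is active during image processing ($T_{\mathrm{ISP}}$) and idle during image sensing. The time for image sensing is the sum of the exposure time and the transfer time of a frame of $R_{\mathrm{frame}}$ pixels, i.e., $T_{\mathrm{exp}} + R_{\mathrm{frame}} / f$, where $f$ is the external clock frequency of the sensor. The energy consumption of the ISP is then determined as follows:

$$E_{\mathrm{ISP}} = P_{\mathrm{ISP,active}} \, T_{\mathrm{ISP}} + P_{\mathrm{ISP,idle}} \left( T_{\mathrm{exp}} + \frac{R_{\mathrm{frame}}}{f} \right), \tag{4}$$

where $P_{\mathrm{ISP,active}}$ and $P_{\mathrm{ISP,idle}}$ are the average power consumption of the ISP in the active and idle state, respectively.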
(3) Energy consumption of communication interface. The energy consumption of the communication interface is a linear function of the number of transferred frame pixels $R_{\mathrm{frame}}$ (lubanaDigitalFoveationEnergyAware2018), as follows:

$$E_{\mathrm{comm}} = k_{\mathrm{comm}} \, R_{\mathrm{frame}}, \tag{5}$$

where $k_{\mathrm{comm}}$ is a design-specific constant determined by the communication interface.
As can be seen from Eq. 2, 3, 4, and 5, the energy consumption of cameras highly depends on the sensor resolution. For MemX, the energy consumption of the high-resolution world camera is significantly higher than that of the eye camera. Therefore, the TVA network uses eye-tracking alone to first detect potential attention events, and only then turns on the world camera for accurate attention detection and salient content recording, thereby effectively minimizing the energy cost from the world camera.
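The dependence of camera energy on resolution can be made concrete with a small calculator that follows the verbal description of the three components: sensor energy as idle power over the exposure time plus active power over the readout time, ISP energy as active power over processing time plus idle power over sensing time, and interface energy proportional to transferred pixels. This is a sketch of one reading of that description; the function names are hypothetical and every constant below is a placeholder chosen for illustration, not a measured value.

```python
def sensor_energy(p_idle, p_active, t_exp, pixels, clock_hz):
    """Sensor energy per frame: idle while exposing, active while reading out."""
    t_active = pixels / clock_hz  # readout time grows linearly with resolution
    return p_idle * t_exp + p_active * t_active

def isp_energy(p_isp_active, p_isp_idle, t_isp, t_exp, pixels, clock_hz):
    """ISP energy per frame: active while processing, idle while sensing."""
    t_sense = t_exp + pixels / clock_hz
    return p_isp_active * t_isp + p_isp_idle * t_sense

def comm_energy(k_comm, pixels):
    """Interface energy per frame, linear in the number of transferred pixels."""
    return k_comm * pixels

def frame_energy(pixels, c):
    """Total imaging energy per frame for a dict of constants `c`."""
    return (sensor_energy(c["p_idle"], c["p_active"], c["t_exp"], pixels, c["clock_hz"])
            + isp_energy(c["p_isp_active"], c["p_isp_idle"], c["t_isp"],
                         c["t_exp"], pixels, c["clock_hz"])
            + comm_energy(c["k_comm"], pixels))

# Placeholder constants (watts, seconds, hertz, joules per pixel); illustration only.
CONSTANTS = {
    "p_idle": 0.05, "p_active": 0.20, "t_exp": 0.01, "clock_hz": 24e6,
    "p_isp_active": 0.15, "p_isp_idle": 0.02, "t_isp": 0.005, "k_comm": 1e-9,
}
world = frame_energy(1280 * 720, CONSTANTS)  # world-camera resolution
eye = frame_energy(320 * 240, CONSTANTS)     # eye-camera resolution
```

With identical constants, the only difference between the two calls is the pixel count, so the gap between `world` and `eye` isolates the resolution-dependent terms. In practice the ISP processing time would likely also scale with resolution; it is kept as a fixed parameter here for simplicity.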
4.3.2. Energy Consumption of MemX
MemX is equipped with the proposed TVA network. The operation of the TVA network consists of two stages. First, the TVA network continuously performs eye-tracking to detect potential attention events. This stage has high energy efficiency, thanks to the low data rate of the eye camera and computationally-efficient eye tracking. When a possible attention event is detected, the TVA network invokes the second stage, using a light-weight network to perform eye-scene feature fusion in order to finalize the attention decision. We denote the average power of the two stages as and , respectively.
The energy consumption of MemX is then formulated as follows:
$$E_{\text{MemX}} = T \left( P_{\text{eye}} + P_{\text{stage1}} \right) + T_{\text{fusion}} \left( P_{\text{world}} + P_{\text{stage2}} \right) + T_{\text{record}} \left( P_{\text{world}} + P_{\text{host}} \right)$$
where $T$ is the operation time of MemX, $P_{\text{stage1}}$ and $P_{\text{stage2}}$ are the average power of the two TVA stages, $P_{\text{eye}}$ and $P_{\text{world}}$ are the power consumption of the inward-facing eye camera and the forward-facing world camera, respectively, $T_{\text{fusion}}$ is the operation time of eye-scene feature fusion, $T_{\text{record}}$ is the operation time of high-resolution video recording when human visual attention events are detected, and $P_{\text{host}}$ is the power consumption of the host processor during high-resolution video recording, mostly contributed by video encoding and storage. Compared with the eye-tracking-alone stage, the second stage is more data and computation intensive, but it is only triggered when potential attention events are detected and stays off most of the time, thus effectively reducing energy cost. In addition, the TVA network uses a new light-weight network architecture, which is significantly more efficient than the existing VIS-based design, e.g., the MaskTrack R-CNN architecture (yang2019video). As shown in the experimental studies, the proposed TVA network significantly outperforms the VIS-based method in terms of system energy efficiency.
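To make the benefit of the two-stage design tangible, the following sketch compares a duty-cycled system against a record-everything baseline. It is a simplified reading of the description above, under stated assumptions: the always-on term pairs the eye camera with the first stage, the fusion and recording terms each include the world camera, and all names and numbers are hypothetical.

```python
def memx_energy(t_total, t_fusion, t_record, *, p_eye_cam, p_world_cam,
                p_stage1, p_stage2, p_host):
    """Energy (J) of a two-stage, attention-triggered pipeline (assumed grouping)."""
    always_on = t_total * (p_eye_cam + p_stage1)    # eye tracking never stops
    fusion = t_fusion * (p_world_cam + p_stage2)    # only on candidate events
    recording = t_record * (p_world_cam + p_host)   # only on confirmed events
    return always_on + fusion + recording


def record_everything_energy(t_total, *, p_world_cam, p_host):
    """Energy (J) when the world camera records for the whole session."""
    return t_total * (p_world_cam + p_host)


# Placeholder numbers: one hour of use, 10 min of fusion, 5 min of recording.
powers = dict(p_eye_cam=0.01, p_world_cam=0.30, p_stage1=0.005,
              p_stage2=0.10, p_host=0.40)
e_memx = memx_energy(3600, 600, 300, **powers)
e_base = record_everything_energy(3600, p_world_cam=0.30, p_host=0.40)
print(f"relative saving: {1 - e_memx / e_base:.1%}")
```

The saving grows as the fusion and recording times shrink relative to the total operation time, since only the low-power first stage scales with the full session length.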
This section evaluates MemX, the proposed attention-aware smart eyewear system for personalized moment auto-capture. We first detail the experimental settings, including user studies, data collection, and evaluation metrics. Then we present the evaluation results to quantify the performance and efficiency of MemX. The studies described in this section are conducted in a controlled lab environment. In-field pilot studies are presented in the next section.
5.1. Evaluation Methodology
5.1.1. Experimental Setup
The controlled lab setting is shown in Fig. 4. A group of participants wear MemX, sit in front of a computer, and watch a sequence of video clips. During each study, video clips are recorded through the field of view of MemX, which simultaneously tracks the participant’s visual attention. In total, 30 participants are recruited. A summary of the participants is provided below.
Gender: 10 (33.3%) female and 20 (66.7%) male,
Age: between 21 and 35,
Vision: normal vision (7, 23.3%), nearsighted up to 3.0 diopters (8, 26.7%), and nearsighted above 3.0 diopters (15, 50%).
5.1.2. Video Data Preparation
In this work, we adopt the YouTube-VIS dataset (https://youtube-vos.org/dataset/vis/) (yang2019video) as a “micro-benchmark in a controlled setting” to quantitatively evaluate the technical capability of MemX. In addition, we use pilot studies to further evaluate MemX in real-life scenarios.
Specifically, the YouTube-VIS dataset covers a wide range of complex real-world scenarios. For instance, the dataset consists of temporally dynamic scenes, such as sports, as well as spatially diverse scenes, such as multiple objects within the same scene. Therefore, using the YouTube-VIS dataset, we can design a wide range of interesting testing cases to evaluate the accuracy and efficiency of MemX for eye tracking and attention detection. For example, the participant is initially attracted by a white duck within a scene consisting of multiple animals, and then his or her attention shifts towards a different animal. In addition, as a widely used dataset, the YouTube-VIS dataset includes necessary object annotation information regarding object location, classification, and instance segmentation. Such annotations help us to accurately set up the ground truth in terms of the objects of interest during the experiments.
The YouTube-VIS dataset consists of a 40-category label set and 2,238 videos with released annotations. Each video snippet lasts 3 to 5 seconds with a frame rate of 30 fps, and every 5th frame is annotated for each video snippet. We first divide the 2,238 videos into four classes based on their content: Animal, People, Vehicle, and Others. As a result, the four classes contain 1,487, 437, 215, and 99 videos, respectively. We then randomly select videos in each class, resulting in 1,000 videos in total (670, 197, 97, and 36, respectively), which are used in this study. Fig. 5 shows example frames from the four video classes in the YouTube-VIS dataset.
Since each video snippet lasts 3 to 5 seconds in the YouTube-VIS dataset, we concatenate multiple video snippets to form videos with a duration of about 7-15 minutes (approximately 100 video snippets). The concatenated video is then provided to the participants to watch. We recruited 30 volunteers and collected their eye data while they watched the concatenated videos. To alleviate potential person-dependent bias, we ensure that each video snippet is watched by several participants (three in this work). With 30 participants each watching approximately 100 snippets and three viewers per snippet, this yields the 1,000 video snippets (30 × 100 / 3) selected from the YouTube-VIS dataset.
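The viewing plan can be checked with a short sketch that gives every snippet exactly three distinct viewers and every participant 100 snippets. The strided round-robin used here is an assumption chosen for simplicity; the paper only states that snippets are selected randomly, and the function name is hypothetical.

```python
import random
from collections import Counter


def assign_snippets(n_snippets=1000, n_participants=30, views_per_snippet=3, seed=0):
    """Return {participant: [snippet ids]} with balanced, non-repeating playlists."""
    rng = random.Random(seed)
    order = list(range(n_snippets))
    rng.shuffle(order)  # randomize which snippets land with which participant
    stride = n_participants // views_per_snippet  # 10: keeps the viewers distinct
    playlists = {p: [] for p in range(n_participants)}
    for rank, snippet in enumerate(order):
        for j in range(views_per_snippet):
            playlists[(rank + j * stride) % n_participants].append(snippet)
    return playlists


playlists = assign_snippets()
views = Counter(s for playlist in playlists.values() for s in playlist)
print(len(playlists[0]), min(views.values()), max(views.values()))
```

Each playlist can then be concatenated into one viewing session per participant.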
5.1.3. Eye Data Collection
MemX uses the inward-facing eye camera to capture eye video data. We calibrate MemX to correlate the world camera with the eye camera. The calibration method follows Pupil Capture (https://docs.pupil-labs.com/core/software/pupil-capture/#calibration) (kassner2014pupil). Each participant watches approximately 100 videos randomly selected from the four classes in our benchmark video data, which ensures that each video in our benchmark is watched by three participants. We pre-select the target attentive object in each video snippet as the user’s visual interest, and we guide participants to gaze at the pre-assigned object and track its motion. We then obtain the users’ eye video dataset, which is synchronized with the video dataset.
In the following experiments, the video dataset and the eye dataset are randomly divided into training set, test set, and validation set with a 70%:10%:20% ratio.
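A random split with these proportions can be sketched as follows. The helper name is hypothetical, and splitting at the level of individual samples (rather than by participant or by video) is an assumption; the text does not specify the unit of the split.

```python
import random


def split_indices(n_items, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle indices and cut them into (training, test, validation) lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = round(n_items * ratios[0])
    n_test = round(n_items * ratios[1])
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    val = indices[n_train + n_test:]  # remainder, so no sample is dropped
    return train, test, val


train, test, val = split_indices(1000)
print(len(train), len(test), len(val))
```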
5.1.4. Evaluation Metrics
We use precision and recall to evaluate the accuracy of human visual attention tracking of MemX, defined as follows:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
where $TP$ (true positive) denotes when MemX correctly identifies the attentive object of interest, $FN$ (false negative) denotes when MemX fails to identify the attentive object of interest, and $FP$ (false positive) denotes when MemX incorrectly identifies an object of interest that the user actually did not pay attention to. In summary, the higher the precision and recall, the better the accuracy of MemX. In addition, we also consider average precision (AP), which jointly considers the precision and recall measures, as follows.
$$AP = \sum_{n} \left( R_n - R_{n-1} \right) P_n$$
where $P_n$ and $R_n$ are the precision and recall at the $n$th threshold.
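These metrics are straightforward to compute from event counts. The sketch below assumes the common step-wise definition of average precision, in which each precision value is weighted by the increase in recall from the previous threshold; whether the evaluation uses exactly this variant is an assumption, and the function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(precisions, recalls):
    """Sum of precision weighted by recall increments; recalls must be ascending."""
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


print(precision_recall(tp=8, fp=2, fn=2))          # (0.8, 0.8)
print(average_precision([1.0, 0.5], [0.5, 1.0]))   # 0.75
```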
To evaluate the energy efficiency of MemX, we consider the following hardware settings. The MemX prototype is equipped with a Sony IMX 291 based eye camera and a Logitech B525 based world camera. The Logitech B525 camera supports a maximum resolution of 2.07 M-pixels. In addition, we target the Ambarella CV28 low-power computer vision SoC platform, which is equipped with advanced image processing, high-resolution video encoding, and CVflow computer vision processing capabilities. The power estimation is based on the model described in Section 4.
5.1.5. Training TVA Network
This work uses Adam (kingma2014adam) as the optimizer, with an initial learning rate of 0.01, which is decreased by 10% after every 30 epochs until 100 epochs have been completed. For the shallow network in TVA, we adopt the first two blocks of MobileNetV2 to extract the scene frame features, as MobileNetV2 is efficient enough for mobile devices (sandler2018mobilenetv2). We use the pre-trained weights provided in prior work (sandler2018mobilenetv2) as a starting point for training TVA. During training, we freeze these pre-trained weights during the first 30 epochs.
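The step schedule can be written as a one-line function. This sketch reads "decreased by 10%" literally, i.e., the rate is multiplied by 0.9 every 30 epochs; if a factor-of-10 decay was intended, `decay` would be 0.1 instead. The function name and parameter names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, decay=0.9, step=30):
    """Step decay: multiply the rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)


for epoch in (0, 29, 30, 60, 90, 99):
    print(epoch, round(learning_rate(epoch), 6))
```

In PyTorch, the same behavior is available through `torch.optim.lr_scheduler.StepLR` with `step_size=30` and `gamma=0.9`.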
To the best of our knowledge, there is no prior work that targets the same research challenge and is directly comparable to this work. For comparison, we consider the following three baselines.
(1) Eye-tracking-alone. This approach uses eye tracking alone to capture potential attentive visual content. Specifically, we first use eye tracking to detect the saccade-smooth pursuit transition, an indication of a potential visual attention shift. Then, we measure the regional gaze focus relative to the field of view during a smooth pursuit phase or a fixation phase for a time period. We experimentally define the occurrence of actual attention as when 90% of gaze points are located in a close region with area $0.05W \times 0.05H$ for a time period $T_{\text{th}}$, where $W$ and $H$ denote the width and height of the viewing scene frame, respectively, and 0.05 is the rescaling ratio. Considering that the Logitech B525 camera has a 69° diagonal field of view (www.logitech.com/assets/64667/b525-datasheet.ENG.pdf) and that the gaze angular error has a median value of approximately 3.45° (cazzato2020look), we estimate the rescaling ratio as 3.45/69 (i.e., 0.05) in this work. How the time period $T_{\text{th}}$ affects the performance of attention detection is evaluated in Section 5.2.2. The following experiments use the eye-tracking-alone method to establish the baseline accuracy of the proposed work.
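The region test in this baseline can be sketched as follows. The text fixes the region size (0.05 of the frame width and height) and the 90% requirement, but not where the region is placed; centering it on the median gaze position is an assumption made here, and the function name is hypothetical. The caller is expected to pass the gaze points collected over the chosen time period.

```python
from statistics import median


def is_attention(gaze_points, frame_w, frame_h, ratio=0.05, min_fraction=0.9):
    """True if enough gaze points fall inside a (ratio*W) x (ratio*H) box."""
    if not gaze_points:
        return False
    # Assumed placement: center the box on the median gaze position.
    cx = median(x for x, _ in gaze_points)
    cy = median(y for _, y in gaze_points)
    half_w, half_h = ratio * frame_w / 2, ratio * frame_h / 2
    inside = sum(abs(x - cx) <= half_w and abs(y - cy) <= half_h
                 for x, y in gaze_points)
    return inside / len(gaze_points) >= min_fraction


steady = [(960 + dx, 540) for dx in range(-4, 5)] + [(100, 100)]  # 9 of 10 inside
jumpy = [(960, 540)] * 8 + [(100, 100), (1800, 900)]              # 8 of 10 inside
print(is_attention(steady, 1920, 1080), is_attention(jumpy, 1920, 1080))
```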
(2) VIS-based method. This method uses eye tracking and VIS-based object detection to jointly capture potential attentive visual content, and the VIS-based object detection task adopts the MaskTrack R-CNN architecture (yang2019video). Different from the proposed TVA network, in the VIS-based method, eye tracking and VIS-based object detection are two independent, parallel tasks. As a result, VIS-based object detection is always on, which may introduce significant computation and energy overhead. As will be shown in the experimental results, the processing speed of the VIS-based method is only 0.6 frames per second, preventing it from practical adoption into wearables. The purpose of including this method in the experiments is to establish the baseline energy efficiency of the proposed work.
(3) Saliency-map-based method. The saliency detection method (Marat2009Modelling) aims to predict the salient regions that have potentially attracted the user’s attention, which is similar to the goal of the proposed TVA network. For a fair and complete comparison, we adopt and evaluate the saliency detection method in (Marat2009Modelling) as an additional baseline. We refer to it as the saliency-map-based method for simplicity.
5.2.1. Overall Performance
Table 1 shows the accuracy and energy efficiency comparison between MemX and the three baseline methods. The precision, recall, and average precision AP of each method are shown in columns 2, 3 and 4. The energy reduction of the proposed work and the eye-tracking-alone based baseline method relative to that of the VIS-based baseline method is shown in column 5.
As the table shows, the eye-tracking-alone method achieves low precision and recall: it cannot be used for accurate and stable attention tracking. In contrast, the proposed TVA-based method significantly improves the attention tracking precision and recall, with only a slight increase in energy consumption. In addition, the proposed TVA network achieves higher average precision than the two baseline methods.
Among these four methods, the VIS-based method achieves the highest precision and recall, which however comes with significant energy and computation penalties. Specifically, compared with the proposed TVA-based method, the energy consumption of the VIS-based method is 2.7× higher. Furthermore, using the Jetson Xavier NX platform, the processing speed of the VIS-based method is 0.6 frames per second, which is too slow for practical use. In contrast, the processing speed of the proposed TVA-based method exceeds 30 frames per second, offering at least a 50× speedup.
Compared with the VIS-based method, the saliency-map-based method suffers significant accuracy degradation. This is because the saliency-map-based method can only provide region-level information, instead of object-level information. An object often contains multiple regions, and a person’s visual attention on a specific object may shift over time across the multiple regions of the same object. As a result, the saliency-map-based method fails to accurately detect whether a person keeps gazing at the same object. The VIS-based method, by contrast, offers object-level information and thus better visual attention analysis accuracy. The proposed TVA-based method also outperforms the saliency-map-based method, because the TVA network is designed to provide object-level information in a light-weight architecture complemented by the region-level eye-tracking task. Experimental results demonstrate that, compared against the saliency-map-based method, the proposed TVA-based method achieves a 58.48% accuracy improvement (average precision) with a 45.64% gain in energy savings.
In summary, the three baseline methods suffer from either serious accuracy or energy efficiency limitations. In comparison, the proposed TVA network strikes a good balance between accuracy and efficiency for attention tracking.
| Method | Precision | Recall | Average Precision | Energy Savings |
|---|---|---|---|---|
| TVA-Based Method (Proposed) | 87.50% | 86.40% | 93.65% | 70.61% |
5.2.2. Usability Discussion
MemX uses attention tracking to control personalized moment auto-capture. As described in Section 4, the duration of the smooth pursuit phase indicates the level of interest. In MemX, the proposed TVA network uses the duration threshold $T_{\text{th}}$ to determine how long a potential smooth pursuit phase must be detected using eye tracking before triggering high-precision attention tracking and salient content recording. By adjusting the value of $T_{\text{th}}$, MemX controls the selectivity of moment recording, as well as the energy efficiency of the system.
Fig. 6(a) shows the impact of the duration threshold $T_{\text{th}}$ on the trigger rate of MemX. As shown in this figure, the trigger rate decreases as $T_{\text{th}}$ increases. Fig. 6(b) shows the impact of $T_{\text{th}}$ on the energy consumption of attention detection. As shown in this figure, the energy saving increases as $T_{\text{th}}$ increases. This is because, as $T_{\text{th}}$ increases, the eye-tracking component of the TVA network becomes more selective, and the data and computation intensive fusion stage is less likely to be triggered, thereby reducing the energy cost.
For the in-field pilot studies (described in the next section), the duration threshold $T_{\text{th}}$ is empirically set to 1 s. Based on the feedback from the pilot studies, this value strikes a good balance between selectivity of salient content recording and system energy efficiency.
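The effect of the duration parameter on the trigger rate can be illustrated with a small run-length counter. This is a sketch under assumptions: the per-frame boolean input (whether eye tracking currently reports a stable pursuit or fixation), the function name, and the synthetic trace are all hypothetical, and the real system may debounce differently.

```python
import math


def trigger_frames(stable_gaze, fps, hold_s):
    """Frame indices at which the second stage would be invoked.

    A trigger fires once per run, when the run of consecutive stable
    frames first reaches hold_s seconds.
    """
    needed = max(1, math.ceil(hold_s * fps))
    triggers, run = [], 0
    for i, stable in enumerate(stable_gaze):
        run = run + 1 if stable else 0
        if run == needed:
            triggers.append(i)
    return triggers


# Synthetic trace at 30 fps: a 29-frame pursuit, one break, then a 30-frame pursuit.
trace = [True] * 29 + [False] + [True] * 30
print(trigger_frames(trace, fps=30, hold_s=0.5))  # [14, 44]
print(trigger_frames(trace, fps=30, hold_s=1.0))  # [59]
```

Raising the hold time from 0.5 s to 1 s halves the number of triggers on this trace, mirroring the trend described above.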
5.2.3. Case Study
Next, we present two cases to offer further insights regarding why the proposed TVA-based method outperforms the eye-tracking-alone method for attention detection.
(1) A False Negative Case. As described in Section 1, eye-tracking-alone cannot provide accurate attention tracking, due to limited tracking resolution and inherently noisy eye movement patterns. Fig. 7 shows such a case. As we can see, the user’s attention focuses on a fast-moving vehicle, and gaze trajectory is jittering and noisy. Therefore, it is challenging for eye tracking to distinguish between saccade and smooth pursuit. In contrast, the TVA network leverages both regional information provided by eye tracking and object and semantic level information by video analytics, thus is able to accurately detect the fast-moving vehicle as the object of interest.
(2) A False Positive Case. Consider the frequent inattentive event in which a person is not paying attention to the surrounding environment and the gaze is fixed on one position. Eye tracking alone will mistakenly classify this case as a qualified attention event. In contrast, the TVA network is able to detect such a false positive case. Specifically, as shown in Fig. 8, the TVA network recognizes that the regional visual focus along the gaze trajectory does not match a specific object. Therefore, there is no salient visual object that draws the user’s attention.
6. Pilot Study
MemX can be applied to a wide range of potential application scenarios. We have conducted multi-round user interviews to explore potential usage scenarios. Suggested popular scenarios include daily lifelogging, traveling and sightseeing, sports logging, public event summary, etc. After examining these scenarios, we observe that different scenarios exhibit distinct characteristics. In particular, we have identified two key attributes, namely scene complexity and human attention dynamics to categorize individual scenarios as follows.
Scene complexity depends on the number of objects in each scene as well as their motion patterns. A simple scene might contain a few, mostly stationary, objects, while a complex scene might contain many fast-moving objects.
Human attention dynamics depends on how frequently human attention switches among objects. For example, in a basketball game, human attention is highly dynamic. During book reading, on the other hand, attention is mostly stationary.
6.1. Data Collection and Description
MemX aims to enable automated, personalized capture of interesting visual content within an energy-constrained wearable form factor. To evaluate the accuracy of personalized moment auto-capture and the energy efficiency of MemX, we select a set of potential usage scenarios, covering the four combinations of the two characteristics. We then conduct in-field pilot studies targeting these scenarios, 11 pilot studies in total (5 female and 6 male participants, aged between 21 and 40), covering a diverse range of real-life scenarios. During each study, the participant wears MemX to auto-capture visual moments of interest in the form of short video clips. For comparison purposes, the complete visual experience of each study is also recorded by the world camera of MemX (baseline video). In total, equipped with MemX, the 11 participants recorded approximately 245 minutes of baseline video, with about 37 minutes of video clips auto-captured as moments of interest. Fig. 9 shows some exemplary moment snapshots captured by MemX.
6.2. Accuracy of MemX in Pilot Study
6.2.1. Overall Accuracy
To evaluate the accuracy of MemX, after each user study, we asked the participant to review the baseline video and manually mark the moments reflecting his or her true interest during recording (ground truth), which are then compared against the video clips auto-captured by MemX. Table 2 summarizes the accuracy of MemX. As shown in Table 2, MemX can accurately detect and automatically capture 96.05% of visual moments of interest across the 11 pilot studies. In other words, the video clips auto-captured by MemX accurately reflect the participants’ moments of interest.
| 5 | Going for a walk | 9.01 | 1.14 | 87.30% | 5 | 0 | 85.37% |
| 7 | Playing game | 22.74 | 0.44 | 98.08% | 5 | 2 | 96.15% |
MemX analyzes the relationship between human attention and interest from the following three aspects simultaneously: (1) the temporal transition from saccade to smooth pursuit, which suggests a potential visual attention shift; (2) the gaze duration of following a moving target or fixating on a stationary target, which qualitatively measures the current interest level (in general, the longer the duration, the more interested the user might be); and (3) scene analysis and understanding, which helps to detect whether there are potentially interesting objects within the region of the gaze points. By jointly considering these three aspects, MemX is able to tackle special corner cases such as mind-wandering and driving scenarios. For example, the user may unconsciously gaze at a position for a long time when mind-wandering. In this case, MemX leverages scene understanding to help decide whether a potential target of interest exists or not. If not, MemX filters out those moments. In another case, the user’s attention may temporarily drift away and then quickly shift back if no interesting objects are detected. Such attention shifts can be captured and then discarded (due to their short duration) by MemX. We have included pilot studies for these corner cases. The pilot studies demonstrate that MemX can successfully filter out these corner cases and capture the true moments of interest.
6.2.2. Two Exemplary Cases
The following two exemplary cases provide further intuition. Fig. 13 and Fig. 13 show true moments of interest auto-captured by MemX and moments discarded by MemX, respectively. Specifically, Fig. 13 and Fig. 13 show the successive video frame sequences recorded by MemX with marked gaze positions (red circle). Fig. 13 and Fig. 13 illustrate the time series of the normalized gaze distance between two successive frames. For the true moments of interest shown in Fig. 13 and Fig. 13, we can observe that: (1) there is a saccade-smooth pursuit transition at approximately 2.40 seconds; (2) after that, most of the eye movements are smooth pursuit or fixation; and (3) the attention is located on a girl who is playing guitar. In the driving case, as shown in Fig. 13, the driver gazes at a fixed location in his field of view without any obvious target or object. As shown in Fig. 13, we can see some gaze shifts, such as at 17 seconds. However, those gaze shifts do not qualify as possible attention events because there is no single stable object that the user continuously focuses on. Thus, those moments are discarded by MemX.
6.3. Energy Efficiency of MemX in Pilot Study
Energy efficiency is essential to wearable devices. In MemX, the high-resolution video capture and content recording pipeline through the world camera is energy demanding. MemX significantly reduces its use by activating this pipeline only when potential visual attention and moments of interest are detected. As shown in Table 2, the pilot studies demonstrate that the time during which MemX triggers the world camera and records moments of interest accounts for a small percentage of the total usage time. Furthermore, even though the gaze tracking process in MemX is always on, this stage has high energy efficiency, thanks to the low data rate of the eye camera and the energy-efficient TVA network architecture design. The always-on eye-tracking process is approximately 51.98x more energy efficient than the high-resolution video capture and recording pipeline. In addition, the light-weight fusion network is approximately 43.75x more energy efficient than the VIS pipeline (zhao2021reinforcementlearningbased; yang2019video). Overall, the pilot study demonstrates that, compared with the record-everything baseline, MemX improves system energy efficiency by 86.36% on average. Based on the pilot studies, we estimate that, equipped with a 0.36 Wh battery (similar to Spectacles v2), MemX is able to support 8 hours of continuous operation after being fully charged, which can meet typical daily usage requirements without frequent charging.
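The battery estimate implies an average power budget that can be checked with simple arithmetic. The sketch below uses only the figures quoted above; the second function additionally assumes that the 86.36% average saving applies uniformly to the same battery and workload, which is a simplification, and both function names are hypothetical.

```python
def average_power_mw(battery_wh, runtime_h):
    """Average power draw (mW) that drains the battery in the given runtime."""
    return battery_wh / runtime_h * 1000.0


def runtime_without_savings_h(runtime_h, savings):
    """Runtime (h) on the same battery if the stated energy saving were absent."""
    return runtime_h * (1.0 - savings)


print(average_power_mw(0.36, 8))             # about 45 mW average budget
print(runtime_without_savings_h(8, 0.8636))  # about 1.09 h when recording everything
```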
After the pilot studies, we conducted a questionnaire survey with the 11 users who participated in the pilot studies to explore potential personal usage scenarios. All 11 participants agree that human visual attention may serve as an easy-to-use information filter and event detection mechanism for visual content gathering. Based on their feedback, the top three potential usage scenarios are sightseeing, lifelogging, and sports logging. Some of their comments are quoted below. “MemX is pretty cool when I do sightseeing. With MemX, I can record wonderful scenery and interesting encounters effortlessly.” “I can hand-freely record memorable moments, e.g., social gathering, during my daily life.” “MemX is perfect for sports, such as cycling, probably a better choice than GoPro.”
Besides personal usage cases, we would also like to explore other domains, e.g., industry, education, and gaming, to support on-site visual information gathering and remote communication and interaction. In particular, MemX, as a convenient human-computer interface, can be integrated with the fast-growing AR and VR technologies. Furthermore, we are in the process of developing video-editing software to automatically create high-quality video journals, e.g., vlogs, using the visual moments captured by MemX. Our goal is to enable a complete personal-interest-aware framework for visual moment auto-capture and content creation, fulfilling the long-awaited vision of a personalized visual Memex.
However, we have also identified several limitations of the current version of MemX in terms of video quality. In particular, motion compensation is a must-have feature for many eyewear usage scenarios. Furthermore, attention-aware smart glasses may introduce privacy concerns. First, using MemX, users can conduct scene recording in a more discreet fashion. Second, content captured by MemX discloses the user’s personal interest. Our future work will focus on further improving the video quality and, more importantly, on addressing the privacy concerns introduced by attention-aware personalized moment auto-capture devices.
This work aims to realize the decades-long vision of the personalized visual Memex, emphasizing the importance of content capture which must reflect personal interest to stay relevant and valuable. We have developed MemX, a biologically-inspired attention-aware eyewear system to enable auto-capture of personalized attentive visual content, and record moments of personal interest in the form of compact video snippets. MemX is equipped with a new temporal visual attention (TVA) network, which unifies eye-tracking and video analysis to enable accurate and computation-efficient human visual attention tracking and salient visual content analysis. MemX is evaluated using the YouTube-VIS dataset and 30 participants. Our results show that, compared with the eye-tracking-alone method, MemX significantly improves the attention tracking accuracy, while maintaining high system energy efficiency. In addition, we have conducted 11 in-field pilot studies with different potential usage scenarios, which demonstrate the feasibility and potential benefits of MemX. We envision that MemX can potentially benefit a wide range of personal visual content capture scenarios, such as sightseeing, lifelogging, travel experience recording, and event abstraction.
This work was supported in part by the National Natural Science Foundation of China under Grant No. 62090025 and 61932007 and in part by the National Science Foundation of the United States under grant CNS-2008151.