Augmented reality (AR) blends computer-generated perceptual information into the real world and enhances users’ visual perception of reality. To provide meaningful information, AR must understand the context of the physical world azuma1993tracking. Building on this content analysis and understanding, AR technology requires accurate 3D registration of virtual and real objects van2010survey. With the help of fast-growing computer vision and display technologies, AR is entering more and more application fields furht2011handbook; billinghurst2002collaborative. For example, by combining AR with real-time object recognition and tracking, users can quickly understand their surroundings. An intuitive case is searching for a target in a crowded room: a person unfamiliar with the surroundings may waste time, but if an AR system analyses the scene and provides auxiliary information, the user can quickly grasp the situation.
Virtual information can change the distribution of visual saliency, so the focus on the real world and the focus on the computer-generated virtual information can be complementary or adversarial, as shown in Fig. 1. A complementary relationship is required in some scenes (e.g., driving). In other applications, such as advertisements and visual shielding, the augmented contents are designed to dominate visual attention. Here we only consider exploratory behaviors driven by intrinsic motivations, which is also the scope of visual behaviors studied in visual saliency research. For scenarios that require complementary augmentation, the prediction of visual attention is important because integrating virtual content into the real world requires a correct and precise estimate of the user’s viewpoint. Eye trackers can be embedded into the AR system to capture the gaze direction. For real-time rendering, accurate prediction of the user’s head and eye movements can help the AR system adapt to future contexts in advance. However, existing saliency prediction models do not distinguish between the two situations, which leads to inaccurate predictions.
Mixing in virtual content clearly alters the distribution of visual attention, so the relationship between the augmentations and real-world perception should be considered when predicting visual saliency. The design philosophy of complementary augmentation is entirely different from that of adversarial augmentation, yet existing saliency prediction models do not distinguish between the two relationships and treat every case as adversarial. For tasks that require complementary AR assistance, the virtual content is in some cases coherent with the user’s interest, i.e., the auxiliary information supports the user’s attention. Sometimes the visual assistance can reduce redundant attention: the auxiliary information is precise enough to attract and maintain attention and helps the user concentrate on the targets of interest. However, the AR system might over-predict and annotate more objects than the targets of interest, in which case attention is dispersed. Conversely, an insufficient prediction might miss an important target of interest, in which case users will readily shift their fixations to targets without annotations. In addition, the augmentations follow their targets under the constraint of 3D registration and therefore carry rich motion information. However, there is temporal redundancy when the augmentation is not informative enough.
Another open problem is how to capture the augmented perception of human eyes and store it in the format of immersive video. As far as we know, there is still no good solution. To make the study reproducible, we employ virtual reality (VR) to simulate real-world environments, because VR provides an immersive media experience in which viewers get a sense of presence. We simulate see-through videos by annotating immersive videos, which are regarded as real-world scenes. To investigate the visual saliency distribution in an AR system, we make a simple implementation that includes the recognition and tracking of objects. Each detected object is marked by a named bounding box, which provides the object’s location and type name. Although this implementation does not cover all AR application scenarios, it contains all the basic components that a complementary AR application needs and is a good starting point for studying visual attention in AR.
In this paper, we introduce an ARVR saliency dataset. Fig. 2 shows some examples from our dataset. Considering the sense of presence in VR, we employ VR to simulate real-world environments. We implement the blending of virtual contents with auxiliary information by using named bounding boxes to label and follow the salient objects. We display the augmented videos in a VR head-mounted display (HMD) with an embedded eye tracker and record the head and eye movements while users experience the simulated AR. The original videos are evaluated in the VR HMD as well. We also present a method to generate saliency maps from head and eye movements and to generate saliency videos in which each frame is the saliency map of the corresponding frame in the stimulus video. In addition, we analyse how augmented virtual contents affect the distribution of visual attention.
The rest of this paper is organized as follows. In Section 2, we briefly review related work on saliency prediction models for panoramas. Section 3 describes the construction of our ARVR dataset. Section 4 describes the process of generating saliency videos. Section 5 presents the evaluation and analysis of our dataset. Finally, concluding remarks and future work are given in Section 6.
2 Related Works
Visual attention for immersive images and videos has been studied in recent years. Several databases containing head and eye movement data have been constructed for understanding visual attention in immersive contents and for benchmarking saliency models. David et al. david2018dataset constructed a database of immersive videos that includes 19 videos and the corresponding head and eye movements. In duan2018perceptual, 320 panoramas with distortions were included, and the head and eye movement data and the quality scores from viewers were recorded and reported.
Several methods for saliency prediction of panoramic contents have also been developed in recent years. These methods extract low-level and high-level visual cues to predict saliency. Zhu et al. zhu2018prediction proposed a model to predict head movement, head-eye movement and scanpath, together with a framework that extends existing saliency models designed for 2D images to panoramas. Cheng et al. cheng2018cube proposed a weakly-supervised spatial-temporal network to predict the saliency of videos, in which cube padding was designed to replace zero padding.
However, there is still no saliency database or saliency prediction model for AR. To bridge the gap between research on visual attention in VR and in AR, we build a saliency dataset for AR and VR videos. Head and eye movements are recorded by a VR HMD with an eye tracker. Based on these data, we generate saliency videos and compare the AR and VR saliency data.
3 Dataset Construction
3.1 Omnidirectional Video Selection
Omnidirectional videos can be classified by scene, e.g., indoor rooms, natural landscapes, people, etc. In addition, some videos contain rich texture information, while others contain less. We therefore divide the videos according to scene type and spatial complexity. Our dataset contains 22 outdoor scenes, 22 indoor scenes and 6 computer-generated scenes; 26 videos contain rich texture information and 24 contain less. Every video contains at least one person. The chosen videos are downloaded from the video site “YouTube”, and a document describing the source of each stimulus will accompany our dataset. All of the videos are in equirectangular format and share the same resolution. Each video is no longer than 30 seconds.
3.2 Video Annotations
Named bounding boxes are used to label the salient objects. We annotate following the principle that the most salient objects are selected. Each bounding box is designed to follow its object to simulate 3D registration in AR. Although object recognition and tracking algorithms exist, they do not work on equirectangular videos. To ensure quality, we use “Adobe Premiere Pro” and annotate manually on the sphere, frame by frame. The ARVR dataset consists of 50 original omnidirectional videos and 50 augmented videos.
3.3 Subjective Experiment Methodology
The ‘HTC VIVE PRO EYE’ is employed in our experiments viveproeye. This HMD can track eye movements precisely and has a wide field of view of 110 degrees, a refresh rate of 90 Hz and a high per-eye resolution. In addition, the headphones of the HMD can be used to play the audio.
Twenty subjects participate in our experiment. During the experiments, viewers are seated on a swivel chair for convenient exploration. The eye tracking module is carefully calibrated for each subject before the head and eye movements are recorded. The omnidirectional videos are displayed in sequence. Each video is no longer than 30 seconds, and a 5-second black screen is inserted between consecutive videos as a rest so that viewers are ready for the next video. All subjects are asked to look around in this step so that natural-viewing visual attention data are obtained.
There are two sets of stimuli in our experiments: the original set, which simulates real-world surroundings, and the augmented set, which simulates the use of an AR system. We record the data for the AR set and the VR set separately to eliminate the influence of viewing the same content twice.
3.4 Dataset Sample Usage
Our ARVR dataset involves the head and eye movements for both the original and augmented videos. The ARVR dataset will be released to facilitate further research.
Visual attention analysis. Our dataset can be employed to analyse the visual attention distribution of viewers experiencing AR or VR. It can also benefit the development of saliency prediction methods for AR and VR and can be used to compare visual attention between AR and VR.
Blending quality analysis. Our dataset is designed for the complementary relationship between real-world contents and augmented contents. It can therefore be employed to predict how the augmented information alters visual attention, which can partly reveal the quality of blending.
4 Saliency Video Generation
4.1 Fixation Determination
The sampling rate is higher than the saccade rate, so multiple samples are recorded during a saccade. We should discard the samples on the path of a saccade, which are uninformative, but preserve the samples of smooth pursuit eye movements, which occur when the eye follows a moving object. In addition, noise may be introduced by external disturbances and by inappropriate use of the device.
To remove the samples on the path of a saccade while preserving smooth pursuit, we measure, for each viewer, the mean and standard deviation of the data between two frames. We preserve the samples that lie within three standard deviations of the mean. The same operations are applied to filter the head movement data. Both eye and head movements are reactions to visual stimuli, but head movement is the more inert of the two.
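The three-standard-deviation rule above can be sketched as follows. This is a minimal illustration, not our recording pipeline: it assumes that gaze samples are given as longitude and latitude in radians and that the thresholded quantity is the angular step between consecutive samples, which the text does not specify. The names `to_unit_vectors` and `three_sigma_keep_mask` are hypothetical.

```python
import numpy as np


def to_unit_vectors(lon, lat):
    """Convert longitude/latitude in radians to 3D unit vectors."""
    return np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)


def three_sigma_keep_mask(lon, lat, k=3.0):
    """Return a boolean mask of the samples to keep.

    Assumption: the thresholded quantity is the great-circle angle between
    consecutive samples; the paper only says "the data between two frames".
    """
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    v = to_unit_vectors(lon, lat)
    # Angle between each sample and the previous one.
    cos_step = np.clip(np.sum(v[1:] * v[:-1], axis=1), -1.0, 1.0)
    step = np.arccos(cos_step)
    mean, std = step.mean(), step.std()
    keep = np.ones(lon.shape[0], dtype=bool)  # the first sample has no step
    keep[1:] = np.abs(step - mean) <= k * std
    return keep


# Hypothetical trace: slow pursuit with one saccade-like jump at sample 100.
lon = np.linspace(0.0, 0.2, 200)
lon[100:] += 1.0
lat = np.zeros_like(lon)
keep = three_sigma_keep_mask(lon, lat)
print(keep.sum(), "of", keep.size, "samples kept")
```

In this toy trace only the sample that lands after the large jump is discarded, while the slowly drifting pursuit samples are all preserved.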
4.2 Saliency Map Generation
After the fixations are determined, we generate the saliency map for each frame of the video. To identify continuous regions of interest, a Gaussian filter can be applied to the fixations mit-saliency-benchmark. Some previous works apply the Gaussian filter directly to the fixation map in equirectangular format nguyen2019saliency. However, this is inappropriate because the equirectangular projection has stretching distortions near the poles. To be precise, the Gaussian filter should be applied on the sphere to offset the distortions of the equirectangular map. By applying the Gaussian filter on the sphere, we generate saliency videos for both head and eye movements for each stimulus video. We also generate one blended saliency map per video by blending and normalizing the saliency maps of all frames.
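One way to apply the Gaussian on the sphere is to weight every equirectangular pixel by its great-circle distance to each fixation, rather than by its pixel distance. The sketch below illustrates this idea under assumptions that are not stated in the text: the kernel width `sigma_deg`, the output size, and normalization to unit sum are placeholders, and `sphere_saliency` is a hypothetical name.

```python
import numpy as np


def to_unit(lon, lat):
    """Longitude/latitude in radians to unit vectors (last axis has size 3)."""
    return np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)


def sphere_saliency(fix_lon, fix_lat, height=90, width=180, sigma_deg=5.0):
    """Equirectangular saliency map from fixations, smoothed on the sphere."""
    # Pixel-centre coordinates: latitude from +pi/2 (top) to -pi/2 (bottom).
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    lon = ((np.arange(width) + 0.5) / width - 0.5) * 2.0 * np.pi
    lon_grid, lat_grid = np.meshgrid(lon, lat)
    pixels = to_unit(lon_grid, lat_grid)                      # (H, W, 3)
    fixations = to_unit(np.atleast_1d(np.asarray(fix_lon, dtype=float)),
                        np.atleast_1d(np.asarray(fix_lat, dtype=float)))
    sigma = np.deg2rad(sigma_deg)                             # assumed width
    saliency = np.zeros((height, width))
    for f in fixations:
        # Great-circle angle between every pixel and this fixation.
        angle = np.arccos(np.clip(pixels @ f, -1.0, 1.0))
        saliency += np.exp(-angle ** 2 / (2.0 * sigma ** 2))
    total = saliency.sum()
    return saliency / total if total > 0 else saliency


equator_map = sphere_saliency([0.0], [0.0])
polar_map = sphere_saliency([0.0], [np.deg2rad(80.0)])
```

A fixation near a pole produces a blob that is much wider in pixel columns than one at the equator, which is the stretching that a filter applied directly on the equirectangular map would fail to reproduce.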
5 Dataset Evaluation and Analytics
Our dataset consists of the head and eye movement data and the generated saliency videos for both head and eye movements, in which each frame is the saliency map of the corresponding frame of the stimulus video. Adversarial augmentations are designed to dominate users’ visual attention, whereas complementary augmentations are designed for assistance. In our dataset, object recognition and tracking are provided as auxiliary information, which can be classified as complementary augmentation.
To analyse the distribution of visual attention, we choose two videos, “car race” and “bicycle training”. In Fig. 3, the blended saliency map is calculated to estimate the overall distribution of visual attention. We also extract several frames from the two videos, together with the local viewport image in each frame for better visualization, and the corresponding saliency maps for the local viewport images. The results show that the eye movements are well captured by the eye tracker and motion sensors in the HMD. To analyse how the virtual contents of auxiliary information change the distribution of visual attention, Fig. 3 also compares the saliency maps of the original video and the augmented video. From the blended saliency maps, we can observe that the distribution of visual attention for the original image is similar to that for the image with auxiliary information. From the local viewport images and the corresponding saliency maps, we can observe that the auxiliary augmentation attracts visual attention to the object itself. We calculate the similarity of the blended saliency maps between the original video and the augmented video. The commonly used metrics judd2012benchmark; gutierrez2018toolbox, namely area under curve (AUC), correlation coefficient (CC), normalized scanpath saliency (NSS) and Kullback-Leibler (KL) divergence, are employed to measure the similarity (AUC and NSS are averaged using two sets of fixations). From Table 1 we can observe that the auxiliary augmentations have limited influence on the distribution of blended saliency.
We also calculate the velocity of head movements. Fig. 4 shows the distribution of velocity over different directions and magnitudes, from which we can see that the eye movements for augmented videos are more concentrated and regular (coherent with the movements of objects), which means that auxiliary augmentations tend to maintain smooth pursuit eye movements.
In Fig. 5, we classify video frames into two groups according to the amount of motion information in the bounded areas. The top-percent areas of the saliency map are extracted for each video frame, and we count the number of frames in which the area within the bounding box intersects the top-percent area. For comparison, the results for the original videos at the corresponding areas are also reported. Fig. 5 shows that viewers are more likely to fixate on the areas within the bounding box, and the effect is more pronounced when the video sequence contains small motions in the bounded areas. From these results we can further conclude that some auxiliary augmentations, like the named bounding box in our dataset, are redundant in the temporal domain, which means that visual attention is captured and maintained on the object itself. We believe that accurate 3D registration can mitigate the sense of violation and increase the temporal redundancy.
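The frame-counting step behind Fig. 5 can be illustrated with a short sketch. It is a simplification under stated assumptions: the bounding box is treated as a boolean pixel mask on the equirectangular frame (the annotations themselves were made on the sphere), the `percent` value is a placeholder because the threshold is not given in the text, and both function names are hypothetical.

```python
import numpy as np


def top_percent_mask(saliency, percent):
    """Mask of the `percent` most salient pixels, ignoring zero-valued pixels."""
    threshold = np.percentile(saliency, 100.0 - percent)
    return (saliency >= threshold) & (saliency > 0)


def count_overlapping_frames(saliency_frames, box_masks, percent=5.0):
    """Count frames whose top-percent saliency area intersects the box mask."""
    count = 0
    for saliency, box in zip(saliency_frames, box_masks):
        if np.any(top_percent_mask(saliency, percent) & box):
            count += 1
    return count


# Toy example: three 10x10 frames, one fixed box, one salient pixel per frame.
box = np.zeros((10, 10), dtype=bool)
box[2:5, 2:5] = True
frames = []
for peak in [(3, 3), (8, 8), (2, 4)]:
    frame = np.zeros((10, 10))
    frame[peak] = 1.0
    frames.append(frame)

n_overlap = count_overlapping_frames(frames, [box] * 3)
print(n_overlap, "of", len(frames), "frames overlap the bounding box")
```

Running the same count on the saliency maps of the original videos, with the box masks taken from the augmented versions, would give the comparison at the corresponding areas.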
SalGAN360 chao2018salgan360, a top-performing saliency prediction model, is employed to predict the saliency, as shown in Fig. 3. The prediction results for “bicycle training” are consistent with the ground truth; we think the complex background produces luminance and contrast masking effects. However, the prediction results for “car race” are more sensitive to the augmented contents. More of the predicted visual attention is concentrated on the label, whereas the ground truth shows that the bounding box and label have limited influence on the visual attention distribution. We also employ the model proposed in zhu2018prediction to predict the blended head-eye saliency (saliency frames are blended and normalized) for the augmented videos. Table 2 shows that, although the AUC is high, the model achieves only medium performance on CC, NSS and KL. We believe that the medium performance of these models arises because they do not consider the relationship between the augmentation and real-world perception, or the influence of the augmentation on the visual saliency distribution. Therefore, the prediction of visual saliency should account for the influence of augmentations according to the relationship between augmentations and reality.
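For reference, the CC, KL and NSS scores used in this section follow standard definitions, which can be sketched as below. These are the plain image-plane versions; they do not include any latitude weighting for equirectangular maps, and they are not necessarily the exact implementation in the cited toolboxes. The function names are illustrative.

```python
import numpy as np

EPS = np.finfo(float).eps


def cc(map_a, map_b):
    """Pearson correlation coefficient between two saliency maps."""
    a = (map_a - map_a.mean()) / map_a.std()
    b = (map_b - map_b.mean()) / map_b.std()
    return float(np.mean(a * b))


def kl_divergence(reference, other):
    """KL divergence of `other` from `reference`, both normalized to sum to 1."""
    p = reference / reference.sum()
    q = other / other.sum()
    return float(np.sum(p * np.log(EPS + p / (q + EPS))))


def nss(saliency, fixation_mask):
    """Mean z-scored saliency value at the fixated pixels."""
    z = (saliency - saliency.mean()) / saliency.std()
    return float(z[fixation_mask].mean())


# Toy maps standing in for blended saliency maps.
rng = np.random.default_rng(0)
map_a = rng.random((20, 40))
map_b = rng.random((20, 40))
fixations = np.zeros((20, 40), dtype=bool)
fixations[np.unravel_index(np.argmax(map_a), map_a.shape)] = True

print("CC :", cc(map_a, map_b))
print("KL :", kl_divergence(map_a, map_b))
print("NSS:", nss(map_a, fixations))
```

Higher CC and NSS indicate closer agreement, while a lower KL divergence indicates closer agreement, which is why a model can score well on AUC and still look mediocre on the other three.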
6 Conclusion and Future Work
Accurate prediction of users’ head and eye movements in an AR system is important because it helps the AR system adapt to future contexts in advance. We construct an ARVR saliency dataset with 100 diverse videos evaluated by 20 people. The saliency annotations of head and eye movements for both original and augmented videos are collected and constitute the ARVR dataset. Experimental results show that well-designed auxiliary augmentations capture and maintain visual attention differently from adversarial augmentations. Therefore, in the prediction of visual attention, it is important to distinguish the relationship between the augmentation and real-world perception instead of treating every case as adversarial. Apart from the complementary relationship, there are still many cases that mix adversarial and complementary conditions. In the future, we will extend the dataset with more video stimuli to cover more conditions. We will also design and propose a model to predict head and eye movements.
- (1) Ronald Azuma, “Tracking requirements for augmented reality,” Communications of the ACM, vol. 36, no. 7, pp. 50–52, 1993.
- (2) DWF Van Krevelen and Ronald Poelman, “A survey of augmented reality technologies, applications and limitations,” International journal of virtual reality, vol. 9, no. 2, pp. 1–20, 2010.
- (3) Borko Furht, Handbook of augmented reality, Springer Science & Business Media, 2011.
- (4) Mark Billinghurst and Hirokazu Kato, “Collaborative augmented reality,” Communications of the ACM, vol. 45, no. 7, pp. 64–70, 2002.
- (5) Erwan J David, Jesús Gutiérrez, Antoine Coutrot, Matthieu Perreira Da Silva, and Patrick Le Callet, “A dataset of head and eye movements for 360 videos,” in Proceedings of the 9th ACM Multimedia Systems Conference. ACM, 2018, pp. 432–437.
- (6) Huiyu Duan, Guangtao Zhai, Xiongkuo Min, Yucheng Zhu, Yi Fang, and Xiaokang Yang, “Perceptual quality assessment of omnidirectional images,” in IEEE International Symposium on Circuits and Systems. IEEE, 2018, pp. 1–5.
- (7) Yucheng Zhu, Guangtao Zhai, and Xiongkuo Min, “The prediction of head and eye movement for 360 degree images,” Signal Processing: Image Communication, vol. 69, pp. 15–25, 2018.
- (8) Hsien-Tzu Cheng, Chun-Hung Chao, Jin-Dong Dong, Hao-Kai Wen, Tyng-Luh Liu, and Min Sun, “Cube padding for weakly-supervised saliency prediction in 360 videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1420–1429.
- (9) “VIVE PRO EYE: HMD with Precise Eye Tracking,” https://enterprise.vive.com/us/product/vive-pro-eye/.
- (10) Zoya Bylinskii, Tilke Judd, Ali Borji, Laurent Itti, Frédo Durand, Aude Oliva, and Antonio Torralba, “MIT saliency benchmark.”
- (11) Anh Nguyen and Zhisheng Yan, “A saliency dataset for 360-degree videos,” in Proceedings of the 10th ACM Multimedia Systems Conference. ACM, 2019, pp. 279–284.
- (12) Fang-Yi Chao, Lu Zhang, Wassim Hamidouche, and Olivier Deforges, “SalGAN360: visual saliency prediction on 360 degree images with generative adversarial networks,” in IEEE International Conference on Multimedia & Expo Workshops. IEEE, 2018, pp. 1–4.
- (13) Tilke Judd, Frédo Durand, and Antonio Torralba, “A benchmark of computational models of saliency to predict human fixations,” 2012.
- (14) Jesús Gutiérrez, Erwan David, Yashas Rai, and Patrick Le Callet, “Toolbox and dataset for the development of saliency and scanpath models for omnidirectional/360 still images,” Signal Processing: Image Communication, vol. 69, pp. 35–42, 2018.