Monoscopic vs. Stereoscopic Views and Display Types in the Teleoperation of Unmanned Ground Vehicles for Object Avoidance

07/12/2021 ∙ by Yiming Luo, et al. ∙ Xi'an Jiaotong-Liverpool University

Virtual reality (VR) head-mounted displays (HMD) have recently been used to provide an immersive, first-person view in real time for manipulating remotely controlled unmanned ground vehicles (UGV). The teleoperation of UGV can be challenging for operators when it is done in real time. One major challenge is for operators to quickly perceive the distance of objects around the UGV while it is moving. In this research, we explore the use of monoscopic and stereoscopic views and display types (immersive and non-immersive VR) for operating vehicles remotely. We conducted two user studies to explore their feasibility and advantages. Results show significantly better performance when using an immersive display with a stereoscopic view for dynamic, real-time navigation tasks that require avoiding both moving and static obstacles. The use of a stereoscopic view in an immersive display in particular improved user performance and led to better usability.




I Introduction

The teleoperation of unmanned ground vehicles (UGV) or unmanned aerial vehicles (UAV) is challenging for operators because it is not easy to have a good understanding of the terrain and surrounding objects given that they are away from the actual environment [25]. With advances in real-time image transmission, issues such as low image quality, long delays in image transmission, and unstable signal transmission have been improved significantly, making their teleoperation stable and practical in a wide range of situations [12]. These improvements have allowed for the exploration of different image views that can be shown on different display types to improve the ability of operators to control UGV in a precise and efficient manner.

Current displays used in the teleoperation of UGV are primarily traditional 2D screens (such as those on mobile phones, tablets, and typical desktop computers). More recently, immersive displays, such as virtual reality head-mounted displays (VR HMD), have started to make their way into UGV teleoperation. VR HMD allow a higher level of immersion and presence (that is, the feeling of being “there” in the environment) [19], which can enhance performance [34, 14]. The disadvantage of VR HMD is that they have small displays and their use may increase the chance of operators experiencing motion sickness, a common issue with HMD that is not present in normal displays [18, 26].

Whether the display is a traditional screen or a VR HMD, it is not easy for operators to gauge the distance and depth of objects in real time from the video images shown, because, unlike being in the real environment, the images often do not provide enough stereoscopic visual information.

Depending on the placement of the camera in the drone, the operator can see images based on the first-person view (FPV) or the third-person view (TPV) [25, 8, 20]. Mainstream drones in the market primarily supply an FPV perspective. Based on cost factors and the stability of image transmission, most drones use a single camera, typically of high quality, to transmit monoscopic images to a 2D non-immersive screen display. In the absence of stereoscopic cues, these 2D images often do not provide enough visual information for operators to perceive clearly and accurately the distance and depth of obstacles surrounding the drone [25, 3]. The situation can be worse when the objects are moving while the drone is moving towards them.

Using binocular cameras could provide additional information to help operators gauge the distance of objects. The use of binocular cameras has been studied mainly on UAV but much less on UGV [25, 3, 7, 11]. At low altitudes but high speed, a UAV with an on-board computer that can provide stereoscopic FPV rendered in a VR HMD could enhance the control of the flying drone [25]. In our research, we explore the use of binocular cameras in UGV, which tend to go slower than UAV and could have more objects on their path, and whether they can enhance teleoperators’ stereoscopic perception and their performance in obstacle avoidance tasks in real time. In addition, we want to compare the usability and performance in both normal displays and immersive VR HMD.

In this paper, we first review related work about viewing modes and display types as well as obstacle avoidance tasks and their performance metrics. We then present two user studies. The first study explores four viewing modes in a normal 2D display and VR HMD (see Fig. 1). We compared these modes based on participants’ subjective feedback and objective results in a pre-designed obstacle avoidance experiment. The version with a stereoscopic view in VR had the best performance, but participants still faced challenges with moving objects. These results led us to modify this viewing mode and run a second study comparing the new mode against it. The results show that the new fifth mode could lead to improved overall performance. The results of these two studies can inform the design of future teleoperator-UGV interaction that offers better performance and user experience.

II Related Work

In this section, we first introduce previous work on viewing modes and display types for UGV/UAV. Then we present the tasks for testing robot performance and the metrics for their evaluation.

II-A Viewing Modes and Display Types

Viewing modes (e.g., 2D vs. 3D) and display types (VR vs. 2D displays) can affect users’ levels of immersion, flow, and performance [34, 35, 16]. Most teleoperated unmanned systems use a monocular camera that displays video streams captured from the camera attached to a robot. The camera supplies images for operators to see and use when deciding where to go. Because the camera tends to have a small field-of-view (FOV), the video images can only provide limited visual details, which in turn can lead to lower performance on tasks such as target detection and identification [4]. This requires operators to put in extra effort to survey the environment (e.g., by manipulating and/or rotating the robot to get different views) [2], an inefficient process that can increase mental and visual load and the feeling of motion sickness. Also, important distance cues may not be provided and depth perception may be degraded when the FOV is restricted [33].

When driving a UGV, operators have more difficulty judging the speed of the vehicle, time-to-collision, perception of objects, locations of obstacles, and the start of a sharp curve [30]. The level of difficulty can increase if the objects around the drone are moving rather than static. However, simply increasing the FOV may lead to other issues. For example, in a typical 2D display, this could mean that users need to scan a wide view of the environment, which requires moving their head left and right frequently. In addition, a larger FOV may cause additional usability issues with VR HMD, as rotational head movements might induce greater motion sickness [18, 26]. The challenge is to provide additional visual cues without significantly changing the FOV and causing large head movements, especially rotational ones.

3D stereoscopic (3DS) views have been found to be better than traditional non-stereo (2D) views on manipulation tasks with either virtual or real objects [14, 6]. In [25], the authors explored immersive 3DS views for flying drones, which enabled more accurate depth perception and led to better teleoperation and navigation performance. Stereoscopic FPV presents significant advantages over monocular FPV [5, 1]. The distance between the binocular cameras used to achieve 3DS also has an important effect. The best performance was achieved when the inter-camera distance (2–3 cm) was less than the inter-ocular distance (6 cm) [10]. However, artificially induced binocular stereo-vision may increase motion sickness and perceived stress [24]. Latency in image transmission is also associated with motion sickness [28, 9]. Low image quality caused by reduced frames per second (fps), reduced display resolution (pixels per frame), or a lower gray-scale (number of levels of brightness or bits per frame) [22] can also increase motion sickness. Although motion sickness can be an issue, when it is possible to have high-resolution images without transmission delays [2], the depth information and stereoscopic perception provided by binocular cameras can significantly improve the operator’s teleoperation performance for obstacle avoidance and precise maneuverability.

Distance underestimation and overestimation can occur when objects are viewed in normal displays and especially in immersive VR environments (VE) [10, 32]. In experiments with teleoperated UGV, it has been found that operators underestimated the distances from obstacles and landmarks [13]. As such, given the benefits of 3DS views in VR, this research aims to explore whether the combination of 3DS and VR could support operators to gauge distance information during teleoperation manipulation and improve their performance in obstacle avoidance tasks.

II-B Tasks and Metrics

Teleoperation tasks that require real-time obstacle avoidance have been researched in the context of human-drone interaction. For example, in [17, 23], researchers explored a remotely controlled robot that has been integrated with a laser sensor and a monocular camera to capture distance information. This combination provided operators with images displayed in a virtual UI. The operator was asked to drive this robot through a series of mazes and avoid obstacles. Its performance was measured using a set of metrics [29]: (a) Obstacle encounter, the number of collisions of the robot against obstacles; (b) Efficiency, the time to complete tasks; and (c) Subjective ratings, usability issues with controls and the interface. We followed a similar approach and used these metrics in this research.

In addition, it has been shown that avoiding collisions with moving objects requires more depth information about those objects [15]. As such, the maze used in our research (see Fig. 3) incorporated both static and dynamic objects of different types. Their combination allows us to explore in detail the comparative performance of teleoperating a UGV based on different viewing modes and display types.

III User Study 1

To explore how monoscopic and stereoscopic views and display types affect the distance perception of users, we conducted a 4-condition, 8-person within-subjects study. The main task consisted of participants driving a UGV in real time using a game controller through a maze that had both static and moving obstacles.

Fig. 2: (a) The prototype used in this research. It contains a UGV (DJI RoboMaster S1) and a FPV system which was built with a binocular camera, an on-board mini desktop, and a power bank. (b) Implementation of 3DS view.
Fig. 3: Overview of the maze. (a) The distribution of the different tasks in the maze (T1, T2, T3, T4, and T5); (b) A picture of the actual maze used in the study.
Fig. 4: Overview of the components of the control and image transmission system.

III-A Prototype

Fig. 2a shows the prototype we developed to conduct this research. A DJI RoboMaster S1 was used as the mobile UGV. The transmission system consisted of a binocular camera, a mini desktop, and a power bank. The first-person view (FPV) images from the camera were transmitted to a PC and rendered on a traditional 27-inch 4K monitor or a VR HMD, which in our case was an HTC Vive Pro Eye.

The binocular camera consists of two monocular cameras, each with a resolution of 1280 × 720, a 71° FOV, and distortion of less than ± 0.3 % [31]. Together they provide a stereoscopic view with a combined resolution of 2560 × 720. The camera was installed on a mounting platform with inter-pupillary distance (IPD) adjustment over a range of about 26 mm to 84 mm. The average human IPD is 65 mm, with a range of 54 mm to 75 mm for young adults aged 16 to 24 years [21]. Therefore, the IPD was set to 65 mm in the stereoscopic FPV condition. The right camera was moved to a central position as the input source for the monoscopic conditions. The FPV system used WebRTC as the video streaming protocol, which has a latency of 214 ± 7 ms at 2560 × 720 resolution (1280 × 720 per eye) and 30 FPS over a Wi-Fi 6 connection [27]. A mini desktop with 16 GB RAM, an i7-10710U CPU, and Intel UHD Graphics was installed on the robot as the WebRTC server and was powered by a 100 W power bank. The HMD and monitor were connected to a desktop with 16 GB RAM, an i7-9700K CPU, and a GeForce RTX 2080 Ti dedicated GPU.
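As a minimal sketch of how a 2560 × 720 stereo frame could be divided into two 1280 × 720 per-eye images, the Python snippet below splits each pixel row down the middle. It assumes a side-by-side layout with the left camera in the left half, which the text does not state explicitly. The function name `split_side_by_side` and the nested-list frame representation are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Row = Sequence[int]


def split_side_by_side(frame: Sequence[Row],
                       eye_width: int = 1280) -> Tuple[List[Row], List[Row]]:
    """Split a side-by-side stereo frame into (left_eye, right_eye) images.

    Assumption (not confirmed by the paper): the left camera occupies the
    left half of every row and the right camera the right half.
    """
    left: List[Row] = []
    right: List[Row] = []
    for row in frame:
        if len(row) != 2 * eye_width:
            raise ValueError("row width must be exactly twice the per-eye width")
        left.append(row[:eye_width])   # pixels routed to the left eye
        right.append(row[eye_width:])  # pixels routed to the right eye
    return left, right


# Tiny 2-row frame with a per-eye width of 2 pixels, for illustration only.
demo_left, demo_right = split_side_by_side([[0, 1, 2, 3], [4, 5, 6, 7]], eye_width=2)
print(demo_left, demo_right)
```

In the monoscopic conditions the same single-camera image would be shown to both eyes, so no split is needed.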

Participants used an Xbox wireless controller as the input device to control the UGV. Fig. 4 shows the elements of the control and image transmission system, consisting of a custom control protocol, robot controller software for Windows, Inter-Process Communication (IPC), a VR and PC interface built in Unity3D, and the Xbox controller for controlling the robot remotely.

III-B Conditions

In this study, the following four conditions were explored (see Fig. 1). Fig. 1e shows the UGV’s camera pointing at a board with black dots and words to show the difference between monoscopic and stereoscopic images.

  • D1: Non-immersive display with fixed screen using monoscopic images (see Fig. 1A, 1a). We used a traditional 27-inch monitor as the display screen. The content of the display was from monoscopic images (fed from one of the binocular cameras placed on the robot). This is the non-immersive display and the baseline condition as this is what is provided typically in current UGV/UAV via a mobile phone, tablet or regular desktop.

  • D2: Immersive display with fixed screen using monoscopic images (see Fig. 1B, 1b). We constructed a fixed virtual screen in VR. All irrelevant information is blacked out, except for what is displayed. This condition is the monoscopic view in VR.

  • D3: Immersive display with head tracking screen using monoscopic images (see Fig. 1C, 1c). We included this condition because it matches what current UGV or UAV with VR capabilities provide. We turned the fixed screen in D2 into a moving screen that follows users’ head motion. It used monoscopic images captured from one of the binocular cameras.

  • D4: Immersive display with head tracking screen using stereoscopic images (see Fig. 1d). We used the same moving screen as D3, but the content of display was stereoscopic images captured using both binocular cameras. Each camera would provide images to the corresponding eye (see Fig. 2b).
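The difference between a fixed virtual screen (as in D2) and a head tracking screen (as in D3 and D4) can be sketched as a small placement rule. The Python snippet below is an illustration only: the coordinate convention (y up, z forward, yaw about the vertical axis), the 2 m viewing distance, the anchor position, and the function names are all hypothetical and are not taken from the paper's Unity3D implementation.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def forward_from_yaw_pitch(yaw_deg: float, pitch_deg: float) -> Vec3:
    """Unit view direction for a head yaw/pitch (assumed: y up, z forward)."""
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    return (math.sin(yaw) * math.cos(pitch),
            math.sin(pitch),
            math.cos(yaw) * math.cos(pitch))


def screen_center(head_pos: Vec3, yaw_deg: float, pitch_deg: float,
                  head_tracked: bool, distance: float = 2.0,
                  fixed_anchor: Vec3 = (0.0, 1.6, 2.0)) -> Vec3:
    """Return where the virtual screen is centered for the current head pose."""
    if not head_tracked:
        # Fixed screen: stays at a world-space anchor regardless of head motion.
        return fixed_anchor
    # Head tracking screen: always 'distance' metres along the view direction.
    fwd = forward_from_yaw_pitch(yaw_deg, pitch_deg)
    return tuple(h + distance * f for h, f in zip(head_pos, fwd))


head = (0.0, 1.6, 0.0)
print(screen_center(head, 90.0, 0.0, head_tracked=False))  # unchanged anchor
print(screen_center(head, 90.0, 0.0, head_tracked=True))   # follows the head turn
```

With the head tracking rule, the image always fills the same part of the visual field, whereas the fixed rule lets the operator look away from the screen.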

III-C Tasks and Procedures

To investigate the performance of the four different types of views and display methods, participants had to drive the robot through a customized maze (Fig. 3b). We installed a high-definition video camera and 8 VICON motion tracking cameras to capture the movement of the robot along the maze and to detect any collisions with the objects in the maze.

For each condition, participants had to drive the robotic car remotely and maneuver it through the maze as fast as possible but without hitting or colliding with the obstacles. The driving maze was designed to have five different tasks (see Fig. 3a):

  • T1: This task consisted of 36 small and static wooden cubes. As the UGV moves deeper into the maze, the distance between the two cubes would become smaller (from 32cm to 30cm to 28cm).

  • T2: This task had two big cubes placed horizontally. The user needed to drive the UGV sideways to prevent collisions.

  • T3: This task had a cardboard box that rotated at a constant speed. Participants would need to judge distance and time their move accordingly to avoid colliding with this spinning obstacle.

  • T4: This task had a circular moving obstacle and a cross shaped, static obstacle. Participants had to judge the distance between the robot and the obstacles and time its move accordingly to avoid hitting both the moving obstacle and horizontal cross at the same time.

  • T5: This task consisted of an obstacle that would move in a straight line in a backward and forward manner and a cube placed vertically. Participants needed to judge the robot’s distance and time its move to avoid colliding with either object.

Participants received simple driving training outside the maze before they started the formal trials. The purpose of this training was to give participants the chance to become familiar with the controls, the HMD, and controlling the robot. After this training, the participants were asked to run the formal trials. Each participant had three trials with each display method, and the order of display methods was pre-determined by a 4 × 4 Latin square to reduce any learning effect.
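To illustrate the counterbalancing step, the sketch below builds a 4 × 4 Latin square and assigns one row to each participant. The paper does not say which Latin square was used. This example assumes a balanced (Williams) design, in which each condition also follows every other condition exactly once, and the function names are hypothetical.

```python
from typing import List, Sequence


def balanced_latin_square(n: int) -> List[List[int]]:
    """Williams design for an even number of conditions n."""
    if n % 2 != 0:
        raise ValueError("this simple construction needs an even n")
    # First row follows the pattern 0, 1, n-1, 2, n-2, ...
    first = [0]
    lo, hi = 1, n - 1
    while len(first) < n:
        first.append(lo)
        lo += 1
        if len(first) < n:
            first.append(hi)
            hi -= 1
    # Every later row shifts each entry of the first row by one (mod n).
    return [[(x + r) % n for x in first] for r in range(n)]


def assign_orders(n_participants: int,
                  conditions: Sequence[str] = ("D1", "D2", "D3", "D4")) -> List[List[str]]:
    """Give participant p the row p mod n of the square."""
    square = balanced_latin_square(len(conditions))
    return [[conditions[i] for i in square[p % len(conditions)]]
            for p in range(n_participants)]


for p, order in enumerate(assign_orders(8), start=1):
    print(f"P{p}: {' -> '.join(order)}")
```

With eight participants and four rows, each presentation order is used by exactly two participants.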

III-D Participants

Eight participants (5 males and 3 females, aged between 20 and 29, mean = 24.5) were recruited for this experiment. Data collected from the pre-experiment demographics questionnaire show that they all declared themselves to be healthy, without any physical or other health issues. They all had normal or corrected-to-normal vision and did not suffer from any known motion sickness issues in their normal daily activities. None of them had any experience driving a UGV using an HMD in FPV. As such, it was the first time for all 8 participants to drive a remote UGV using an HMD in FPV.

This experiment has been approved by the University Ethics Committee at Xi’an Jiaotong-Liverpool University.

III-E Hypotheses

Based on our review of the literature and experiment design, we formulated the following three hypotheses:

  • H1: D4 would lead to the best overall performance in distance perception; D1 would lead to worse overall performance than D2 and D3;

  • H2: D4 would lead to the best local performance in the complex tasks (T3, T4, and T5); D1 would lead to worse local performance than D2 and D3 in these three tasks (T3, T4, and T5);

  • H3: D4 would significantly reduce user demands and would be the most popular display method; D1 would place more demands on users and be less preferred than D2 and D3.

III-F Results

III-F1 Overall Performance

All participants understood the nature of the tasks and all recorded data were valid. If a collision lasted less than 1 second, it was counted as one collision only. If it lasted longer than 1 second, each 1 second of collision time was counted as one collision. We recorded the number of collisions for each trial by checking, frame by frame, the high-definition videos from the camera and the VICON tracking system. A Shapiro-Wilk test for normality was performed on each of the measures separately for each condition and showed that they followed a normal distribution.
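The counting rule above can be written as a short function. This is a sketch under one stated assumption: the text does not say how a partial second beyond a whole number is treated (for example, a 2.5 s contact), so the example rounds up. The function name `count_collisions` and the sample durations are hypothetical.

```python
import math
from typing import Iterable


def count_collisions(contact_durations_s: Iterable[float]) -> int:
    """Convert contact durations (seconds) into a collision count.

    Rule from the text: a contact shorter than 1 s counts once; a longer
    contact counts once per second of contact time.
    Assumption: partial seconds are rounded up (2.5 s counts as 3).
    """
    total = 0
    for duration in contact_durations_s:
        if duration <= 0:
            continue  # no contact recorded
        total += max(1, math.ceil(duration))
    return total


# Hypothetical trial with three contacts lasting 0.4 s, 1.0 s and 2.5 s.
print(count_collisions([0.4, 1.0, 2.5]))
```

Replacing `math.ceil` with `math.floor` gives the alternative reading in which only completed seconds are counted.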

Fig. 5a shows the average number of collisions per condition. A repeated measures ANOVA with Greenhouse-Geisser correction showed that the mean of the number of collisions differed significantly between display types (F(1.722, 12.054) = 10.691, p <.05). A Bonferroni post-hoc test revealed that the number of collisions was significantly lower for D2 and D4 (p <.05) compared to D1. There was no significant difference between D2, D3 and D4 (p >.05).

For completion time, a repeated measures ANOVA with Greenhouse-Geisser correction found that there was no significant difference between conditions (F(1.481, 10.365) = 1.327, p >.05).

III-F2 Local Performance

Fig. 5b shows the average number of collisions for each condition in each task. A repeated measures ANOVA with Greenhouse-Geisser correction found that there was a significant difference in T3 (F(1.464, 10.248) = 14.477, p <.05). A Bonferroni post-hoc test showed that the number of collisions in T3 was significantly lower for D4 when compared to D1 (p <.05) and D3 (p <.001).

There was no significant difference among D1, D2, D3 and D4 in completion time.

III-G Subjective Results

Fig. 5: (a) Overall mean and std. deviation of the number of collisions for the four conditions (D1, D2, D3, and D4); (b) Local mean and std. deviation of the number of collisions for the four conditions (D1, D2, D3, and D4) in each task (T1, T2, T3, T4, T5); (c) Average of response scores for each element of the UEQ. Higher scores represent greater preference in all cases. The mean value is marked by a symbol in the plot. All error bars show +1 SD.

III-G1 NASA-TLX Workload

A Kruskal-Wallis H test was conducted and found no significant difference in any of the sub-scales of the NASA-TLX workload (Mental, Physical, Temporal, Effort, Performance, and Frustration).

III-G2 User Experience Questionnaire (UEQ)

Fig. 5c shows the results of the UEQ. They were evaluated with the Kruskal-Wallis H test comparing the four conditions for the six UEQ elements (Attractiveness, Perspicuity, Efficiency, Dependability, Stimulation, and Novelty). There was a significant difference in Novelty (χ²(3) = 9.077, p <.05). Dunn’s post-hoc analysis showed a significant difference (p <.05) for D1 vs D4 in Novelty.

III-G3 Interview Results

Overall, all participants had a positive experience in the experiment. None of the participants commented that they had any serious discomfort or simulation sickness in the four conditions. Most said that the most difficult task was T3, the section with a cardboard box that rotated at a constant speed.

III-H Discussion

The results confirmed some of our initial hypotheses but also revealed some different effects. We discuss these next.

III-H1 Overall Performance

From the evaluation of the overall number of collisions, we observed the following effects:

  • D2 significantly reduced the number of collisions compared to D1. This supports H1 in part.

  • D3 did not significantly reduce the number of collisions compared to D1. This seems to contradict H1.

  • D4 significantly reduced the number of collisions compared to D1. This also supports H1 in part.

D2, D3, and D4 all used VR. Their advantage over a normal monitor was that they could increase immersion and eliminate the reflection of light on the 2D display screen. Participants seem to have had better concentration and immersion when all off-screen information was blacked out. According to H1, all conditions with VR (D2, D3, and D4) should have improved performance compared to D1. However, we only found significant improvements in D2 vs D1 and D4 vs D1. This seems to indicate that the display mode used by existing products in the market (D3), as described in Section III-B (Conditions), did not differ significantly from D1 in performance for real-time obstacle avoidance tasks. That is, it did not lead to good performance overall.

For overall completion time, our analysis indicated that the different conditions did not significantly improve efficiency in the tasks, which contradicted H1.

In interviews after the experiment, participants commented that they needed more effort in T3 because estimating the distance of the UGV to a moving obstacle with frequent changes in depth was very challenging.

III-H2 Local Performance

Users behaved differently in different tasks in our experiment. Based on the analysis of collisions, we found that only one task (T3) presented significant differences in performance, while the other tasks did not, which supported part of H2 (only T3 showed differences). In particular, we found the following:

  • Using the virtual screen in VR (D2) and the monoscopic view in VR (D3) did not significantly reduce the number of collisions compared to the 2D screen (D1) in T3.

  • Having stereoscopic images in VR (D4) significantly reduced the number of collisions compared to the 2D screen (D1) and the monoscopic view in VR (D3), respectively.

T3 required participants to judge the distance between the UGV and a spinning box. The difference between this task (T3) and the other tasks with dynamic obstacles (T4 and T5) was that it changed not only the motion state of the obstacle but also, significantly, the depth of the obstacle that the user needed to perceive from the views. This was where the advantage of the 3DS view was observed the most, because it allowed participants to obtain better depth perception and a 3D effect. Therefore, the participants could have a better sense of the distance between the UGV and the obstacles, which helped improve their performance. This was supported by the interview data. Participants commented that they needed more effort on T3 because estimating the distance to a moving obstacle with frequent changes in perceptual depth was quite challenging.

III-H3 User Demands and User Preferences

The NASA-TLX data showed no significant difference in workload among the four conditions, so D4 did not significantly reduce user demands, which contradicted the first part of H3. The results of the UEQ gave us further insights into participants’ preferences: D4 was considered more creative and innovative than D1, which supported the second part of H3.

IV User Study 2

Based on the above results of Study 1, we combined D2 and D4 to create a new display mode - D5. This version would show participants a fixed big screen in VR and allow them to have 3DS vision by giving each eye different images from the corresponding binocular camera. The only difference between D4 and D5 was whether the screen was movable or fixed (see Fig. 6).

Fig. 6: (a) A picture of the UGV’s camera(s) watching a board. (b) View of the left camera of the binocular cameras. (c) View of the right camera of the binocular cameras. (d) Images for the two eyes in D4 and (e) in D5.
Fig. 7: (a) Overall mean and std. deviation of the number of collisions for the two conditions (D4 and D5); (b) Local mean and std. deviation of the number of collisions for the two conditions (D4 and D5); (c) Average of response scores for each element of the NASA-TLX workload. Higher scores represent more demands in all cases; (d) Average of response scores for each element of the UEQ. Higher scores represent greater preference in all cases. The mean value is marked by a symbol in the plot. All error bars show +1 SD.

IV-A Conditions

We decided to compare the best-performing version (D4) in Study 1 with the new mode (D5):

  • D4: Immersive display with head tracking screen using stereoscopic images. The same as in Study 1.

  • D5: Immersive display with fixed screen using stereoscopic images. We used a fixed virtual screen in VR. The display content was stereoscopic (using both binocular cameras; each camera provides different images to their corresponding eye).

IV-B Participants

Another 8 participants (4 males and 4 females, aged between 21-27, mean = 22.5) were recruited for this experiment. Data from the pre-experiment questionnaire show that none of the participants had major physical discomfort, health problems, simulator sickness, or vision issues. All of them were able to complete the pre-training successfully. Similar to participants in Study 1, none of these participants had any experience with driving a UGV using HMD in FPV. As such, it was also the first time for all 8 participants to drive a UGV using an HMD in FPV.

IV-C Experiment Setup

The rest of the experiment setup was the same as in Study 1.

IV-D Hypotheses

Based on Study 1, we formulated the following two hypotheses:

  • H: According to the objective results of Study 1, D5 would perform better than D4 overall and especially in T3.

  • H: According to the subjective results of Study 1, D5 would be more preferred by participants than D4.

IV-E Results

A Shapiro-Wilk test for normality was performed on each of the measures separately for each condition and showed that they all followed a normal distribution. Fig. 7a and 7b show the performance of the two modes. From these figures, we can see that:

  • For overall performance, the independent-samples t-test showed that D5 significantly reduced the number of collisions (t(78) = 3.473, p <.001) compared to D4.

  • In local performance, the t-test showed that D5 significantly reduced the number of collisions compared to D4 in T1 (t(14) = 3.631, p <.05), T3 (t(14) = 2.395, p <.05), and T5 (t(14) = 3.147, p <.05).
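For readers unfamiliar with how the t statistic and its degrees of freedom arise, the sketch below computes an independent-samples t-test with pooled variance using only the Python standard library. It assumes the equal-variance (Student) form, which the text does not specify, and the collision counts in the example are invented for illustration and are not the study data.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pooled_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, df) for Student's independent-samples t-test."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled variance weights each group's sample variance by its own df.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, df


# Invented per-participant collision counts for two groups of 8.
d4_counts = [5, 6, 4, 7, 5, 6, 8, 5]
d5_counts = [3, 2, 4, 3, 2, 4, 3, 3]
t, df = pooled_t(d4_counts, d5_counts)
print(f"t({df}) = {t:.3f}")
```

Two groups of 8 values give df = 8 + 8 - 2 = 14, the value reported for the per-task tests. Arithmetically, df = 78 corresponds to 80 observations in total across the two groups.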

Fig. 7c and 7d show the summary of the NASA-TLX and UEQ questionnaire data. An analysis shows that:

  • There was a significant difference in the NASA-TLX workload in Effort (χ²(1) = 7.345, p <.05). Participants needed more effort using D4.

  • There was also a significant difference in the UEQ data in Attractiveness (χ²(1) = 10.678, p <.05). Participants preferred D5 significantly more.

IV-F Discussion

From the results, we can observe that participants performed better in D5, as indicated by an overall lower number of collisions (especially in T3). This result confirms the first hypothesis of Study 2. However, we also found that D5 had better performance in T1 and T5, which we had not expected. The NASA-TLX results show a similar trend: participants needed less effort to finish the tasks using D5. In addition, the UEQ data showed that participants liked D5 significantly more, which supported the second hypothesis of Study 2. All the above results show that D5 led to better overall and local performance than D4, placed lower demands on participants, and led to an enhanced user experience. These results suggest that an immersive display with a fixed screen using stereoscopic images (like D5) is a viable approach to allow operators to remotely control a UGV when obstacle avoidance is essential.

V Limitations and Future Work

From the results of the two studies, we can observe that an immersive display with stereoscopic images can significantly increase user performance in obstacle avoidance tasks. They also indicate that a fixed screen in VR led to a better performance than a screen that moves together with head motions.

This research has the following two limitations, which can serve as directions for future work. Even with binocular cameras, it is still not possible to significantly increase the environmental information because of the limitations of the camera’s FOV. If users do not manipulate the camera, they are not able to see a wide view of the environment on a single screen. Cameras that can capture wide views, including 360° panoramic ones, around the robot could be one possible solution but further research is needed to assess their suitability because such views will require users to move their heads frequently, which could lead to higher levels of motion sickness.

In this research, we explored views on two types of displays and did not investigate image distortion or enhancement approaches. Given that the immersive VR display with stereoscopic view led to better performance and higher usability, it will be interesting to see if distorting the images (to enhance certain elements for example) could allow even better performance. Applying edge enhancement could potentially improve the perception of the contours of obstacles and may let operators gauge distance information in a more precise way. This line of research could produce useful and interesting results and applications.

In addition, another way to help improve manoeuvrability to avoid obstacles is to use operators’ physiological information that can be captured during teleoperation. For example, eye gaze is readily available and implicitly provided by operators. Recent research shows that gaze data can be used to improve object manipulation in VR environments [36]. Typically, when approaching an obstacle to be avoided, operators’ gaze would likely be fixated on that object and, when a fixation is longer than a threshold, the system can provide additional information to help steer the UGV towards the most optimal path. As eye trackers are becoming a common feature of HMD, using gaze data could represent a low-cost and efficient approach for improving operator-UGV interaction. Further research is needed to explore how this can be achieved.
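One possible shape for such a gaze-based trigger is sketched below. It is only an illustration of the dwell-time idea described above, not a proposed implementation: the 0.5 s threshold, the sample format of (timestamp, gazed object), and the function name `dwell_triggers` are all hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

GazeSample = Tuple[float, Optional[str]]  # (timestamp in seconds, gazed object or None)


def dwell_triggers(samples: Iterable[GazeSample],
                   threshold_s: float = 0.5) -> List[Tuple[str, float]]:
    """Report (object, time) once each time gaze rests on an object long enough."""
    triggers: List[Tuple[str, float]] = []
    current: Optional[str] = None
    start = 0.0
    fired = False
    for t, target in samples:
        if target != current:
            # Gaze moved to a different object (or to nothing): restart the timer.
            current, start, fired = target, t, False
        if current is not None and not fired and t - start >= threshold_s:
            triggers.append((current, t))
            fired = True  # fire only once per uninterrupted fixation
    return triggers


gaze = [(0.0, "box"), (0.2, "box"), (0.6, "box"), (0.7, None),
        (0.8, "cube"), (1.0, "cube")]
print(dwell_triggers(gaze))
```

A real system would take these samples from the HMD's eye tracker and map gaze rays to obstacles, steps that are outside the scope of this sketch.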

VI Conclusion

In this paper, we have explored four viewing/display modes for real-time unmanned ground vehicle (UGV) control in obstacle avoidance tasks. The aim was to investigate whether these modes allow users to determine UGV-to-obstacle distances. Study 1 evaluated the performance and user preference of the four modes in a maze with moving and static objects that the UGV had to avoid. Results from this study show that the mode with stereoscopic images displayed in a virtual reality head-mounted display (VR HMD) led to better performance and usability. In Study 2, a new display mode that combines two modes from Study 1 was compared with the best-performing mode (VR with stereoscopic view). Overall, our results indicate that an immersive VR display with a fixed screen using stereoscopic images is a suitable approach for improving depth perception when controlling a UGV in real time in obstacle avoidance tasks, whether the obstacles are static or moving. It also helped participants lower their workload levels and led to an enhanced user experience.


Acknowledgment

The authors would like to thank the participants for their time and the reviewers for their reviews and useful comments.

References


  • [1] J. Y. Chen, E. C. Haas, and M. J. Barnes (2007) Human performance issues and user interface design for teleoperated robots. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 37 (6), pp. 1231–1245.
  • [2] J. Y. Chen, E. C. Haas, and M. J. Barnes (2007) Human performance issues and user interface design for teleoperated robots. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 37 (6), pp. 1231–1245.
  • [3] Z. Chen, X. Luo, and B. Dai (2017) Design of obstacle avoidance system for micro-UAV based on binocular vision. In 2017 International Conference on Industrial Informatics-Computing Technology, Intelligent Technology, Industrial Information Integration (ICIICII), pp. 67–70.
  • [4] R. P. Darken, K. Kempster, and B. Peterson (2001) Effects of streaming video quality of service on spatial comprehension in a reconnaissance task.
  • [5] D. Drascic (1991) Skill acquisition and task performance in teleoperation using monoscopic and stereoscopic video remote viewing. In Proceedings of the Human Factors Society Annual Meeting, Vol. 35, pp. 1367–1371.
  • [6] D. Drascic (1993) An investigation of monoscopic and stereoscopic video for teleoperation.
  • [7] H. Duan, H. Li, Q. Luo, C. Zhang, C. Li, P. Li, and Y. Deng (2016) A binocular vision-based UAVs autonomous aerial refueling platform. Science China Information Sciences 59 (5), pp. 1–7.
  • [8] D. Kim, Y. Go, and S. Choi (2018) First-person-view drone flying in mixed reality. In SIGGRAPH Asia 2018 Posters, pp. 1–2.
  • [9] E. M. Kolasinski (1995) Simulator sickness in virtual environments. Vol. 1027, US Army Research Institute for the Behavioral and Social Sciences.
  • [10] D. R. Lampton, D. P. McDonald, M. Singer, and J. P. Bliss (1995) Distance estimation in virtual environments. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Vol. 39, pp. 1268–1272.
  • [11] Y. Ma, Q. Li, L. Chu, Y. Zhou, and C. Xu (2021) Real-time detection and spatial localization of insulators for UAV inspection based on binocular stereo vision. Remote Sensing 13 (2), pp. 230.
  • [12] J. S. McCarley and C. D. Wickens (2004) Human factors concerns in UAV flight. University of Illinois at Urbana-Champaign Institute of Aviation, Aviation Human Factors Division.
  • [13] D. E. McGovern (1993) Experience and results in teleoperation of land vehicles. In Pictorial communication in virtual and real environments (2nd ed.), pp. 182–195.
  • [14] J. P. McIntire, P. R. Havig, and E. E. Geiselman (2014) Stereoscopic 3D displays and human performance: a comprehensive review. Displays 35 (1), pp. 18–26.
  • [15] J. O. Merritt, V. G. CuQlock-Knopp, M. Kregel, J. Smoot, and W. Monaco (2005) Perception of terrain drop-offs as a function of lr viewpoint separation in stereoscopic video. In Helmet-and Head-Mounted Displays X: Technologies and Applications, Vol. 5800, pp. 169–176.
  • [16] D. Monteiro, H. Liang, J. Wang, H. Chen, and N. Baghaei (2020) An in-depth exploration of the effect of 2D/3D views and controller types on first person shooter games in virtual reality. In 2020 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 713–724.
  • [17] C. W. Nielsen, B. Ricks, M. A. Goodrich, D. Bruemmer, D. Few, and M. Few (2004) Snapshots for semantic maps. In 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No. 04CH37583), Vol. 3, pp. 2853–2858.
  • [18] A. B. Oving and J. B. van Erp (2001) Driving with a head-slaved camera system. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Vol. 45, pp. 1372–1376.
  • [19] R. Pausch, D. Proffitt, and G. Williams (1997) Quantifying immersion in virtual reality. In Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pp. 13–18.
  • [20] A. C. Plascencia, V. Beran, and K. Sedlmajer (2019) Drone sensory data processing for advanced drone control for augmented reality.
  • [21] H. B. Pryor (1969) Objective measurement of interpupillary distance. Pediatrics 44 (6), pp. 973–977.
  • [22] A. Rastogi (1997) Design of an interface for teleoperation in unstructured environments using augmented reality displays.
  • [23] B. Ricks, C. W. Nielsen, and M. A. Goodrich (2004) Ecological displays for robot interaction: a new perspective. In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No. 04CH37566), Vol. 3, pp. 2855–2860.
  • [24] D. R. Scribner and J. W. Gombash (1998) The effect of stereoscopic and wide field of view conditions on teleoperator performance. Technical report ARMY RESEARCH LAB ABERDEEN PROVING GROUND MD HUMAN RESEARCH AND ENGINEERING ….
  • [25] N. Smolyanskiy and M. Gonzalez-Franco (2017) Stereoscopic first person view system for drone navigation. Frontiers in Robotics and AI 4, pp. 11.
  • [26] C. Smyth, V. Paul, A. Meldrum, and K. McDowell Examining alternative display configurations for an indirect vision driving interface. US Army Research Laboratory: Aberdeen Proving Ground, MD, to be published.
  • [27] B. Sredojev, D. Samardzija, and D. Posarac (2015) WebRTC technology overview and signaling solution design and implementation. In 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 1006–1009.
  • [28] K. M. Stanney, R. R. Mourant, and R. S. Kennedy (1998) Human factors issues in virtual environments: a review of the literature. Presence 7 (4), pp. 327–351.
  • [29] A. Steinfeld, T. Fong, D. Kaber, M. Lewis, J. Scholtz, A. Schultz, and M. Goodrich (2006) Common metrics for human-robot interaction. In Proceedings of the 1st ACM SIGCHI/SIGART conference on Human-robot interaction, pp. 33–40.
  • [30] J. B. Van Erp and P. Padmos (2003) Image parameters for driving with indirect viewing systems. Ergonomics 46 (15), pp. 1471–1499.
  • [31] J. Weng, P. Cohen, M. Herniou, et al. (1992) Camera calibration with distortion models and accuracy evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence 14 (10), pp. 965–980.
  • [32] B. G. Witmer and P. B. Kline (1998) Judging perceived and traversed distance in virtual environments. Presence 7 (2), pp. 144–167.
  • [33] B. G. Witmer and W. J. Sadowski Jr (1998) Nonvisually guided locomotion to a previously viewed target in real and virtual environments. Human Factors 40 (3), pp. 478–488.
  • [34] W. Xu, H. Liang, K. Yu, and N. Baghaei (2021) Effect of gameplay uncertainty, display type, and age on virtual reality exergames. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–14.
  • [35] W. Xu, H. Liang, Z. Zhang, and N. Baghaei (2020) Studying the effect of display type and viewing perspective on user experience in virtual reality exergames. Games for Health Journal 9 (6), pp. 405–414.
  • [36] D. Yu, X. Lu, R. Shi, H. Liang, T. Dingler, E. Velloso, and J. Goncalves (2021) Gaze-supported 3D object manipulation in virtual reality. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–13.