Multimodal Interfaces for Effective Teleoperation

by Eleftherios Triantafyllidis, et al.

Research in multi-modal interfaces aims to provide immersive solutions and to increase overall human performance. A promising direction is combining auditory, visual and haptic interaction between the user and the simulated environment. However, no extensive comparisons exist to show how combining audiovisuohaptic interfaces affects human perception, reflected in task performance. Our paper explores this idea. We present a thorough, full-factorial comparison of how all combinations of audio, visual and haptic interfaces affect performance during manipulation. We evaluate how each interface combination affects performance in a study (N=25) consisting of manipulation tasks of varying difficulty. Performance is assessed using both subjective measurements, covering cognitive workload and system usability, and objective measurements, incorporating time- and spatial-accuracy-based metrics. Results show that, regardless of task complexity, using stereoscopic vision with the VRHMD increased performance across all measurements by 40% compared to monocular vision from the display monitor. Using haptic feedback improved outcomes by a further 10%.






I Introduction

The growth of virtual reality, robotics and networking technologies has spiked in recent years. This has led to an increase in teleoperation research, which gives humans the ability to remotely inhabit a foreign body, e.g. a robot as an avatar, to complete a task [58]. With the recent outbreak of pandemics, remote robotic control and telepresence systems have become more important than ever.

Teleoperation delegates the high-level control of a robot to a remote human operator, thus combining human instinct with the computational and physical capabilities of robots. Humans are highly adaptable experts in motion control, making teleoperation a useful tool to help robots complete tasks in novel and dynamic environments. During a teleoperation task, the robot's performance is dictated by the controls sent by the human. So how can we maximise human perception, and thus performance, during task supervision?

The actions of an operator and a remote robotic system are physically detached from one another, which makes the overall experience unnatural. This implies that the policies humans use to control their own bodies may not directly translate into effective control of a foreign body, which can lead to poor performance. To mitigate this, we can maximise feelings of immersion so that the foreign body feels more like the operator's own. This can lead to improved performance when controlling another body in a remote environment [53, 12]: increasing immersion translates into increased performance [67, 15, 26].

To increase the feeling of immersion, and thus performance, we can alter the way the human interacts with their avatar. In a basic setting, users interact with the surrounding virtual environment through a monocular monitor that gives them a visual representation of the environment in which they are operating, which does not necessarily lead to high levels of immersion by itself. Using a virtual reality device could increase performance because it offers richer visual information, particularly attributed to stereoscopic depth [36, 33]. Stimulating other senses, for example using auditory and haptic feedback to superimpose information, may also affect performance.

Previous work has compared the effect of combining some sensory interfaces. However, to the best of our knowledge, no exhaustive comparison has yet been made between visual, haptic and auditory sensory modalities and how combining them affects performance on tasks of varying complexity. Our work aims to address this.

We use a pick and place task to compare the effects of these sensory interfaces on task performance. The setup for this task is depicted in Figure 4. The pick and place task is set in a virtual environment with different object types, sizes and pick and place distances. We compare all combinations of visual (monocular or VR), auditory (presence or absence) and haptic (presence or absence) feedback. Changing these factors affects the difficulty of the task, and we present a detailed analysis of how each combination of sensory inputs affects task performance.

Our study provides evidence to support a recommendation for the best performing combination of sensory interface in manipulation tasks with varying complexity. By incorporating both subjective and objective measurements, we determine which combination offers the best performance for a given task. Throughout this paper, we present how we conduct, evaluate and analyse our experiments.

Contributions of our work include:

  • A unique and reproducible interface which allows various combinations of sensory feedback for performing various tasks under different settings,

  • A low cost hardware and simple software approach in designing an effective vibrotactile haptic data glove,

  • A virtual reality environment with high-fidelity physics simulation (friction, collision, contact forces) to closely resemble real-world interaction and make the best use of existing human motor skills,

  • A concrete experimental design that can be used to test the effectiveness of new emerging technologies,

  • To the best of our knowledge, this is the first exhaustive comparison of its kind between all combinations of visual, auditory and haptic interfaces during manipulation tasks of increasing difficulty.

II Related Work

In this section, we discuss the previous work regarding the effectiveness of multi-sensory interfaces on immersion and performance, object interaction and manipulation. We group these studies in the individual sensory modalities for clarity and identify gaps in current knowledge.

II-A Multi-Modal Interfaces

When operators embody a remote robot or are subjected to a virtual environment for training purposes, using only a visual monocular monitor, they can only experience that remote environment visually. By adding multiple modalities, it was found that the workload of the visual cortex can be reduced, awareness may be increased and thus task performance can be improved [34, 10]. But when using multi-modal interfaces, synchronisation is important. If signals of different modalities are out-of-sync, overall spatial and temporal immersion is reduced, effectively nullifying the benefits of using multi-modal interfaces [42, 44].

Furthermore, sensory feedback strategies need to be decided prior to the implementation of a specific sensory channel. In most cases, a given type of sensory feedback may be delivered either in a continuous manner, i.e. concurrent feedback, or after a desired event, i.e. terminal feedback [53].

This study focuses on audiovisuohaptic interfaces, as vision, hearing and touch are the most highly developed of all human senses and contribute the most to embodiment [42, 22, 12]. We present previous work on visual, audio and haptic interfaces in the following sections.

II-A1 Visual Cues

Most research in this area has focused on the effect of visual interfaces between the human and the avatar. The dominance of vision in the sensory system is well supported [46, 30, 39], contributing around 70% of overall human perception [22]. Thus, providing visual information in the best form is of vital importance.

The two primary sources of a visual interface are standard monocular monitors and VRHMDs, which provide stereo vision. During a target detection task with Unmanned Aerial Vehicles (UAVs), there were no significant differences in performance between the two [9], with the VRHMD even causing motion sickness, potentially attributed to the illusion of self-motion of the vehicle. This is known as vection [28] and is a common complaint among VRHMD users in non-static situations. Since this is still an open problem, our study limits self-motion and compares the effectiveness of both displays in static scenarios.

Though VRHMDs have drawbacks, they also offer many benefits over monocular screens. They offer better depth perception and environmental awareness than standard monitors [47]. This is of importance as studies have shown that humans overestimate their ability to perceive depth in virtual environments [33, 65, 60]. As such, increased depth information leads to reduced collisions with the surrounding environment and better performance during highly dexterous manipulation tasks [36, 49]. It is important, however, how the superimposition of information is delivered to the operator. One study showed that constantly providing feedback can be counterproductive both in user preference and time efficiency compared to providing feedback at the end of a task [51].

Providing a larger field of view can also result in increased performance and environmental awareness [49, 57, 27], but can decrease usability and increase perceived difficulty i.e. workload [27].

II-A2 Auditory Cues

Supplementing vision with auditory information can lead to increased operator awareness, especially during high visual load [52, 53] and can reduce mental workload, correlated to fewer accidents and better performance [23]. Audiovisual interfaces also contributed to intuitive control of a humanoid during manipulation tasks [61].

Extra information, such as alarms and alerts, can be superimposed on a visual display, but presenting it via an audio interface instead can decrease distraction [50]. Operators can also use auditory information to localise sound sources, which is useful when the FOV is limited [54].

Further studies show that modulating auditory pitch may influence object clearance during human walking, with results indicating that participants indeed benefited from such sonification [16, 64]. This suggests that auditory information may provide a richer environmental experience and be a valuable supplement to relying on vision alone.

II-A3 Somatosensory Cues

Tactile feedback can also augment visual information. Communicating spatial alerts via somatosensory means can signal warnings without overloading visual pathways [14, 45]. Manipulation, in particular, can be improved by adding tactile feedback [19, 58, 68] and can result in better performance [7].

For diagnostic surgery simulators using virtual reality, complex and sophisticated tactile approaches for force feedback have been developed to allow realistic reaction forces for deformable objects such as soft tissue [62]. Further research in kinesthetic force feedback has shown some advantages over lower-cost approaches [55, 20], particularly the ability to constrain the grasp motion of the user's hands based on the virtual object they are holding [13].

However, providing high resolution haptic feedback alone does not necessarily guarantee an increase in task performance [6]. Using only vibration feedback can increase spatial awareness for non-deformable i.e. rigid objects [3]. Outputting vibrations which are proportional to the force applied by the robot, also leads to improved performance [41]. We use a similar approach.

II-A4 Audiovisuohaptic Multi-Modal Interfaces

A combination of all three modalities may also be effective in improving performance. One study hypothesises that audiovisuohaptic interfaces may increase task performance as the task gets gradually more difficult [53], but this is untested.

On one hand, an audiovisuohaptic interface did not significantly increase performance during a teleoperated navigation task [34], although operator spatial ability and subjective performance did increase compared to using fewer interfaces. In another study, an audiovisuohaptic interface was implemented to test performance in visual throwing tasks [17]. While not exhausting all interface comparisons or implementing varying task complexities, their results show that neither the point-based haptic device nor the auditory feedback contributed to significantly improved task performance.

A meta-analysis of 45 studies showed that supplementing visual information with either auditory or somatosensory feedback (via vibrotactile cues) increased overall performance [11]. However, no extensive comparison has been conducted on how combining all three modalities affects immersion, and by extension task performance, at higher levels of task complexity. Our study aims to address this.

II-B Object Interaction and Manipulation

To compare the effect of visual, auditory and haptic feedback on task performance, we must first define a task. We chose to measure the effect of these interfaces on manipulation tasks of different difficulties. Manipulation is a suitable choice, since it involves coarse and fine motor movements, depending on the object being grasped.

The Southampton Hand Assessment Procedure (SHAP) [35], defines six clinically validated grasping classifications to test hand function. This comprises the entire range of human hand motion from fine to coarse manipulation. One study even addressed all the possible different grasping techniques a human can initiate with an object by implementing the SHAP in the physics engine MuJoCo, however, no comparison between the sensory modalities was drawn [32]. We are undoubtedly inspired by the aforementioned study. During our experiments, we use a range of different objects and sizes. By doing this we can examine the effect of combining sensory interfaces on the performance of different levels of human motor skill during object manipulation and interaction.

Our aim is to increase task performance by improving immersion. However, immersion is a complex phenomenon which can be negatively influenced by the so-called "Uncanny Valley" – a break in immersion when an artificial being appears too realistic, causing negative responses towards it [40].

More relevant to this study is the "Uncanny Valley of Haptics", which has a similar effect when haptic feedback does not coincide with other sensory feedback and reduces the perception of realism [6]. Neuroimaging studies support this concept, showing that visual and haptic activation overlaps in the occipital lobe [1, 4, 25, 48]. We aim to investigate if the simultaneous presence of both modalities increases performance.

III Hypotheses

The following hypotheses are formed from our review. We primarily hypothesize that an audiovisuohaptic multi-modal interface will prove significantly more effective at higher task complexity, compared to conditions with fewer modalities or their minimal representations.

Hypothesis 1: There will be lower perceived cognitive workload corresponding to higher performance with (a) the stereoscopic VRHMD than with the monocular display monitor, (b) presence of somatosensory feedback than absence and finally (c) presence of auditory feedback than the absence of it.

Hypothesis 2: There will be higher perceived system usability corresponding to higher performance with (a) the stereoscopic VRHMD than with the monocular display monitor, (b) presence of somatosensory feedback than absence and finally (c) presence of auditory feedback than the absence of it.

Hypothesis 3: Faster performance corresponding to less placement and completion time will be observed with (a) the stereoscopic VRHMD than with the monocular display monitor, (b) presence of somatosensory feedback than absence and finally (c) presence of auditory feedback than the absence of it.

Hypothesis 4: Better depth estimation, with less distance error to target, will be measured in the order of interface conditions incorporating (a) the stereoscopic VRHMD than with the monocular display monitor, (b) presence of somatosensory feedback than absence and finally (c) presence of auditory feedback than absence.

Hypothesis 5: Higher placement precision, including higher spatial position and orientation accuracy, will be measured in the order of interface conditions incorporating (a) the stereoscopic VRHMD than with the monocular display monitor, (b) presence of somatosensory feedback than absence and finally (c) presence of auditory feedback than the absence of it.

IV Methodology

This section describes the key hardware and software components in our study. First, to test our hypotheses, we designed a series of experiments. During these experiments, participants performed a pick and place task under various conditions. All possible combinations of a visual, auditory and haptic interface are assessed. Each modality has two modes, as detailed in Table I, providing a full factorial study.

      Vision            Audition             Haptics
      Monitor  VRHMD    Absence  Presence    Absence  Presence
C1       X                 X                    X
C2       X                 X                             X
C3       X                          X           X
C4       X                          X                    X
C5                X        X                    X
C6                X        X                             X
C7                X                 X           X
C8                X                 X                    X
TABLE I: The multi-modal interface broken down into the possible combinations of visual, auditory and haptic feedback.

Audition and haptics can be either present or absent, whereas vision is represented either by a monocular display monitor or a stereoscopic VRHMD. All combinations of these modalities amount to 2 x 2 x 2 = 8 conditions. Assessment of performance is achieved via both objective and subjective metrics. Participants completed manipulation tasks under each of the above conditions.

IV-A Participants

A total of 25 participants were recruited for this study via an advertisement at the University of Edinburgh. Ages ranged from 21 to 44, with 6 females and 19 males. Each had healthy hand control and normal or corrected vision. A 30-minute interactive experience using the VRHMD was given as compensation.

IV-B Equipment and System Setup

For visual feedback, a computer monitor was used for the monocular condition and a VRHMD for stereoscopic vision. The monitor was a 27-inch HP Elite IPS display with 2560 x 1440 resolution and 60 Hz refresh rate, placed 75-100 cm from the participant. The VRHMD was an HTC Vive Pro with a 3.5-inch AMOLED screen at 2880 x 1600 resolution (1440 x 1600 pixels per eye), 90 Hz refresh rate and a wide FOV. High-resolution displays were chosen to limit distance overestimation and degraded longitudinal control [5]. An NVIDIA RTX 2080 Ti was used to ensure consistent frame rates.

Two stereo headsets provided the audio interface. One was integrated onto the HTC Vive VRHMD for stereo conditions. The other was separately attached during the display monitor monocular conditions. Audio quality was at 16 bit, 44100 Hz.

To provide haptic feedback, we constructed a custom haptic glove inspired by [29], which incorporated a vibration motor on the thumb and index finger of the glove. In their study, vibration intensity was determined by the size of the virtual object the user was colliding with and touching. While their approach shows a promising step towards immersive experiences in entertainment, we take their method a step further by incorporating physical properties, namely kinetic energy and object penetration, for manipulation scenarios, as detailed later in the methodology. In the construction of our custom glove, 15 coin vibration motors were used (DC 3 V, 70 mA, 12000 RPM). Two motors were placed on each finger (proximal and distal phalanges) and five motors were placed on the palm. Wireless communication between the virtual environment and the glove ensured free movement; this was achieved using a Bluetooth transceiver for each glove.

We chose vibrotactile stimulation rather than force feedback for its lower cost and certain practical advantages. Preliminary findings indicate that force feedback is only more beneficial than vibrotactile stimulation when presented at high resolution [6, 63], but this increases cost and size. Air-jet-driven approaches to force feedback exist and show significant effectiveness, but they require large space and pose a substantially higher cost compared to vibration approaches [59]. Vibratory feedback, moreover, can be more beneficial than force feedback in direct manipulation tasks such as ours [31]. Overall, vibrotactile stimulation has been shown to be an effective substitute for force feedback [37].

The manipulation task for this study was performed by mapping the user's hand movements to an anthropomorphic robotic hand in the simulation environment. To capture hand movements, we used the Leap Motion Hand Controller (LMHC), which uses a stereo camera system and infrared LEDs to capture hand motions. In all conditions, the device was fixed to the participant's forehead, either by a strap or on the front of the VRHMD. The LMHC was able to track the haptic gloves, as anthropomorphic features were retained.

IV-C Software and Simulation Environment Setup

In our experiments, participants conducted manipulation tasks in a virtual environment. As such, this study required a virtual environment connected to the hardware. The relationship of these components is shown in Figure 1.

Fig. 1: Diagram of the simulation setup with all the software plugins used.

The Unity3D engine was used as the core of our virtual environment. Two Shadow robotic hands acted as teleoperated manipulators. Physics simulations of the environment used the Unity3D engine, whereas robotic hand physics were handled by the ROS-Sharp physics engine. Unity obtained hand positions from the LMHC via the Leap Motion SDK. A plugin was developed to communicate between the Unity environment and haptic gloves via a Bluetooth module on the glove’s Arduino controllers.

IV-D Hand Manipulation and Control

The Leap Motion SDK outputs Cartesian joint positions in the world frame, but joint angles are required to control the virtual hand. This translation was made by calculating the angle θ between each joint and its parent from the corresponding bone direction vectors:

    θ = arccos( (v_joint · v_parent) / (‖v_joint‖ ‖v_parent‖) )
A Proportional Derivative (PD) controller was used to control the joints. Each joint has one PD controller, formulated as follows for each timestep t:

    a_t = K_p e_t + K_d ė_t

where a_t is the angular velocity control signal sent to the Shadow hand joints, e_t is the current position error between the human joint and the robot joint, and ė_t is the velocity error between the robot's velocity and the desired velocity, which here is set to zero. K_p and K_d are the gains, tuned such that human and robot motion matched as accurately as possible, as depicted in Figure 2.
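As an illustration, the per-joint controller above can be sketched as follows (a minimal sketch, not the study's code; the gain values `kp` and `kd` are hypothetical placeholders, whereas in the study they were tuned to match human and robot motion):

```python
# Per-joint PD controller sketch: maps the human/robot joint-angle error
# to an angular velocity command, with the desired velocity set to zero.
def pd_control(q_human, q_robot, qdot_robot, kp=10.0, kd=0.5):
    """Return the angular velocity command for one joint.

    e    : position error between the human joint and the robot joint
    edot : velocity error (desired velocity is zero, as in the paper)
    """
    e = q_human - q_robot
    edot = 0.0 - qdot_robot
    return kp * e + kd * edot
```

For example, with a position error of 0.2 rad and a current robot joint velocity of 0.1 rad/s, `pd_control(0.5, 0.3, 0.1)` yields a command of about 1.95.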

Fig. 2: Hand control approach through direct joint angle re-targeting from our custom haptic glove to the final robotic hand.

IV-E Sensory Interface Design

IV-E1 Visual Stimulation

We compare monocular and stereo feedback in our experiments using a generic display monitor and a VRHMD respectively. In addition to providing visual disparity, however, the VRHMD also allows users to control the viewpoint in the virtual environment by moving their head. To conduct a fair experiment, we allowed participants to change their viewpoint when using the monitor via a computer keyboard with standard gaming keybindings, kept the optical hand controller head-mounted in both conditions, and used a monitor of similar resolution to the VRHMD. Participants were allowed to acclimatise to these controls and technologies before commencing the experiments, as detailed further along this work.

IV-E2 Auditory Stimulation

We hypothesize that auditory feedback will contribute to increased performance. Everyday sound effects "that make sense" were used to investigate how sound may compensate for the superimposition of visual information without requiring prior context or explanation to the participants, i.e. sounds inherently perceived as a substitute for text. Audio feedback is given in two situations.

Firstly, warnings and notifications were given via audio. A high-pitched alarm sound warned of imminent collisions between the robotic hands and the environment. A siren alarm sound on the other hand indicated time was running low. A successful "ding" indicated that at least part of an object had been placed inside the target volume irrespective of the placement accuracy.

Auditory feedback also relayed the sounds of interactions in the environment. Picking up, dropping or placing an object produced realistic bump and scrape sounds one would expect when interacting with real objects.

IV-E3 Somatosensory Stimulation

Vibration is applied to the participants' gloves when the robot collides with the environment. Here we describe how the vibration intensity is determined. We are inspired by a similar, very early study that used "collision" signals to transmit variable-frequency tactile feedback [58]. In a later study investigating a vibrotactile approach, the vibration intensity applied to users was proportional to the size of the virtual object being manipulated [29]. We adopt this approach, except that vibration intensity is instead proportional to the kinetic energy and object penetration of each finger segment in simulation. These are then combined to give the final intensity.

Kinetic energy of the virtual collision is formulated as:

    E = (1/2) m v²    (3)

where m is the body mass and v the relative velocity between the robot segment and the environment.

We use the relative penetration between the robot and the environment as a proxy for force. Since our environment is simulated, we have access to the full state space of the environment. Penetration can then be easily defined from the distance d between the robot segment and the centre of the virtual object and the distance r between the centre and the surface of the object, as shown in Equation 4:

    P = 1 − d / r    (4)
Equation 3 and Equation 4 can then be combined to calculate the total vibration intensity, shown in Equation 5:

    I = I_min + c_E (E / E_max) I_max + c_P (P / P_max) I_max    (5)

where I is the final vibration intensity transmitted to the vibration motors and I_min is the minimum vibration intensity needed to distinguish vibrotactile stimulation when in contact; this is set to 25% based on a pilot study consisting of five participants. The second term calculates the vibration intensity based upon the kinetic energy exerted and is controlled by a constant c_E; I_max is the maximum vibration intensity of the hardware, E_max is the maximum calculated kinetic energy in Joules given a velocity limit of 7 set in the physics engine, and E is the current kinetic energy exerted on the object. The kinetic energy is only applicable during object acquisition and, as masses are constant, depends only on the grabbing velocity, i.e. picking up. The final term calculates the vibration intensity based upon the penetration of the robotic hand into the object and is controlled by a constant c_P; P_max is the maximum penetration allowed, in our case 100%, and P is the current penetration into the object. Figure 3 illustrates our haptic glove with its electronics, drive control board and motors exposed.
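The kinetic-energy, penetration and intensity computations above can be combined in code as follows (a sketch reconstructed from the description; the exact weighting of the terms and the constants `c_E` and `c_P` are assumptions, not the study's values):

```python
# Sketch of the vibration-intensity pipeline described in the text.

def kinetic_energy(mass, velocity):
    """E = 1/2 * m * v^2 for the colliding finger segment."""
    return 0.5 * mass * velocity ** 2

def penetration(distance_to_centre, centre_to_surface):
    """Relative penetration of the segment into the object,
    0 at the surface and 1 at the centre."""
    return max(0.0, 1.0 - distance_to_centre / centre_to_surface)

def vibration_intensity(E, P, E_max, i_min=0.25, i_max=1.0, c_E=0.5, c_P=0.5):
    """Minimum contact intensity plus kinetic-energy and penetration
    terms, capped at the hardware maximum (P_max = 100%, so P/P_max = P)."""
    i = i_min + c_E * (E / E_max) * i_max + c_P * P * i_max
    return min(i, i_max)
```

For instance, a resting contact (zero kinetic energy, zero penetration) produces only the 25% baseline intensity, while a fast, deep grasp saturates at the hardware maximum.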

Fig. 3: Haptic glove (left module shown) in its final and first iteration, with its electronics and motors exposed in the latter.

IV-F Manipulation Tasks of Varying Complexity

All tasks required participants to pick up an object from a set starting point and place it at a designated random target location, illustrated with a semi-transparent shape. We integrated three basic types of three-dimensional object shapes, not only to introduce the inherent complexities that come with such objects but also to assess different grasping techniques [32, 35]. While different shapes do vary the task complexity, we also introduced different object sizes as well as placement distances.

IV-F1 Task A - Cube Manipulation

The first task included manipulating a cube shape. A cube was used, as it does not flip or roll and we can assess both its position and rotation accuracy. Grabbing techniques employed included Precision Grasping via Palmar Pinch [35].

IV-F2 Task B - Cylinder Manipulation

The second task included manipulating a cylinder shape. A cylinder can flip over and roll over a surface, making the task harder. We can also assess both the cylinder’s position and rotation. Grabbing techniques employed included Precision Grasping via Palmar Pinch, as well as Cylindrical Grasping, also known as Power Grasp [35].

IV-F3 Task C - Sphere Manipulation

The third and final task concerned the manipulation of a sphere-shaped object. This was considered the hardest task, due to a sphere's tendency to roll even on an ideally horizontal surface if it accumulates sufficient velocity from an imprecise placement or a release from a height offset. Grabbing techniques employed included Precision Grasping via Palmar Pinch as well as Spherical Grasping [35].

IV-F4 Object Scale and Placement Distance

The aforementioned tasks are broken down into two sub-tasks assessing two object scales: large, 50.0 x 50.0 x 50.0 mm (LxWxH), and small, 30.0 x 30.0 x 30.0 mm (LxWxH). These sub-tasks are further broken down by placement distance, defined as the absolute distance from the set starting point to a random target location, with distances of 150.0, 300.0 and 600.0 mm, making the task progressively more difficult. Taking into account our 8 interface conditions, two object sizes, three object shapes and three distances, a total of 8 x 2 x 3 x 3 = 144 trials was conducted per participant. Across all 25 participants, a total of 3600 trials was recorded. All of the manipulation tasks are visually depicted in Figure 4.
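The trial count can be verified by enumerating the full factorial design (a short sketch; the condition labels follow Table I and the shape, size and distance values follow the task descriptions above):

```python
# Enumerate the full factorial design: 8 interface conditions
# x 3 shapes x 2 sizes x 3 distances = 144 trials per participant.
from itertools import product

conditions = [f"C{i}" for i in range(1, 9)]        # Table I conditions
shapes = ["cube", "cylinder", "sphere"]            # Tasks A, B, C
sizes_mm = [50.0, 30.0]                            # large and small scale
distances_mm = [150.0, 300.0, 600.0]               # placement distances

trials = list(product(conditions, shapes, sizes_mm, distances_mm))
assert len(trials) == 144           # per participant
assert len(trials) * 25 == 3600     # across all 25 participants
```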

IV-F5 Task Progression and Succession

Progression to the next task is achieved when there is an intersection between the object and the target position, regardless of accuracy. When an overlap is achieved, the target placement glows slightly and a two-second progression timer is initiated; this timer pauses whenever the object no longer collides with the target placement volume, i.e. when the object has been moved or has not remained stationary. Task progression also occurs if the countdown timer, set to 30 seconds for all tasks, reaches zero; in that case, however, the task is considered a failure rather than a success. Finally, for all tasks, an invisible collision wall was implemented to prevent objects from falling out of physical bounds and rendering retrieval impossible.
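The progression logic above can be sketched as a tick-based loop (an illustrative reconstruction, not the study's implementation; the tick size `dt` is an assumption):

```python
# Sketch of the task-progression logic: a 2-second dwell timer that
# pauses when the object leaves the target volume, and a 30-second
# per-task countdown after which the task is marked a failure.
DWELL = 2.0      # seconds the object must stay inside the target volume
TIMEOUT = 30.0   # per-task countdown

def run_task(overlap_trace, dt=0.5):
    """overlap_trace: one boolean per tick, True while the object
    overlaps the target volume. Returns "success" or "fail"."""
    elapsed = 0.0
    dwell = 0.0
    for overlapping in overlap_trace:
        elapsed += dt
        if elapsed >= TIMEOUT:
            return "fail"            # countdown reached zero
        if overlapping:
            dwell += dt              # timer resumes while overlapping
            if dwell >= DWELL:
                return "success"     # object held in place long enough
        # dwell pauses (is retained) while not overlapping
    return "fail"
```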

Fig. 4: Image (upper): all manipulation tasks, illustrating the different three-dimensional shapes, sizes and distances of 150, 300 and 600 mm (green, yellow and red respectively). Tree (bottom): all 18 tasks broken down by object shape (red), size (blue) and distance (green).

V Evaluation

To evaluate each interface across all manipulation tasks, we implemented both subjective and objective measurements, since immersion and perception are inherently subjective while task performance can be measured objectively. Using both kinds of measurement compensates for the inherent drawbacks of relying exclusively on questionnaires [56, 24]. Measurements are summarized in Table II.

V-a Subjective Measurements

We first measured cognitive workload for each interface condition using the multidimensional NASA Task Load Index questionnaire (NASA-TLX) [21], which incorporates six sub-scales: mental demand, physical demand, temporal demand, effort, frustration, and performance.

In addition, we assessed overall system usability using the System Usability Scale questionnaire (SUS) [8]. It consists of ten questions on a 5-point Likert scale ranging from "strongly disagree" to "strongly agree", evaluating system complexity, consistency and cumbersomeness.

V-B Objective Measurements

Overall task performance was measured by first comparing the total proportion of successful task completions, defined as placing the object at the target location within the 30-second countdown window of each task. Accuracy was not considered here; however, a minimum overlap with the target volume was required.

Time-based metrics were also incorporated, specifically placement and completion time, to assess how quickly users performed with each interface. Placement time was defined as the time it took users to pick up the object and place it at the target location, i.e. the time stamp of the very first collision between the object and the target volume; any subsequent accuracy corrections were not counted. Completion time, on the other hand, was defined as the overall time it took users to successfully complete a task.

In addition to time, spatial metrics were implemented to assess the accuracy of placing objects and how each interface affects it, which is vital in remotely piloted systems concerned with fine manipulation. Target distance error was measured at the end of each task and defined as the distance between the center of the object and the target location, with higher values indicating worse performance. Position accuracy was calculated by averaging the distances between the centers of the object and the target along all three Euclidean axes (X, Y, Z) into one final percentage value. Orientation accuracy was calculated similarly but depended on the three-dimensional shape: for the cube and cylinder, a modulo operation of 45 and 90 degrees was applied respectively, while the orientation of the sphere was disregarded due to its inherent symmetry.
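The orientation-accuracy metric might be sketched as follows; the exact normalisation is not given in the text, so folding the error to the nearest symmetric pose and scaling by half the symmetry angle are assumptions made for illustration:

```python
# Hedged sketch of the orientation-accuracy metric; the paper does not give
# the exact normalisation, so the folding and the max_err scaling below are
# assumptions made for illustration only.
def orientation_accuracy(angle_deg, symmetry_deg):
    """Angular error folded by the shape's rotational symmetry, as a percentage."""
    err = angle_deg % symmetry_deg
    err = min(err, symmetry_deg - err)   # distance to the nearest symmetric pose
    max_err = symmetry_deg / 2.0         # worst possible folded error
    return 100.0 * (1.0 - err / max_err)

# Per the text, modulo operations of 45 and 90 degrees are used for the cube
# and cylinder respectively; the sphere is skipped due to its symmetry.
print(orientation_accuracy(0.0, 90.0), orientation_accuracy(45.0, 90.0))
```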

Measurement          Type        Metric
Cognitive Workload   Subjective  Questionnaire [Likert Scale]
System Usability     Subjective  Questionnaire [Likert Scale]
Task Succession      Objective   Percentage [%]
Placement Time       Objective   Seconds [s]
Completion Time      Objective   Seconds [s]
Target Error         Objective   Meters [m]
Position Accuracy    Objective   Percentage [%]
Rotation Accuracy    Objective   Percentage [%]
TABLE II: Summary of both objective and subjective measurements.

V-C Procedure

Prior to commencing the experiment, participants were briefed on its purpose, gave formal written consent, and were handed the NASA-TLX and SUS questionnaires to become acquainted with the scales. Once familiar with the questionnaires, their interpupillary distance (IPD) was measured for the VRHMD and they were given 10 minutes to get acquainted with the simulation environment. During this acclimatization procedure, participants could familiarize themselves with the keyboard controls and the technologies used in the actual experiment, but not with the actual tasks. Finally, because there were eight different interface conditions, we randomized their order for each participant to counterbalance potential acclimatization or task adaptation.

Vi Results

Vi-a Analyses Techniques and Methods

To analyze our results, where data were parametric and normality was not violated (Shapiro-Wilk test), a repeated-measures analysis of variance (RM-ANOVA) was used, followed by post-hoc analyses for pairwise comparisons of the eight interface conditions. Where sphericity was violated (Mauchly's test), the degrees of freedom were corrected with a Greenhouse-Geisser correction when the estimated sphericity was low, and a Huynh-Feldt correction otherwise [2].

For non-parametric data, specifically ordinal data such as Likert scales, an Aligned Rank Transform (ART) [66] was used to permit parametric tests such as the RM-ANOVA. For non-parametric continuous data, Friedman's test, the non-parametric counterpart of the RM-ANOVA, was used to test for significance across the eight interface conditions [18], with Wilcoxon signed-rank tests for post-hoc pairwise comparisons across the interface conditions.
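The non-parametric branch of this procedure can be sketched with SciPy on synthetic data; the dataset shape and values below are illustrative, not the study's data:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data: 25 participants x 8 interface conditions
# (e.g. placement times) -- illustrative only, not the study's data.
rng = np.random.default_rng(0)
data = rng.gamma(shape=4.0, scale=2.0, size=(25, 8))

# Normality check per condition (Shapiro-Wilk).
shapiro_p = [stats.shapiro(data[:, c]).pvalue for c in range(8)]

# Non-parametric omnibus test across the 8 repeated conditions (Friedman).
chi2, p_omnibus = stats.friedmanchisquare(*(data[:, c] for c in range(8)))

# Post-hoc pairwise comparison (Wilcoxon signed-rank), e.g. condition 1 vs 5.
w, p_pair = stats.wilcoxon(data[:, 0], data[:, 4])
```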

For samples classified as Bernoulli-distributed, i.e. the proportion of successful completions, a deviation of more than two standard deviations from the mean was considered significant (95% CI), following the empirical rule [43]. Hereinafter, for all reported results, the significance levels are: *, **, *** and n.s. (not significant).
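As a hedged illustration of this two-standard-deviation criterion, the following sketch compares two conditions by their 95% bands; the grouping is one plausible reading of the criterion, and the example figures are the C1 and C5 success rates reported in Section VI-C1:

```python
# One plausible reading of the two-standard-deviation criterion: compare
# conditions by their 95% bands (empirical rule). Example figures are the
# C1 and C5 success rates (%) reported in Section VI-C1.
def two_sigma_interval(mean, sd):
    """95% band under the empirical rule: mean +/- 2 SD."""
    return (mean - 2.0 * sd, mean + 2.0 * sd)

def non_overlapping(a, b):
    """True when two (low, high) intervals do not intersect."""
    return a[1] < b[0] or b[1] < a[0]

c1 = two_sigma_interval(39.33, 21.69)  # monitor, no audio, no haptics
c5 = two_sigma_interval(96.22, 4.73)   # VRHMD, no audio, no haptics
print(non_overlapping(c1, c5))         # True: the difference is significant
```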

Finally, in the Appendices, we summarize the overall results of each interface condition across all measurements, thus providing new evidence for the hypothesized but untested effectiveness of each interface condition suggested by Sigrist et al. [53].

Vi-B Subjective Results

Vi-B1 Perceived Workload

For perceived workload, an ART was used to allow parametric tests on ordinal data. A one-way RM-ANOVA with a Greenhouse-Geisser correction () yielded a highly significant difference across all eight interface conditions (,,). Mean responses for perceived workload demand are shown in Figure 5 and Table III. Post-hoc analysis showed partial support for hypothesis H1: specifically, (a) conditions incorporating monocular vision with the display monitor (C1, C2, C3 & C4) accounted for significantly higher perceived workload () than those with stereoscopic vision with the VRHMD (C5, C6, C7 & C8). Furthermore, (b) conditions incorporating somatosensory feedback showed significantly lower perceived workload only when paired with stereoscopic feedback: C6 & C8 scored significantly lower () than C5 & C7, while with monocular feedback only a marginally lower workload was observed for C2 () compared to C1. Finally, (c) conditions with audition alone did not produce an observable difference in workload ().

Fig. 5: Box plot illustration across all eight interface conditions of the mean perceived workload, with higher scores indicating worse performance. Dots represent outliers.

Vi-B2 Interface Usability

For perceived system usability, an ART was again used to allow parametric tests on ordinal data. A one-way RM-ANOVA with a Greenhouse-Geisser correction () yielded a highly significant difference across all eight interface conditions (,,). Average responses for interface usability are shown in Figure 6 and Table III. Post-hoc analysis revealed the same trend as for cognitive workload. Specifically, we again found partial support for our H2 hypothesis, with (a) stereoscopic vision with the VRHMD (C5, C6, C7 and C8) accounting for significantly higher interface usability than monocular vision with the display monitor (C1, C2, C3 and C4) (); (b) somatosensory feedback further increasing overall usability, though again only when paired with stereoscopic visual feedback (C6 and C8) (); and (c) auditory feedback by itself making no significant difference ().

Fig. 6: Box plot illustration across all eight interface conditions of the mean interface usability, with higher scores indicating better performance. Dots represent outliers.
Subjective Measurements NASA-TLX SUS
Vision Audio Haptic Med. Std. D. Med. Std. D.
C1 Monitor Off Off 75.83 13.82 32.50 14.34
C2 Monitor Off On 68.33 16.22 35.00 17.15
C3 Monitor On Off 70.83 15.00 30.00 17.76
C4 Monitor On On 71.66 14.93 35.00 15.63
C5 VRHMD Off Off 32.50 16.73 82.50 15.82
C6 VRHMD Off On 26.66 15.91 87.50 11.33
C7 VRHMD On Off 34.16 15.58 85.00 11.38
C8 VRHMD On On 26.66 14.52 92.50 13.20
TABLE III: Summary of all subjective results, reporting median and standard deviation across all eight interface conditions.

Vi-C Objective Results

Vi-C1 Error Rate

First, we analyzed the total proportion of successful task completions (%) across all interface conditions. Our sample was classified as Bernoulli-distributed, and a deviation of more than two standard deviations from the mean (empirical rule) was used to test for significance. Results show that interface conditions incorporating stereoscopic vision with the VRHMD (C5, C6, C7, C8) accounted for a significant observable difference () in mean success rates, 96.22% (SD=4.73%), 99.11% (SD=2.62%), 96.22% (SD=5.94%) and 97.55% (SD=4.26%) respectively, compared to the monocular display monitor (C1, C2, C3, C4) with rates of 39.33% (SD=21.69%), 47.55% (SD=18.77%), 51.33% (SD=23.63%) and 48.22% (SD=20.26%) respectively. No significant differences were observed between conditions incorporating haptic or auditory feedback (). Results are depicted in Figure 7.

Fig. 7: Heat-map illustrating the proportion of task success rate going from lower to higher complexity, horizontal axis A.1.1 (left) to C.2.3 (right), across all the interface conditions C1 to C8, vertical axis.

Vi-C2 Placement and Completion Time

For time-based metrics, only successful instances were considered. Shapiro-Wilk tests indicated deviations from normality in both instances (), so the data were treated as non-parametric. Friedman's test yielded a significant difference in mean placement and completion time across the eight interface conditions (, ) and (, ) respectively. Placement and completion times are shown in Table IV and Figure 8. Post-hoc analysis using Wilcoxon signed-rank tests showed partial support for our H3 hypothesis: specifically, (a) stereoscopic visual feedback with the VRHMD (C5, C6, C7 and C8) accounted for significantly lower placement and completion times than the monocular display monitor (C1, C2, C3 and C4) (); (b) somatosensory feedback further reduced placement and completion times, though only when paired with the VRHMD (C6 and C8) (); and (c) auditory feedback did not produce an observable difference across conditions ().

Fig. 8: Objective measurements represented as a bar graph with standard error. From left to right: time-based metrics, mean placement time (opaque) and completion time (slightly transparent), followed by spatial metrics, specifically distance error, position accuracy and rotation accuracy.

Objective Measurements Placement Time [s] Completion Time [s] Distance Error [cm] Pos Accuracy (XYZ) [%] Rot Accuracy (XYZ) [%]
Vision Audio Haptics Mean Std. D. Mean Std. D. Mean Std. D. Mean Std. D. Mean Std. D.
C1 Monitor Off Off 14.27 3.65 16.45 3.91 19.12 6.42 27.89% 16.06 38.09% 21.35
C2 Monitor Off On 12.73 2.98 14.70 3.32 17.58 6.17 34.59% 14.30 47.37% 17.27
C3 Monitor On Off 13.14 3.83 15.05 3.42 16.15 8.06 36.50% 17.09 48.48% 19.86
C4 Monitor On On 13.00 3.55 15.88 3.15 17.80 8.61 35.22% 16.11 45.44% 16.97
C5 VRHMD Off Off 5.48 2.10 9.22 2.10 2.21 1.08 77.75% 5.93 87.55% 7.30
C6 VRHMD Off On 4.51 1.57 7.58 1.60 1.65 0.58 81.04% 6.08 89.08% 6.07
C7 VRHMD On Off 5.22 1.77 8.92 1.85 2.26 1.37 77.65% 7.75 87.27% 6.65
C8 VRHMD On On 4.47 1.64 7.80 1.71 1.76 0.70 80.97% 6.16 90.01% 5.67
TABLE IV: Summary of objective results including time-based and spatial-based metrics with mean and standard deviation across all interfaces.

Vi-C3 Distance Error

For distance error to target, data were normally distributed (Shapiro-Wilk, ). As such, a one-way RM-ANOVA with a Greenhouse-Geisser correction () yielded a highly significant difference across the eight interface conditions (,,). Distance error across all interfaces is shown in Table IV and visually represented in Figure 8. Post-hoc analysis revealed partial support for our H4 hypothesis: specifically, (a) conditions incorporating stereoscopic vision with the VRHMD (C5, C6, C7, C8) accounted for significantly lower distance error () than conditions incorporating monocular vision with the display monitor (C1, C2, C3, C4). Furthermore, (b) conditions incorporating somatosensory feedback, though only when paired with stereoscopic visual feedback (C6, C8), showed further significantly lower target error, and by extension higher placement accuracy (), than conditions without haptic feedback (C5, C7 respectively). Finally, (c) conditions incorporating only auditory stimulation did not produce an observable difference in spatial accuracy compared to those without ().

Vi-C4 Position and Orientation Accuracy

Regarding spatial accuracy, specifically position and orientation accuracy, Shapiro-Wilk tests in both instances yielded (), signifying normally distributed data. As such, a one-way RM-ANOVA with a Greenhouse-Geisser correction, () and () respectively, yielded in both instances a highly significant difference, (,,) and (,,) respectively. Position and orientation accuracy are shown in Table IV and visually represented in Figure 8. Post-hoc analysis revealed full support for our H5 hypothesis: specifically, (a) conditions incorporating stereoscopic vision with the VRHMD (C5, C6, C7, C8) accounted for significantly higher spatial accuracy, both in position and orientation (), than conditions incorporating monocular vision with the display monitor (C1, C2, C3, C4). Furthermore, (b) conditions incorporating somatosensory feedback (C2, C4 and C6, C8) showed further significantly higher spatial accuracy () than those without (C1, C3 and C5, C7 respectively). Finally, (c) the condition incorporating only auditory stimulation (C3) also yielded significantly higher spatial accuracy than the one without (C1) (). Our findings here suggest that spatial accuracy increases significantly when stereo vision is used, and further still when it is paired with sound, somatosensory feedback, or both, rather than relying on vision alone.

Fig. 9: Different participants during the manipulation experiment.

Vii Discussion

Our results are summarised as follows: overall user performance increased by around 40% when using stereoscopic vision with the VRHMD instead of monocular vision with the display monitor. Somatosensory feedback increased performance by a further 10% across all measurements. Auditory stimulation, however, had no significant effect on any measure apart from spatial accuracy, which increased by less than 5%.

These results provide evidence for the untested hypothesis of [53]. More specifically, our results show that an audiovisuohaptic interface incorporating a stereoscopic VRHMD rather than a monocular monitor contributes to the highest task performance, followed closely by visuohaptic and, less closely, audiovisual interfaces. For a cone-like illustration of each interface's effectiveness, closely resembling the figures of [53], see the Appendices.

Our results support existing research that vision is the dominant sense [46, 30], outperforming all other senses [22]. As depth information is important in manipulation tasks, we can infer that better performance in VR may in part be due to the superior information available when using VRHMDs. This supports current literature [33, 65, 60].

Our results showed that less perceived cognitive workload was observed with the VRHMD than with the monocular display. This contradicts previous work [9], which may be attributed to significantly higher amounts of induced vection; full conclusions cannot be drawn from our static scenario, and further investigation is required. Our finding that haptic feedback leads to better performance is supported by some studies [7] but contradicts others [17]; the latter found no significant effect of haptic feedback in a virtual throwing task. Since there is such a large number of options for providing haptic feedback, findings may differ wildly simply by using a slightly different device. More research is needed to investigate how small variations in the way haptic feedback is delivered affect performance, and a standardised device may be needed to compare the actual effect of haptics on humans.

The differences in results for haptic devices may be partially explained by the "uncanny valley of haptics" [6], which suggests that increasing the resolution of haptic feedback without a corresponding level of stimulation from other senses does not guarantee an increase in performance; the resolution of all feedback interfaces has to be similar. That study [6] used handheld controllers to deliver haptic feedback. We used a custom vibrotactile glove with a higher resolution than handheld controllers, but this only increased performance when the resolution of visual stimulation was increased as well, by switching from the monocular display monitor to the stereoscopic VRHMD, thus supporting [6].

We found little evidence that auditory feedback has a positive impact on performance, though spatial accuracy did increase in the audiovisual condition compared to the visual condition (). Workload demand decreased marginally when auditory feedback was present compared to none at all, but not at a significant level (), whereas a significant difference was found in previous work [23]. It is possible that this was a by-product of the performance increase when switching from mono to stereo vision, potentially overshadowing the contribution of audio to the subjective performance of participants.

In both objective and subjective measures, the combination of stereoscopic visual feedback (the VRHMD) with audio and haptic feedback, Condition 8, provided the best overall performance, supporting our primary hypothesis. This is in line with existing literature showing that adding more modalities correlates with improved performance in manipulation scenarios [11]. There was no significant difference in performance when using only two modalities, stereoscopic vision (the VRHMD) and haptic feedback, Condition 6; however, we did see a marginal but still significant drop in position and orientation accuracy in this condition, indicating that auditory feedback did contribute to spatial accuracy.

The main findings and design implications of our study include:

  • Adding additional modalities increases performance

  • Relying on just one modality should be avoided

  • Vision dominates, with the greatest performance gain coming from enhancing mono to stereo vision

  • The effectiveness of multi-modal interfaces is scenario-specific; this research explored it in the context of manipulation

  • For manipulation scenarios, priority should be given to visual, then somatosensory, then auditory stimulation

  • Increasing task complexity lowers effectiveness as expected, but not proportionally for all multi-modal interfaces

  • Vibrotactile feedback can be considered a low-cost somatosensory approach, with more focus given to the design of vibrotactile intensity to compensate for the inherent lack of force feedback

All of our hypotheses are summarized in Table V below, providing an overall overview of our findings.

Hypothesis                       Support   (a)  (b)  (c)
H1: Lower perceived workload     Partial    Y    P    N
H2: Higher system usability      Partial    Y    P    N
H3: Less task time               Partial    Y    P    N
H4: Less distance error          Partial    Y    P    N
H5: Higher placement precision   Full       Y    Y    Y
(a) Vision with stereoscopic VR-HMD rather than monoscopic monitor
(b) Haptic feedback rather than without; (c) sound feedback rather than without
*P: Partial; only effective when paired with stereo VR-HMD
TABLE V: Summary of hypothesis support. Y: Yes, P: Partial, N: No.

Vii-a Design and Research Implications

Our low-cost haptic gloves show that expensive solutions are not required to achieve significant performance increases, in line with [6]. This may enable a wider range of research into haptic feedback and cost-effective multi-modal interfaces.

We also show that adding haptic feedback to monocular visual feedback has no significant effect on performance, whereas adding haptic feedback to a VRHMD improves performance significantly. This is in line with the "uncanny valley of haptics" [6], which suggests that it is not enough to add extra sensory modalities: their resolutions must be similar. This is highlighted in Condition 2 (visuohaptic) and Condition 4 (audiovisuohaptic), where monocular vision is used; there, the additional sensory modalities did not contribute to an observable difference in performance apart from spatial accuracy, possibly due to a mismatch in resolution between monocular vision and the other modalities.

Priorities should be set when designing multi-modal interfaces for object manipulation. Our results indicate that researchers should aim to enhance visual stimuli first, then add somatosensory feedback, and lastly auditory feedback.

Furthermore, based on our results, designers and researchers focusing on human performance in teleoperation are encouraged to combine sensory interfaces as highlighted in this study. We observed that in almost all cases, multi-modal feedback, i.e. visuohaptic and even more so audiovisuohaptic interfaces, performs significantly better than relying on visual feedback alone. This may be even more the case for sensory channels that are already overloaded [34, 10], opening further opportunities for researchers to investigate the effectiveness of such interfaces when channels are overloaded.

Vii-B Limitations and Future Work

This research focused on the contribution of each sensory modality, and combinations thereof, to interface effectiveness. However, we have not yet tested how auditory or somatosensory feedback might compensate for overloaded visual information, which would provide further insight. Furthermore, we investigated common visual feedback modalities, i.e. the display monitor and a VRHMD, with their inherent capabilities; we did not explicitly investigate how monocular and stereoscopic rendering by themselves influence performance. A future road map includes using the VRHMD with either monocular or stereoscopic rendering.

In addition, multi-modal design decisions are of paramount importance before implementing any kind of sensory feedback [53]. In our case, auditory feedback was implemented as a means of task indication and succession, rather than as continuous (concurrent) sonification. An example of concurrent auditory feedback specific to manipulation tasks would be controlling auditory pitch continuously based on target proximity. Thus, further evidence may be needed on how not only different types of sensory feedback influence task succession, but also how the design decisions within each sensory channel affect task efficiency.

We assumed zero to minimal latency during our experiments, knowing that time delays are correlated with simulator sickness. This is a real-world problem in teleoperation, further aggravated in wireless technologies. Latency in our experiments was <15 ms, so its effect was not studied. However, in real-world applications, latency can cause simulator sickness and is also a challenge in teleoperation where communication bandwidth is limited [38].

Within this study, by thoroughly comparing audiovisuohaptic multi-modal interfaces, we have gained valuable insight into which modalities contribute to increased task performance, provided time delay is minimal.

Viii Conclusion

This paper explored how combining multiple sensory interfaces affects performance in manipulation tasks of varying complexity. Each combination of visual (monocular display monitor or a stereoscopic VRHMD), audio (with or without) and haptic (with or without) interface was tested. Task difficulty ranged from low to high by changing the size and shape of objects as well as distance to the target placement.

Performance was measured objectively and subjectively under experimental conditions. The results showed a 40% increase in overall performance when using stereoscopic VRHMD visual feedback compared to a monocular display monitor. Somatosensory stimulation contributed a further 10% increase in performance, while auditory feedback only increased spatial accuracy, by an additional 5%.

Our evaluation found that adding further sensory modalities to an interface yields a significant benefit over relying on visual feedback alone. We thus conclude that task performance in teleoperation can be positively influenced by carefully selecting an appropriate combination of sensory feedback for a given task. Following this study, future researchers and designers should identify and prioritize certain modalities when designing multi-modal interfaces.


We would like to thank Prof. Taku Komura and Prof. Robert Fisher for approving the use of their facilities to accommodate the study. This project was funded by the EPSRC Future AI and Robotics for Space (EP/R026092/1).


In this appendix, we summarize the overall interface effectiveness from our experiments, visualised in Figure 10. These figures show the overall effectiveness of each individual interface condition across all measurements and all tasks, giving a final overview of our entire experimental results.

Fig. 10: Overall interface effectiveness through linear regression, across all measurements and all tasks, with task complexity increasing from lower to higher. The width of each shape represents effectiveness: the wider, the higher. Colouring also indicates effectiveness, increasing from red to green. The overall effectiveness is calculated linearly: each measurement is weighted relative to the maximum limit of that measurement. The data points of the scatter plot have been fitted through linear regression to produce a cone-like illustration, whose width represents the effectiveness of the interface at the specific task complexity. Task complexity is discussed in the section "Manipulation Tasks of Varying Complexity".
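A minimal sketch of the linear weighting described in the caption, under the assumption that each measurement is divided by its maximum limit and that time/error metrics are inverted so that higher always means better; the function and metric names are illustrative, not the authors':

```python
# Sketch of the linear weighting described in the caption (an assumption:
# divide each measurement by its maximum limit, invert time/error metrics
# so higher is always better, then average). Names are illustrative.
def effectiveness(metrics, limits, higher_is_better):
    total = 0.0
    for name, value in metrics.items():
        w = value / limits[name]     # weight by the measurement's maximum limit
        if not higher_is_better[name]:
            w = 1.0 - w              # invert time/error metrics
        total += w
    return total / len(metrics)      # overall score in [0, 1]

# Illustrative values for C8 (VRHMD + audio + haptics) from Table IV.
metrics = {"completion_time": 7.80, "pos_accuracy": 80.97}
limits = {"completion_time": 30.0, "pos_accuracy": 100.0}
better = {"completion_time": False, "pos_accuracy": True}
score = effectiveness(metrics, limits, better)
```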


  • [1] A. A Ghazanfar and C. E Schroeder (2006-07) Is neocortex essentially multisensory?. Trends in cognitive sciences 10, pp. 278–85. External Links: Document Cited by: §II-B.
  • [2] H. Abdi (2010) The greenhouse-geisser correction. Encyclopedia of research design 1, pp. 544–548. External Links: Document Cited by: §VI-A.
  • [3] J. Aleotti, S. Bottazzi, and M. Reggiani (2002-10) A multimodal user interface for remote object exploration in teleoperation systems. pp. . External Links: Link Cited by: §II-A3.
  • [4] A. Amedi, R. Malach, T. Hendler, S. Peled, and E. Zohary (2001) Visuo-haptic object-related activation in the ventral visual pathway. Nature Neuroscience 4, pp. 324–330. External Links: Document Cited by: §II-B.
  • [5] J. B F van Erp and P. Padmos (2004-01) Image parameters for driving with indirect viewing systems. Ergonomics 46, pp. 1471–99. External Links: Document Cited by: §IV-B.
  • [6] C. Berger, M. Gonzalez-Franco, E. Ofek, and K. Hinckley (2018-04) The uncanny valley of haptics. Science Robotics 3, pp. eaar7010. External Links: Document Cited by: §II-A3, §II-B, §IV-B, §VII-A, §VII-A, §VII.
  • [7] D. Brickler, S. V. Babu, J. Bertrand, and A. Bhargava (2018-03) Towards evaluating the effects of stereoscopic viewing and haptic interaction on perception-action coordination. In 2018 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), Vol. , pp. 1–516. External Links: Document, ISSN Cited by: §II-A3, §VII.
  • [8] J. Brooke et al. (1996) SUS-a quick and dirty usability scale. Usability evaluation in industry 189 (194), pp. 4–7. Cited by: §V-A.
  • [9] J. Brooks, R. Lodge, and D. White (2017) Comparison of a head-mounted display and flat screen display during a micro-uav target detection task. Proceedings of the Human Factors and Ergonomics Society Annual Meeting 61 (1), pp. 1514–1518. External Links: Document, Link, Cited by: §II-A1, §VII.
  • [10] G. Burdea, P. Richard, and P. Coiffet (1996) Multimodal virtual reality: input output devices, system integration, and human factors. International Journal of Human Computer Interaction 8 (1), pp. 5–24. External Links: Document, Link, Cited by: §II-A, §VII-A.
  • [11] J. L. Burke, M. S. Prewett, A. A. Gray, L. Yang, F. R. B. Stilson, M. D. Coovert, L. R. Elliot, and E. Redden (2006) Comparing the effects of visual-auditory and visual-tactile feedback on user performance: a meta-analysis. In Proceedings of the 8th International Conference on Multimodal Interfaces, ICMI ’06, New York, NY, USA, pp. 108–117. External Links: ISBN 1-59593-541-X, Link, Document Cited by: §II-A4, §VII.
  • [12] J. Y. C. Chen, E. C. Haas, and M. J. Barnes (2007-11) Human performance issues and user interface design for teleoperated robots. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 37 (6), pp. 1231–1245. External Links: Document, ISSN 1094-6977 Cited by: §I, §II-A.
  • [13] I. Choi, E. W. Hawkes, D. L. Christensen, C. J. Ploch, and S. Follmer (2016) Wolverine: a wearable haptic interface for grasping in virtual reality. 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). External Links: Document Cited by: §II-A3.
  • [14] R. W. Cholewiak and A. A. Collins (2000-09-01) The generation of vibrotactile patterns on a linear array: influences of body site, time, and presentation mode. Perception & Psychophysics 62 (6), pp. 1220–1235. External Links: ISSN 1532-5962, Document, Link Cited by: §II-A3.
  • [15] B. P. DeJong, J. E. Colgate, and M. A. Peshkin (2004-04) Improving teleoperation: reducing mental rotations and translations. In IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA ’04. 2004, Vol. 4, pp. 3708–3714 Vol.4. External Links: Document, ISSN 1050-4729 Cited by: §I.
  • [16] T. Erni and V. Dietz (2001) Obstacle avoidance during human walking: learning rate and cross-modal transfer. The Journal of physiology 534 (1), pp. 303–312. Cited by: §II-A2.
  • [17] E. Frid, J. Moll, R. Bresin, and E. Sallnäs Pysander (2018-05-09) Haptic feedback combined with movement sonification using a friction sound improves task performance in a virtual throwing task. Journal on Multimodal User Interfaces. External Links: ISSN 1783-8738, Document, Link Cited by: §II-A4, §VII.
  • [18] M. Friedman (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32 (200), pp. 675–701. External Links: Document, Link, Cited by: §VI-A.
  • [19] F. Gemperle, N. Ota, and D. Siewiorek (2001-10) Design of a wearable tactile display. In Proceedings Fifth International Symposium on Wearable Computers, Vol. , pp. 5–12. External Links: Document, ISSN 1530-0811 Cited by: §II-A3.
  • [20] X. Gu, Y. Zhang, W. Sun, Y. Bian, D. Zhou, and P. O. Kristensson (2016) Dexmo: an inexpensive and lightweight mechanical exoskeleton for motion capture and force feedback in vr. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI ’16, New York, NY, USA, pp. 1991–1995. External Links: ISBN 9781450333627, Link, Document Cited by: §II-A3.
  • [21] S. G. Hart and L. E. Staveland (1988) Development of NASA-TLX (Task Load Index): results of empirical and theoretical research. Advances in Psychology: Human Mental Workload, pp. 139–183. External Links: Document Cited by: §V-A.
  • [22] M. L. Heilig (1992) El cine del futuro: the cinema of the future. Presence: Teleoperators & Virtual Environments 1, pp. 279–294. Cited by: §II-A1, §II-A, §VII.
  • [23] Y. N. K. T. Iida (1999) Audio feedback system for Engineering Test Satellite VII. Vol. 3840. External Links: Document, Link Cited by: §II-A2, §VII.
  • [24] W. A. IJsselsteijn, H. de Ridder, J. Freeman, and S. E. Avons (2000) Presence: concept, determinants and measurement. In Human Vision and Electronic Imaging V, January 24-27, 2000, San Jose, USA, B.E. Rogowitz and T.N. Pappas (Eds.), Proceedings of SPIE, United States, pp. 520–529 (English). External Links: ISBN 0-8194-3577-5, Document Cited by: §V.
  • [25] T. James, G. Keith Humphrey, S. Gati, P. Servos, R. S. Menon, and M. Goodale (2002-02) Haptic study of three-dimensional objects activates extrastriate visual areas. Neuropsychologia 40, pp. 1706–14. External Links: Document Cited by: §II-B.
  • [26] C. Jennett, A. L. Cox, P. Cairns, S. Dhoparee, A. Epps, T. Tijs, and A. Walton (2008-09) Measuring and defining the experience of immersion in games. Int. J. Hum.-Comput. Stud. 66 (9), pp. 641–661. External Links: ISSN 1071-5819, Link, Document Cited by: §I.
  • [27] S. Johnson, I. Rae, B. Mutlu, and L. Takayama (2015) Can you see me now?: how field of view affects collaboration in robotic telepresence. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, CHI ’15, New York, NY, USA, pp. 2397–2406. External Links: ISBN 978-1-4503-3145-6, Link, Document Cited by: §II-A1.
  • [28] B. Keshavarz, B. Riecke, L. Hettinger, and J. Campos (2015-04) Vection and visually induced motion sickness: how are they related?. Frontiers in psychology 6, pp. 472. External Links: Document Cited by: §II-A1.
  • [29] M. Kim, C. Jeon, and J. Kim (2017-05) A study on immersion and presence of a portable hand haptic system for immersive virtual reality. Sensors 17 (5), pp. 1141. External Links: ISSN 1424-8220, Link, Document Cited by: §IV-B, §IV-E3.
  • [30] R. L. Klatzky, J. M. Loomis, A. C. Beall, S. S. Chance, and R. G. Golledge (1998) Spatial updating of self-position and orientation during real, imagined, and virtual locomotion. Psychological Science 9 (4), pp. 293–298. External Links: ISSN 09567976, 14679280, Link Cited by: §II-A1, §VII.
  • [31] D. A. Kontarinis and R. D. Howe (1995-01) Tactile display of vibratory information in teleoperation and virtual environments. Presence: Teleoper. Virtual Environ. 4 (4), pp. 387–402. External Links: ISSN 1054-7460, Link, Document Cited by: §IV-B.
  • [32] V. Kumar and E. Todorov (2015-11) MuJoCo HAPTIX: a virtual reality system for hand manipulation. In 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), pp. 657–663. External Links: Document Cited by: §II-B, §IV-F.
  • [33] D. R. Lampton, D. P. McDonald, M. Singer, and J. P. Bliss (1995) Distance estimation in virtual environments. Proceedings of the Human Factors and Ergonomics Society Annual Meeting 39 (20), pp. 1268–1272. External Links: Document, Link Cited by: §I, §II-A1, §VII.
  • [34] C. E. Lathan and M. Tracey (2002-08) The effects of operator spatial perception and sensory feedback on human-robot teleoperation performance. Presence: Teleoper. Virtual Environ. 11 (4), pp. 368–377. External Links: ISSN 1054-7460, Link, Document Cited by: §II-A4, §II-A, §VII-A.
  • [35] C. M. Light, P. H. Chappell, and P. J. Kyberd (2002) Establishing a standardized clinical assessment tool of pathologic and prosthetic hand function: normative data, reliability, and validity. Archives of Physical Medicine and Rehabilitation 83 (6), pp. 776–783. External Links: Document Cited by: §II-B, §IV-F1, §IV-F2, §IV-F3, §IV-F.
  • [36] H. Martins and R. Ventura (2009) Immersive 3-D teleoperation of a search and rescue robot using a head-mounted display. 2009 IEEE Conference on Emerging Technologies & Factory Automation. External Links: Document Cited by: §I, §II-A1.
  • [37] M. J. Massimino and T. B. Sheridan (1993) Sensory substitution for force feedback in teleoperation. Presence: Teleoperators and Virtual Environments 2 (4), pp. 344–352. External Links: Document, Link Cited by: §IV-B.
  • [38] S. A. McGlynn and W. A. Rogers (2017) Considerations for presence in teleoperation. In Proceedings of the Companion of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, HRI ’17, New York, NY, USA, pp. 203–204. External Links: ISBN 9781450348850, Link, Document Cited by: §VII-B.
  • [39] J. P. McIntire, P. R. Havig, and E. E. Geiselman (2014) Stereoscopic 3D displays and human performance: a comprehensive review. Displays 35 (1), pp. 18–26. External Links: ISSN 0141-9382, Document, Link Cited by: §II-A1.
  • [40] M. Mori, K. MacDorman, and N. Kageki (2012-06) The uncanny valley [from the field]. IEEE Robotics & Automation Magazine 19, pp. 98–100. External Links: Document Cited by: §II-B.
  • [41] A. M. Murray, R. L. Klatzky, and P. K. Khosla (2003-04) Psychophysical characterization and testbed validation of a wearable vibrotactile glove for telemanipulation. Presence: Teleoper. Virtual Environ. 12 (2), pp. 156–182. External Links: ISSN 1054-7460, Link, Document Cited by: §II-A3.
  • [42] G. V. Popescu, G. C. Burdea, and H. Trefftz (2002) Multimodal interaction modeling. Handbook of virtual environments: Design, implementation, and applications, pp. 435–454. Cited by: §II-A, §II-A.
  • [43] F. Pukelsheim (1994) The three sigma rule. The American Statistician 48 (2), pp. 88–91. External Links: ISSN 00031305, Link Cited by: §VI-A.
  • [44] P. Richard, G. Burdea, D. Gomez, and P. Coiffet (1994) A comparison of haptic, visual and auditive force feedback for deformable virtual objects. In Proceedings of the International Conference on Automation Technology (ICAT), Vol. 49, pp. 62. Cited by: §II-A.
  • [45] J. L. Rochlis and D. J. Newman (2000) A tactile display for International Space Station (ISS) extravehicular activity (EVA). Aviation, Space, and Environmental Medicine 71 (6), pp. 571–578. Cited by: §II-A3.
  • [46] I. Rock and J. Victor (1964) Vision and touch: an experimentally created conflict between the two senses. Science 143 (3606), pp. 594–596. External Links: Document, ISSN 0036-8075, Link Cited by: §II-A1, §VII.
  • [47] L. B. Rosenberg (1993-09) The effect of interocular distance upon operator performance using stereoscopic displays to perform virtual depth tasks. In Proceedings of IEEE Virtual Reality Annual International Symposium, pp. 27–32. External Links: Document Cited by: §II-A1.
  • [48] K. Sathian and A. Zangaladze (2002-10) Feeling with the mind’s eye: contribution of visual cortex to tactile perception. Behavioural brain research 135, pp. 127–32. External Links: Document Cited by: §II-B.
  • [49] D. R. Scribner and J. W. Gombash (1998) The effect of stereoscopic and wide field of view conditions on teleoperator performance. Technical report, Army Research Lab, Aberdeen Proving Ground, MD, Human Research and Engineering. External Links: Link Cited by: §II-A1, §II-A1.
  • [50] R. Secoli, M. Milot, G. Rosati, and D. J. Reinkensmeyer (2010) Effect of visual distraction and auditory feedback on patient effort during robot-assisted movement training after stroke. Journal of NeuroEngineering and Rehabilitation. External Links: Document Cited by: §II-A2.
  • [51] X. Shang, M. Kallmann, and A. S. Arif (2019) Effects of correctness and suggestive feedback on learning with an autonomous virtual trainer. In Proceedings of the 24th International Conference on Intelligent User Interfaces: Companion, IUI ’19, New York, NY, USA, pp. 93–94. External Links: ISBN 978-1-4503-6673-1, Link, Document Cited by: §II-A1.
  • [52] R. D. Shilling and B. Shinn-Cunningham (2002) Virtual auditory displays. In Handbook of Virtual Environments, pp. 105–132. Cited by: §II-A2.
  • [53] R. Sigrist, G. Rauter, R. Riener, and P. Wolf (2013-02-01) Augmented visual, auditory, haptic, and multimodal feedback in motor learning: a review. Psychonomic Bulletin & Review 20 (1), pp. 21–53. External Links: ISSN 1531-5320, Document, Link Cited by: §I, §II-A2, §II-A4, §II-A, §VI-A, §VII-B, §VII.
  • [54] B. D. Simpson, R. S. Bolia, and M. H. Draper (2013) Spatial audio display concepts supporting situation awareness for operators of unmanned aerial vehicles. Human Performance, Situation Awareness, and Automation: Current Research and Trends (HPSAA II), Vol. 2, pp. 61. Cited by: §II-A2.
  • [55] M. Sinclair, E. Ofek, M. Gonzalez-Franco, and C. Holz (2019) CapstanCrunch: a haptic vr controller with user-supplied force feedback. In Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology, UIST ’19, New York, NY, USA, pp. 815–829. External Links: ISBN 9781450368162, Link, Document Cited by: §II-A3.
  • [56] M. Slater, M. Usoh, and A. Steed (1994-01) Depth of presence in virtual environments. Presence: Teleoper. Virtual Environ. 3 (2), pp. 130–144. External Links: ISSN 1054-7460, Link, Document Cited by: §V.
  • [57] C. C. Smyth (2000) Indirect vision driving with fixed flat panel displays for near unity, wide, and extended fields of camera view. Proceedings of the Human Factors and Ergonomics Society Annual Meeting 44 (36), pp. 541–544. External Links: Document, Link Cited by: §II-A1.
  • [58] R. J. Stone (2001) Haptic feedback: a brief history from telepresence to virtual reality. In Haptic Human-Computer Interaction, S. Brewster and R. Murray-Smith (Eds.), Berlin, Heidelberg, pp. 1–16. External Links: ISBN 978-3-540-44589-0 Cited by: §I, §II-A3, §IV-E3.
  • [59] Y. Suzuki and M. Kobayashi (2005-01) Air jet driven force feedback in virtual reality. IEEE Computer Graphics and Applications 25 (1), pp. 44–47. External Links: Document, ISSN 1558-1756 Cited by: §IV-B.
  • [60] J. E. Swan, G. Singh, and S. R. Ellis (2015-11) Matching and reaching depth judgments with real and augmented reality targets. IEEE Transactions on Visualization and Computer Graphics 21 (11), pp. 1289–1298. External Links: Document, ISSN 2160-9306 Cited by: §II-A1, §VII.
  • [61] S. Tachi, K. Komoriya, K. Sawada, T. Nishiyama, T. Itoko, M. Kobayashi, and K. Inoue (2003) Telexistence cockpit for humanoid robot control. Advanced Robotics 17 (3), pp. 199–217. External Links: Document, Link, Cited by: §II-A2.
  • [62] V. Vuskovic, M. Kauer, G. Szekely, and M. Reidy (2000-04) Realistic force feedback for virtual reality based diagnostic surgery simulators. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No.00CH37065), Vol. 2, pp. 1592–1598. External Links: Document, ISSN 1050-4729 Cited by: §II-A3.
  • [63] B. Weber, M. Sagardia, T. Hulin, and C. Preusche (2013) Visual, vibrotactile, and force feedback of collisions in virtual environments: effects on performance, mental workload and spatial orientation. In Virtual Augmented and Mixed Reality. Designing and Developing Augmented and Virtual Environments, R. Shumaker (Ed.), Berlin, Heidelberg, pp. 241–250. External Links: ISBN 978-3-642-39405-8, Document Cited by: §IV-B.
  • [64] M. Wellner, A. Schaufelberger, J. v. Zitzewitz, and R. Riener (2008) Evaluation of visual and auditory feedback in virtual obstacle walking. Presence: Teleoperators and Virtual Environments 17 (5), pp. 512–524. Cited by: §II-A2.
  • [65] B. G. Witmer and P. B. Kline (1998-04) Judging perceived and traversed distance in virtual environments. Presence: Teleoper. Virtual Environ. 7 (2), pp. 144–167. External Links: ISSN 1054-7460, Link, Document Cited by: §II-A1, §VII.
  • [66] J. O. Wobbrock, L. Findlater, D. Gergle, and J. J. Higgins (2011) The aligned rank transform for nonparametric factorial analyses using only anova procedures. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’11, New York, NY, USA, pp. 143–146. External Links: ISBN 978-1-4503-0228-9, Link, Document Cited by: §VI-A.
  • [67] H. A. Yanco and J. Drury (2004) "Where am I?" acquiring situation awareness using a remote robot platform. 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No.04CH37583), pp. 2835–2840. External Links: Document Cited by: §I.
  • [68] V. Yem, K. Vu, Y. Kon, and H. Kajimoto (2018-03) Effect of electrical stimulation haptic feedback on perceptions of softness-hardness and stickiness while touching a virtual object. In 2018 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pp. 89–96. External Links: Document Cited by: §II-A3.