Audio-Visual Spatial Alignment Requirements of Central and Peripheral Object Events

by Davide Berghi, et al.
University of Surrey

Immersive audio-visual perception relies on the spatial integration of auditory and visual information, which are heterogeneous sensing modalities with different fields of reception and spatial resolution. This study investigates the perceived coherence of audio-visual object events presented either centrally or peripherally with horizontally aligned or misaligned sound. Various object events were selected to represent three acoustic feature classes. Subjective test results from 18 participants in a simulated virtual environment indicate a wider capture region in the periphery, with an outward bias favoring more lateral sounds. Results for centered stimuli support previous findings for simpler scenes.




1 Background: The Ventriloquism Effect

The human brain tends to perceive audio and visual signals as a unified object even in the presence of a spatial mismatch between the locations of their sources, as long as the spatial misalignment is small enough. This illusion is known as the ventriloquism effect (VE) [Alais:2004:VE]. The strength of the VE is influenced by several factors, such as the unimodal localization precision or the type of audio-visual stimuli. Human audio-visual localization has been shown to behave differently when the stimuli are presented peripherally [BongLee:2015:APV]. Stenzel et al. [Stenzel:2018:PTC] observed significant variations when comparing stimuli with different acoustic features. Additionally, Kytö et al. [Kyto:2015:VEV] found that the VE could extend even farther when test participants are immersed in an AR scenario. Komiyama [Komiyama:1989] and Stenzel et al. [Stenzel:2018:PTC] found that a participant's musical training influences their ability to detect the spatial misalignment between audio and visual stimuli: with musically untrained participants, the size of the VE approaches double that achieved with trained participants.

Discrete object events in VR applications are not only presented straight ahead but may also appear in peripheral areas. The current research is motivated by the need to define the differences in the VE between central and peripheral stimulus presentation, taking into account the variations induced by the type of stimulus and the participant's training. The visual stimuli employed in the tests are 3D reconstructions of the items used by Stenzel et al. [Stenzel:2018:PTC] and were presented while the participants were immersed within the projection of a wide curved virtual environment.

2 Methods

The methods presented aim to assess the VE along the azimuth direction. A yes-no forced-choice test was adopted to assess whether the presented audio-visual spatial offset was perceivable or not. The data collected from the subjective tests were interpolated using the psychometric function (PF), which relates the strength of a stimulus to the probability of its correct classification. The offset angle at which 50% of the responses classify the stimuli as coherent is called the point of subjective equality (PSE), i.e. the strength of the VE. The PF typically takes the shape of a sigmoid function, normalized in the range [0,1], where 1 represents the absence of the tested attribute, i.e. when the stimuli are spatially aligned. The PF proposed by Wichmann et al. [Wichmann:2001:PFF] is given by

$$\psi(x; \alpha, \beta, \gamma, \lambda) = \gamma + (1 - \gamma - \lambda)\, F(x; \alpha, \beta)$$

where $x$ is the stimulus strength, $\alpha$ the overall position of the curve, $\beta$ the slope, $\gamma$ the guess rate, which represents the lower bound of the curve, and $\lambda$ the lapse rate, i.e. the responses given regardless of the stimulus intensity. $F$ denotes the underlying sigmoid.
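As an illustration, a PF of the form γ + (1 − γ − λ)·F(x; α, β) and the corresponding PSE can be sketched as follows. The logistic choice of F, its decreasing orientation (so that the curve is near 1 for aligned stimuli), and the function names `psychometric` and `pse` are assumptions made for this sketch; the sigmoid actually used in the study is not stated here.

```python
import math


def psychometric(x, alpha, beta, gamma=0.0, lam=0.0):
    """Probability of a 'coherent' response at offset x (degrees).

    Assumed sigmoid F: a logistic that decreases with offset and equals
    0.5 at x == alpha. gamma is the guess rate, lam the lapse rate.
    """
    f = 1.0 / (1.0 + math.exp(beta * (x - alpha)))
    return gamma + (1.0 - gamma - lam) * f


def pse(alpha, beta, gamma=0.0, lam=0.0, level=0.5):
    """Offset at which the PF crosses `level` (the PSE when level is 0.5)."""
    # Invert psi(x) = level for the logistic F assumed above.
    f = (level - gamma) / (1.0 - gamma - lam)
    if not 0.0 < f < 1.0:
        raise ValueError("level lies outside the range spanned by the curve")
    return alpha + math.log(1.0 / f - 1.0) / beta


# Hypothetical parameter values, chosen only to exercise the functions.
example_pse = pse(alpha=10.0, beta=0.5, gamma=0.02, lam=0.03)
```

With γ = λ = 0 the PSE coincides with α; non-zero guess or lapse rates move the 50% crossing away from α, which is why the PSE is computed from the full curve rather than read off the position parameter.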

Dataset. The dataset employed in the experiment is made up of 9 audio-visual stimuli. Each audio-visual stimulus consists of an audio clip and a volumetric video sequence made up of 3D geometry and a UV texture atlas at each time instance. The volumetric video sequences were captured in a multi-camera studio comprising 16 cameras set up in an inward-facing 360° configuration. The dataset was partitioned into three audio feature classes, “Continuous”, “Harmonic”, and “Discrete” sounds, in order to study how different acoustic properties influence the VE.

Experimental Design. With the aid of two projectors, the visual stimuli were projected on a white, curved, acoustically transparent screen covering approximately 150° in azimuth of the Surrey Sound Sphere. During the experiment, the VR scenario was constantly projected, and the 3D objects were presented randomly at 0°, +41.2° or -41.2° relative to the participants’ head-forward direction. The related audio signal was played time-synchronously with the visual object from one of the loudspeakers located behind the screen in the neighborhood of the visual position, as highlighted in Figure 1. The experiment was composed of a total of 288 stimulus presentations. The 3D objects were located within the virtual environment so as to be aligned with the three zero-offset loudspeakers. Before each stimulus presentation, a circular target was projected in the center of the screen to focus the participant’s sight centrally. Participants were asked to avoid head movements, yet they were permitted to redirect their gaze toward the foreground visual stimulus.

Figure 1: Loudspeaker positions overlaid on the projected virtual environment. Blue, yellow and red colors denote loudspeakers employed in audio stimulus reproduction. Bottom: division of the field of vision into central, inward and outward peripheral areas for analysis.

Class   Central              Inward               Outward
C       0.59   7.2    9.0°   0.55   6.6    9.7°   0.58   7.2   11.4°
D       0.52   5.0   10.6°   0.55   4.6   10.2°   0.42   5.2   15.7°
H       0.52   5.2   10.5°   0.50   5.4   11.1°   0.50   5.1   13.8°
All     0.55   5.7   10.0°   0.53   5.6   10.4°   0.52   5.8   13.2°

Table 1: Estimated PF parameters and PSEs (third value in each group, in degrees) for Discrete (D), Continuous (C), and Harmonic (H) audio feature classes.

3 Results

The PF parameters were estimated for different combinations of participant, audio-visual position, and stimulus levels. The audio-visual domain was divided into “central”, “inward”, and “outward” positions, as highlighted in the bottom part of Figure 1. In a final step, a Gaussian interpolation of the peripheral responses was performed to estimate the perceived coincidence angles.

Visual position. To determine the effect of the visual positions on the PSEs, an analysis of variance (ANOVA) was conducted. The results show that the visual position and the participant each have a significant effect on the PSEs, but their interaction was not significant. A post-hoc comparison using the Tukey HSD test indicated significant differences between the outward position and both the inward and central positions. These results are mirrored in the overall PSEs, with a PSE of 10.0° for the central position, 10.4° for the inward position, and 13.2° for outward audio-visual offsets.

Audio feature classes. Secondly, the PSEs were estimated for each combination of item and visual position. The PSEs measured in the central field of vision range from 7.8° to 13.0°, those for peripheral inward offsets from 7.8° to 13.3°, and those for outward offsets from 10.4° to 18.5°. The ANOVA revealed that both visual position and audio feature class influence the PSEs significantly. A further Tukey HSD post-hoc analysis on the audio feature classes revealed a significant difference between the Continuous sounds and each of the other two groups.

Coincidence angle. A Gaussian interpolation outlines an overall shift of 1.4° outwards with respect to the visual stimulus position. The size of the outward shift varies across the items, up to a maximum of 3.3°. The Discrete sounds class produced the greatest shift (3°), whereas it was smallest for Continuous sounds (0.8°).
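One way such a Gaussian interpolation could be carried out is sketched below: the proportion of "coherent" responses is modelled as a Gaussian of the signed audio-visual offset, and the fitted mean is read as the coincidence angle. The fitting procedure (least squares on the logarithm of the proportions), the sign convention (positive offsets are outward) and the name `fit_gaussian_peak` are assumptions for this sketch, not the procedure used in the study. The data are synthetic, generated from a known Gaussian.

```python
import math


def fit_gaussian_peak(offsets, proportions):
    """Return (mu, sigma) of a Gaussian fitted to response proportions.

    Fits ln(p) = a*x^2 + b*x + c by ordinary least squares. For a Gaussian,
    a = -1/(2*sigma^2) and b = mu/sigma^2, hence mu = -b/(2a).
    """
    pts = [(x, math.log(p)) for x, p in zip(offsets, proportions) if p > 0]
    if len(pts) < 3:
        raise ValueError("need at least three points with p > 0")
    # Power sums for the normal equations of the quadratic fit.
    s = [sum(x ** k for x, _ in pts) for k in range(5)]
    t = [sum(y * x ** k for x, y in pts) for k in range(3)]
    # Normal equations M @ [c, b, a] = t, with M[i][j] = s[i + j].
    m = [[s[i + j] for j in range(3)] for i in range(3)]

    def det3(q):
        return (q[0][0] * (q[1][1] * q[2][2] - q[1][2] * q[2][1])
                - q[0][1] * (q[1][0] * q[2][2] - q[1][2] * q[2][0])
                + q[0][2] * (q[1][0] * q[2][1] - q[1][1] * q[2][0]))

    d = det3(m)
    coeffs = []
    for col in range(3):  # Cramer's rule, one unknown per column
        mc = [row[:] for row in m]
        for i in range(3):
            mc[i][col] = t[i]
        coeffs.append(det3(mc) / d)
    c, b, a = coeffs
    if a >= 0:
        raise ValueError("data do not have a Gaussian-like peak")
    return -b / (2.0 * a), math.sqrt(-1.0 / (2.0 * a))


# Synthetic check: points sampled from a known Gaussian (not study data).
true_mu, true_sigma = 2.0, 8.0
offsets = [-20, -15, -10, -5, 0, 5, 10, 15, 20]
props = [math.exp(-(x - true_mu) ** 2 / (2 * true_sigma ** 2)) for x in offsets]
mu, sigma = fit_gaussian_peak(offsets, props)
```

On noise-free samples the fit recovers the generating mean and width up to floating-point error. With real response proportions, a weighted or nonlinear fit would likely be preferable, since taking logarithms amplifies noise at small proportions.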

Trained vs. Untrained. The PF was re-estimated separately for trained and untrained participants per visual position. Results show that the PSEs increase by 60%-100% for untrained participants for each position group. The Gaussian interpolations revealed an outward shift in the coincidence angle of 0.9° and 2.6° for musically trained and untrained participants respectively.

Figure 7: The PFs estimated per audio feature class (AFC) for (a) the central position, (b) the inward position, and (c) the outward position; (d) the Gaussian interpolation per AFC; and (e) the loudspeaker setup with overall subjective responses.

4 Discussion & Conclusion

It was shown that the size of the VE increased for peripheral presentations and was significantly larger outward in the periphery. The mean central PSE occurred at 10° offset. In the periphery, the inward offset at PSE was slightly larger, whereas the outward offset increased to 13°. This increase is reflected in an outward shift of the perceived coincidence angle. Continuous sounds produced the smallest PSEs for both central and peripheral stimuli; Discrete sounds resulted in the greatest shift. In all positions, ventriloquism had a stronger overall effect on untrained participants (15°) than on trained participants (9°). This effect was less marked for the inward periphery, consistent with a greater lateral bias of the perceived auditory location inferred from the coincidence angles: (visual 41.2°) trained 42.1°, untrained 43.8°. Further tests can study more complex ecological scenes.
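The figures quoted for trained and untrained participants can be cross-checked with a few lines of arithmetic. The snippet below uses only values stated in the text; the variable names are illustrative.

```python
# Values quoted in the text above (degrees).
visual_deg = 41.2
perceived_deg = {"trained": 42.1, "untrained": 43.8}
pse_deg = {"trained": 9.0, "untrained": 15.0}

# Outward shift of the coincidence angle relative to the visual position.
shift = {k: round(v - visual_deg, 1) for k, v in perceived_deg.items()}
# shift -> {'trained': 0.9, 'untrained': 2.6}

# Relative increase of the overall PSE for untrained participants.
increase_pct = 100.0 * (pse_deg["untrained"] - pse_deg["trained"]) / pse_deg["trained"]
```

The shifts match the 0.9° and 2.6° reported in the Results, and the relative increase of about 67% falls within the stated 60%-100% range.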