I From 3DoF to 6DoF Object-Based Audio
I-A Recent Developments in 3-Degree-of-Freedom Media
Spatial audio reproduction technology has developed since the middle of the 20th century from mono to stereo to surround to immersive. George Lucas once said that audio accounts for “half the experience” when it is implemented in conjunction with visual reproduction in the cinema. More recently, thanks to the adoption of new immersive audio formats in the movie industry, it has become possible to achieve a greater degree of spatial coincidence between the on-screen visual cues and the audio localization cues delivered via loudspeakers to a fixed (seated) spectator in a theater.
These new multi-channel audio formats are also employed in the creation of cinematic Virtual Reality (VR) experiences and their delivery via Head-Mounted Display devices (HMD) which incorporate head-tracking technology enabling 3-degree-of-freedom (3DoF) experiences in which movements in the spectator’s head orientation are compensated for in both the visual and audio presentations. These new forms of linear media have begun introducing mainstream consumers and the entertainment industry to immersive and positional audio concepts, and to new capabilities over previous stereo or multi-channel audio formats [67, 28, 66]:
extended positional audio playback resolution and coverage (including above or below the horizontal plane), as well as spatially diffuse reverberation rendering
flexible loudspeaker playback configurations, and support for encoding and rendering an audio scene in High-Order Ambisonic format (HOA) 
ability to substitute or modify selected elements of the audio mix (encoded as audio objects) for content customization such as language localization or hearing enhancement.
Some limitations persist. For instance, it is not practical, in current 3DoF immersive audio formats, to substitute objects that carry spatially diffuse reverberation or to allow free listener navigation “inside the audio mix” at playback time. Essentially, a 3DoF audio object represents a virtual loudspeaker oriented towards the listener and allowed variable position on a surrounding surface, as illustrated in Fig. 1.
I-B Towards the 6-Degree-of-Freedom Audio Metaverse
In interactive Virtual Reality or Augmented Reality (AR) experiences, the audio-visual presentation must also account for possible translations of the user’s position along up to three coordinate axes, simulating 6-Degree-of-Freedom (6DoF) navigation inside a virtual world, hereafter referred to as the Metaverse [77, 8]. Like the physical world, the Metaverse may be modeled, from a human perception and cognition standpoint, as a collection of audio-visual objects (each a sound source and/or a light emitter or reflector) coexisting in an environment that manifests itself to the spectator through ambient lighting and acoustic reflections or reverberation.
Unlike 3DoF experiences, 6DoF scenarios may allow a user to walk closer to an object or around it, or may allow an object to move freely in position or orientation (in other words, audio objects may also be allowed 6 degrees of freedom). The acoustic properties of an audio object should include its directivity pattern, so that sound intensity and spectral changes may be simulated convincingly upon variation in its orientation relative to the position of the listener. Existing 6DoF scene representation formats and technologies for authoring and rendering navigable virtual worlds include VRML (for web applications), MPEG-4 (for synchronized multimedia delivery), and game engines [80, 73, 75]. New standards under development – such as OpenXR, WebXR, and MPEG-I – address modern VR and AR content and devices, including support for listener navigation between multiple Ambisonic scene capture locations [79, 85, 64, 56].
I-C The Coming-of-Age of Wearable 3D Audio Appliances
On the device side, several recent technology innovations accelerate the deployment of 6DoF audio technology:
New 3DoF listening experiences and devices have begun introducing mainstream consumers to head-tracking technology (as an enhancement beyond traditional “head-locked” headphone listening experiences), with or without coincident visual playback.
The recent introduction of low-latency wireless audio transmission technology enables “unleashed” mobile listening experiences.
Acoustically transparent binaural audio playback devices enable AR audio reproduction – including hear-through earbuds, bone conduction headsets, audio-only glasses, and audio-visual HMDs.
In traditional headphone audio reproduction, binaural processing is often considered to be an optional function, aspiring to “help the listener forget that they are wearing headphones” by delivering a natural, externalized listening experience. Breaking from this legacy, the new generation of non-intrusive wireless wearable playback devices requires binaural 3D audio processing technology, as a necessity to ensure that the audio playback system remains inconspicuous.
I-D Shared Immersive Experiences in the Audio Metaverse
Categories of applications facilitated by this technology evolution include:
Social co-presence, including virtual meetings, remote collaboration, and virtual assistants
Assisted training, education, diagnosis, surgery, design, or architecture
Virtual travel, including virtual navigation or tourism, and situational awareness
Immersive entertainment, including games, music, and virtual concerts.
In some of these use cases, as illustrated in Fig. 2, a Metaverse portion or layer may be experienced simultaneously by multiple connected users navigating independently or actively engaged in communication or collaboration. Additionally, portions of the object or environment information may be served or updated at runtime by several independent applications employing a common syntax and protocol to transmit partial scene description data to be parsed, consolidated and processed by each local rendering device.
A general, versatile audio technology solution supporting these applications should satisfy both “you are there” experiences (VR, teleportation to a virtual world) and “they are here” use cases (AR, digitally extending the user’s physical world without obstructing their environment). From an ecological point of view, an aim of the technology is to suspend disbelief: alleviate the user’s cognitive effort by matching implicit expectations derived from natural experience. This translates, in particular, to the requirement of maintaining consistency between the audio and visual components of the scene experienced by the user (audio/visual congruence) and between its virtual elements and the real world as experienced previously or concurrently (virtual/real congruence).
I-E Environmental Audio for VR and AR
With regard to the modeling of acoustic reverberation and sound propagation, an application programming solution for the Audio Metaverse should meet the following criteria:
In VR applications, environment reverberation properties should be tunable by a sound designer via perceptually meaningful parameters, thereby extending the creative palette beyond the constraints and limitations inherent in physics-based environment simulation.
In AR applications, the virtual acoustic scene representation should be adaptable to match or incorporate local environment parameters detected at playback time, including room reverberation attributes as well as the geometrical configuration and physical properties of real or virtual acoustic obstacles or reflectors.
Accordingly, the following capabilities should be enabled by a versatile programming and rendering solution:
At creation: specify a collection of virtual sound sources and the intrinsic properties of each, including its acoustic directivity and a sound generation method (e.g. voice communication, procedural sound synthesis, etc.)
At runtime: according to application scenario and events, set source and listener positions and orientations, reverberation properties and acoustic simulation parameters to match local or virtual environment conditions.
I-F Spatial Audio Rendering Technology for the Metaverse
Binaural 3D audio rendering technology is essential in 6DoF virtual experiences mediated by wearable or head-mounted displays. In this paper, the requirements to be met by a successful solution will be broken down as follows:
Binaural rendering: perceptually grounded audio signal processing engine designed to allocate rendering computation resources towards the minimum viable footprint, scalable to wearable or battery-powered devices (see methods reviewed in Section III of this paper). Audio fidelity should be predictable and maximally consistent across platforms or conditions. The level-of-detail prioritization strategy should be based on perceptual significance ranking of the audio features.
Application development facilitated by distinguishing two levels of Application Programming Interface (API):
1) Low-level audio rendering API reflecting the above psycho-acoustically based requirements. In Section II, we examine the progressive stages of elaboration of a practical and generic solution developed for this purpose, extending PC and mobile game audio standards.
2) High-level scene description API distilled to the critical functional features required to optimize development efficiency. In Section IV, we review practical acoustic propagation models and strategies which enable, from high-level scene description data, the derivation of the low-level audio rendering API parameters that control the digital audio scene delivery to each user.
Fig. 3 illustrates the proposed functional composition of the overall audio rendering system. By collecting this information in the present synopsis paper, we aim to promote cooperation towards the development of a practical multi-application, multi-platform Audio Metaverse data model coupled with an open acoustic rendering API, mature for industry-wide standardization to realize the promise of 6DoF spatial audio technology innovation outlined above in Section I-D.
II Audio Rendering API for the Metaverse
Computer-generated audio reproduction technology for VR and AR leverages decades of signal processing innovation in interactive audio rendering systems and application programming interfaces – building and extending upon prior developments in the fields of computer music and architectural acoustics (including binaural techniques, artificial reverberation algorithms, physical room acoustics modeling and auralization [19, 48, 44, 62]).
We begin this section with an overview of early developments in interactive audio standards for PC and mobile gaming, leading up to the OpenAL EFX API. We then highlight some of the new enhancements incorporated in Magic Leap’s Soundfield Audio API (MSA), targeting today’s AR and VR applications [76, 5]. Table I summarizes the evolution in successive capability extensions among these 6DoF object-based audio rendering APIs.
|Audio rendering API features||(3DoF)||I3DL1, DSound||I3DL2, EAX2||OpenAL EFX||MSA|
|Direction of arrival (per-object)|
|Source orientation (per-object)|
|Source directivity (per-object)|
|Filtering effects (per-object)|
|Listener’s room (global)|
|Multiple rooms (global)|
|Clustered reflections (per-object)|
|Control frequencies (global)||,||, ,|
II-A Interactive Audio Standards for PC and Mobile Gaming
Technology standards were created in the 1990s for PC audio, and later for mobile audio, to address the demand for interoperable hardware platforms and 3D gaming software applications, comparable in nature to the need that motivates present VR/AR industry initiatives such as OpenXR. Shared API standards were developed for graphics rendering (notably OpenGL) and, simultaneously, for audio rendering – including I3DL2, EAX and OpenAL [31, 34, 29].
II-A1 Positional audio properties
the position and the orientation of the listener and of each sound source, and a per-source gain correction
automatic relative attenuation of sound as a function of source-listener distance beyond a , customizable via a property
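The distance-based attenuation of the direct sound can be illustrated with a minimal sketch. It assumes the inverse-distance clamped model defined in the OpenAL 1.1 specification, without the optional maximum-distance clamp; the parameter names `reference_distance` and `rolloff_factor` follow the OpenAL convention and are assumptions here, as is the function name.

```python
def distance_gain(distance, reference_distance=1.0, rolloff_factor=1.0):
    """Linear gain applied to the direct sound of one source.

    Assumed model: OpenAL-style inverse distance with clamping, so that
    no boost is applied inside the reference distance.
    """
    d = max(distance, reference_distance)
    return reference_distance / (
        reference_distance + rolloff_factor * (d - reference_distance)
    )
```

With the default values, doubling the distance from 1 m to 2 m halves the gain (about -6 dB); a rolloff factor below 1.0 makes the attenuation more gradual.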
II-A2 Environmental audio extensions (single room)
I3DL2 and EAX 2.0 (1999), and then OpenAL EFX and OpenSL ES, included properties describing an “environment” (room) in which the listener and sound sources coexist. They decompose the rendered audio signal for each source into direct sound, early reflections and late reverberation, and expose the following features [31, 34, 57, 55]:
reference frequency where high-frequency attenuations and the property are specified
per-source control of gain and high-frequency attenuation for and (i.e. jointly Reflections and Reverb)
spectral extension of the source radiation model: high-frequency attenuation controlling a low-pass filter effect dependent on source orientation (more details in Section II-B2)
globally defined air absorption model controlling a distance-dependent low-pass filter effect
extension of the distance-based model to automatically apply a relative attenuation to the reflections and reverberation for distances beyond the , including a per-source property and a physical simulation mode (see Section II-B4)
sound muffling effects for simulating sound transmission through a room partition () or through/around an obstacle inside the room ().
II-A3 Multi-room extensions
selecting which of these four reverberators simulates the listener’s environment at the current time
separately controlled per-source contributions into several of these reverberators (see Fig. 6)
reverberator output occlusion and diffuse panning model (illustrated in Fig. 7)
3-band control of spectral effects using two reference frequencies, and (see Section II-B1).
II-B From OpenAL to Magic Leap Soundfield Audio (MSA)
OpenAL and its EFX extension cover a broad section of the requirements of recent VR and AR applications for wearable interactive audio reproduction technology, and enable the binaural simulation of feature-rich audio scenes. In this section, we describe previously unpublished aspects of EAX/OpenAL functions, and how Magic Leap Soundfield Audio (MSA) expands upon these with the ambition to provide for the Audio Metaverse a complete yet efficient spatial rendering API – including the following enhancements:
extended spectral control of source radiation pattern for more realistic modeling of 6DoF objects
per-source spatialization of clustered reflections, enabling improvements in the spatial imaging of distant audio objects and in geometry-driven acoustic simulation
MSA also differs from OpenAL EFX by more explicitly distinguishing low-level vs. high-level API: listener geometry (position and orientation tracking) is handled in the high-level scene description API, while the low-level rendering API, described in this section, is egocentric (i.e. source position and orientation coordinates are defined as “head-relative”).
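The conversion from world coordinates (high-level scene description) to head-relative coordinates (low-level rendering) can be sketched as follows. This is a simplified horizontal-plane illustration, not the MSA implementation: the function name, the x-forward/y-left axes, and the counterclockwise-positive yaw are assumptions.

```python
import math

def world_to_head_relative(source_xy, listener_xy, listener_yaw_rad):
    """Express a source position in the listener's head-relative frame.

    Assumed convention: in the head frame, +x is the look direction and
    +y points to the listener's left; yaw is measured counterclockwise
    from the world +x axis.
    """
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    c, s = math.cos(listener_yaw_rad), math.sin(listener_yaw_rad)
    # Rotate the world-frame offset by the inverse of the head yaw.
    return (c * dx + s * dy, -s * dx + c * dy)
```

A listener at (1, 1) facing the world +y axis hears a source at (1, 3) straight ahead at a distance of 2 m, whatever the absolute coordinates of the scene.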
II-B1 3-band parametric spectral control
Three control frequencies are specified globally and enforced throughout the audio rendering API to set or calculate frequency-dependent properties and parameters. This approach, inherited from Ircam Spat, I3DL2 and EAX, exploits the proportionality property of the shelving equalizer described in  and Fig. 8, which allows lumping into one second-order IIR filter section (“biquad”) the combination of several cascaded dual-shelving equalizer effects sharing a common set of control frequencies [35, 45, 6]. This enables the computationally efficient implementation of global and per-source gain corrections in three frequency bands throughout the MSA rendering engine.
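The practical consequence of sharing one set of control frequencies can be sketched in a few lines: several cascaded 3-band corrections reduce to a single (low, mid, high) gain triplet whose dB values are the per-band sums, which one dual-shelving filter can then realize. The function names below are hypothetical, and the filter design itself is not shown.

```python
def combine_band_gains_db(*stages):
    """Sum per-band gains in dB across cascaded 3-band stages.

    Each stage is a (low_db, mid_db, high_db) tuple. All stages are
    assumed to share the same control frequencies, so that the cascade
    can be lumped into one equivalent dual-shelving equalizer.
    """
    low = sum(stage[0] for stage in stages)
    mid = sum(stage[1] for stage in stages)
    high = sum(stage[2] for stage in stages)
    return (low, mid, high)

def db_to_linear(gain_db):
    """Convert a gain in dB to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)
```

For instance, a per-source correction of (-3, 0, -6) dB followed by a global correction of (-1, 0, -2) dB amounts to a single filter specified by (-4, 0, -8) dB.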
II-B2 Frequency-dependent source directivity
As in OpenAL EFX, the property of a source is frequency-dependent so that the direct sound component will be automatically filtered according to the orientation of the source relative to the position of the listener, resulting in a natural sounding dynamic low-pass filtering effect in accordance with listener navigation or when the source points away from the listener.
Additionally, a dual-shelving filter approximating the diffuse-field transfer function of the sound source is derived and applied to its reflections and reverberation. As a result, simulating a source that is more directive at higher frequencies, as is typical of natural sound sources, automatically produces a low-pass spectral correction of its reverberation. This is a noticeable timbral characteristic specific to each sound source according to its directivity vs. frequency, predicted by the statistical model developed in . (In Ircam Spat, this natural property of a sound source can be tuned directly by adjusting its ’’ dual-shelving filter parameters [35, 16]).
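The link between directivity and reverberation timbre can be illustrated numerically. The sketch below assumes a first-order directivity pattern a + (1 - a) cos(theta), normalized to unity on axis, whose mean squared value over the sphere is a^2 + (1 - a)^2 / 3. This pattern family, the per-band coefficient `a`, and the function names are illustrative assumptions, not the model used in MSA.

```python
import math

def diffuse_field_gain_db(a):
    """Diffuse-field power (in dB re. on-axis) of a first-order pattern.

    Assumed pattern: a + (1 - a) * cos(theta), with a in [0, 1]
    (a = 1 is omnidirectional, a = 0.5 is a cardioid).
    """
    power = a * a + (1.0 - a) ** 2 / 3.0
    return 10.0 * math.log10(power)

def diffuse_field_eq_db(a_low, a_mid, a_high):
    """Per-band (low, mid, high) correction for reflections and reverb."""
    return tuple(diffuse_field_gain_db(a) for a in (a_low, a_mid, a_high))
```

A source that is omnidirectional at low frequencies (a = 1) and cardioid at high frequencies (a = 0.5) yields roughly (0, -2.3, -4.8) dB when the mid band uses a = 0.75, i.e. a low-pass correction of its reverberation.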
II-B3 Clustered reflections spatialization
MSA includes a method for efficient control and rendering of the early reflections, originally proposed in , using a shared clustered reflections generator (see Fig. 11 and Sections III-B and III-D for more details), which enables controlling separately for each individual sound source the following properties:
: relative delay of the first reflection received by the listener for this sound source.
: the gain of the early reflections cluster for each sound source is adjustable in three frequency bands, as described previously.
: an azimuth angle, relative to the listener’s ”look direction”, that specifies the average direction of arrival of the early reflections cluster for this source.
: equal to the magnitude of the combined Gerzon Energy Vector of the reflections , represents the directional focus of the clustered reflections around their average direction of arrival (see Fig. 7) – with maximum directional concentration for a value of 1.0. For a value of 0.0, the incidence of the reflections is evenly distributed (isotropic) around the listener’s head.
By default, the clustered early reflections are spatialized to emanate from a direction sector centered on the direct sound arrival for each source (as illustrated in Fig. 7). As in Ircam Spat, this behavior supports the auditory perception of source localization without requiring geometrical and acoustical environment data [46, 44]. Alternatively, in VR or AR applications where a physical environment description is accessible, the clustered reflections properties listed above may be controlled programmatically for each sound source (see Section IV).
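When a geometrical model supplies individual reflections, the cluster direction and focus can be derived from them. The sketch below computes a horizontal-plane Gerzon energy vector from hypothetical per-reflection gains and azimuths; the names `azimuth` and `focus` are used for illustration only.

```python
import math

def energy_vector(gains, azimuths_rad):
    """Return (azimuth_rad, focus) of a set of reflections.

    The energy vector is the energy-weighted mean of the unit vectors
    pointing to each reflection; its magnitude lies in [0, 1].
    """
    energies = [g * g for g in gains]
    total = sum(energies)
    if total == 0.0:
        return 0.0, 0.0  # no reflections: treat as isotropic
    x = sum(e * math.cos(a) for e, a in zip(energies, azimuths_rad)) / total
    y = sum(e * math.sin(a) for e, a in zip(energies, azimuths_rad)) / total
    return math.atan2(y, x), math.hypot(x, y)
```

Two reflections of equal gain arriving from +60 and -60 degrees give an azimuth of 0 and a focus of 0.5, whereas six equal reflections spaced evenly around the head give a focus of 0.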
II-B4 Enhanced distance model
As illustrated in Fig. 9, the automatic attenuation of sound components vs. distance is extended to manage separately the direct sound, the early reflections and the late reverberation intensities for each sound source. For an unobstructed sound source located beyond the in the same room as the listener, the following behavior applies by default:
The early reflections are subject to the same relative attenuation as the direct sound component. This emulates the perceptual model realized in Ircam Spat, wherein the attenuation applied to the early reflections segment substantially matches the attenuation applied to the direct sound as the source distance varies [46, 35, 44].
The late reverberation is subject to a frequency-dependent attenuation according to the room’s reverberation decay time, as explained in Section II-B5 below.
II-B5 The reverberation fingerprint of a room
Following an acoustic pulse excitation in a room, the reflected sound field builds up to a mixing time after which the residual acoustic reverberation energy is distributed uniformly across the enclosure [36, 60]. A numerical simulation example included in the Appendix illustrates this phenomenon. The ensuing reverberation decay can be simulated accurately for VR or AR applications by considering, regardless of source or listener position and orientation [39, 36]:
an intrinsic property of the room: its reverberation fingerprint, encompassing the reverberation decay time (function of frequency) and the room’s cubic volume, through which the source’s acoustic power is scattered
an intrinsic property of the sound source: its diffuse-field transfer function (see Section II-B2), which scales the acoustic power radiated into the room, accumulated over all directions of emission.
The distance model described previously in Section II-B4 requires the rendering engine to attenuate the reverberation with a per-source frequency-dependent offset that is a function of the source-listener distance and depends on the reverberation decay time, as illustrated in Fig. 10. This offset accounts for the variation of the remaining integrated energy under the reverberation’s power envelope after an onset time equal to the source-to-listener time-of-flight augmented by the . The property value can be thought of as representing the acoustic power amplification by the virtual room’s reverberation to a sound generated by the user, such as one’s own voice or footsteps .
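The distance-dependent reverberation offset can be sketched under the assumption of an ideal exponential decay: the energy remaining under the power envelope after an onset time t is reduced by 60 t / T dB, where T is the decay time in the band considered. The function name, the `extra_delay_s` parameter (standing for the additional delay added to the time of flight), and the speed of sound value are assumptions.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature

def reverb_offset_db(distance_m, decay_time_s, extra_delay_s=0.0):
    """Attenuation (dB, <= 0) of the reverberation for one source and band.

    Assumes an exponential power decay reaching -60 dB after
    decay_time_s, integrated from the onset time to infinity.
    """
    onset_s = distance_m / SPEED_OF_SOUND_M_S + extra_delay_s
    return -60.0 * onset_s / decay_time_s
```

Evaluating this function once per band with the corresponding decay time gives the frequency-dependent offset: for a source 34.3 m away, the offset is about -6 dB in a band where the decay time is 1 s, but only -3 dB where it is 2 s.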
II-C Summary: Audio Scene Rendering for the Metaverse
We have presented a generic egocentric rendering API for interactive 6DoF spatial audio and multi-room reverberation, emphasizing auditory plausibility and computational efficiency, compatible with both physical and perceptual scene description models, along the principles proposed previously in . Control via a perceptual audio spatialization paradigm is illustrated in . Mapping from a geometrical and physical scene model will be discussed in Section IV.
In particular, we extend the OpenAL EFX feature set to enable per-source spatialization of the early reflections by the clustered reflections rendering method previously envisioned in  and detailed further in Section III. This enables simulating the perceptually-based distance effect afforded by Ircam Spat’s Source Presence parameter [46, 35, 44].
For AR and VR applications, the Audio Metaverse is modelled as a collection of rooms each characterized by a “reverberation preset” representative of its response to an omnidirectional sound source located near the receiver. We exploit the notion of reverberation fingerprint, which provides a data-efficient characterization of the perceptually relevant characteristics of a room’s reverberation that are independent of source or receiver parameters. Virtual sound sources represented by their directivity properties can be seamlessly “dropped” into the environment at rendering time as audio objects assigned arbitrary dynamic position and orientation.
The reverberation fingerprints of the rooms composing a sector of the Metaverse may be inventoried for future retrieval and rendering, and estimated by automated acoustic analysis [53, 23]. A future extension is the validation and extension of this model for “non-mixing” or anisotropic reverberation environments such as open or semi-open spaces, coupled rooms, or corridors. Future work also includes incorporating the capability of representing spatially extended sound sources in this rendering API [61, 84].
III A Binaural Immersive Rendering Engine
Fig. 11 displays an audio processing architecture that supports the rendering API functions reviewed above. It indicates where dual-shelving equalizers are inserted in order to realize per-source or global spectral corrections (see Section II-B1). In this section, we build this audio renderer in successive stages of extension, following the evolution outlined in Table I:
3D positional direct sound rendering for multiple audio objects, including spatially extended sound sources
spatially diffuse rendering of the late reverberation in the listener’s room (Section III-B)
addition of the acoustic reverberation emanating from adjacent rooms (Section III-C)
computationally efficient rendering of the early reflections for each sound source (Section III-D).
III-A Binaural 3D Positional Audio Rendering
In this section, we address the binaural rendering of the direct sound component for a collection of audio objects, critical for faithful reproduction of their individual location, loudness and timbre . Fig. 12 shows the reference Head-Related Transfer Function (HRTF) filter model for one source at a given location: frequency-independent bilateral delays cascaded with a minimum-phase filter stage [38, 49]. Figures 13 and 14 display HRTF filter specification examples based on measurements performed on artificial heads. We focus on efficient methods applicable for virtually unlimited object counts, where per-source computation is minimized by (a) panning and mixing in an intermediate multichannel format, then (b) transcoding to binaural output (see e.g. ). Fig. 11 distinguishes three multichannel mixing modes:
Travis proposed ambisonic to binaural transcoding for interactive audio, including 3DoF compensation of listener head movements by rotation of the B-Format mix. The Virtualizer performs ambisonic decoding to a chosen head-locked virtual loudspeaker layout, as illustrated in Fig. 14. This approach is extended to high-order ambisonic encoding in . Alternatively, a pairwise amplitude panning technique such as VBAP may be used, wherein listener head rotations are accounted for at the per-source panning stage. For practical channel counts, these approaches are inaccurate in terms of HRTF reconstruction, which can result in perceived localization errors [32, 86, 9]. Frequency-domain parametric processing can mitigate this drawback, at the cost of increasing virtualizer complexity [26, 15, 10].
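The head-rotation compensation by rotation of the B-Format mix can be sketched for the horizontal first-order case. This is a minimal illustration rather than the implementation of the cited work: the function name, the channel naming (W, X, Y), the x-forward/y-left axes, and the counterclockwise-positive yaw are assumptions.

```python
import math

def rotate_b_format_yaw(w, x, y, head_yaw_rad):
    """Counter-rotate a horizontal first-order mix by the head yaw.

    W is omnidirectional and is left unchanged; X (front) and Y (left)
    are rotated by the inverse of the listener's head rotation so that
    the scene stays fixed in the world while the head turns.
    """
    c, s = math.cos(head_yaw_rad), math.sin(head_yaw_rad)
    return w, c * x + s * y, -s * x + c * y
```

A source encoded at 90 degrees to the left (X = 0, Y = 1) ends up straight ahead (X = 1, Y = 0) once the listener has turned the head 90 degrees to the left. The same rotation is applied per sample (or per block) to the whole mix, independently of the number of sources.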
III-A2 Left, Right (bilateral)
This approach realizes direct ITD synthesis for each individual sound source, as illustrated in Fig. 15. It is equivalent to the linear decomposition of the minimum-phase HRTF filter using a basis of spatial panning functions [32, 49] – for instance: first-order Ambisonics [42, 32], generalization to spherical harmonics of any order , or bilateral pairwise amplitude panning . These methods can be readily extended to include per-source processing for near-field positional audio rendering , and to customize ITD, spatial or spectral functions in order to match an individual listener’s HRTF dataset [32, 49].
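Per-source ITD synthesis requires an ITD value for each source direction. As a stand-in for measured or individualized ITD data, the sketch below uses the classical Woodworth spherical-head approximation, ITD = (a / c)(theta + sin theta), for a distant source in the horizontal plane. This formula is a substitute chosen for illustration, not the model specified in this paper; the head radius, the sign convention, and the function names are assumptions.

```python
import math

def woodworth_itd_s(azimuth_rad, head_radius_m=0.0875, speed_of_sound=343.0):
    """Interaural time difference for azimuths in [-pi/2, pi/2]."""
    return (head_radius_m / speed_of_sound) * (
        azimuth_rad + math.sin(azimuth_rad)
    )

def bilateral_delays_s(azimuth_rad):
    """Return (left_delay, right_delay) in seconds, both non-negative.

    Assumed convention: positive azimuth means the source is to the
    listener's left, so the right-ear signal is the delayed one.
    """
    itd = woodworth_itd_s(azimuth_rad)
    return (0.0, itd) if itd >= 0.0 else (-itd, 0.0)
```

For a source directly to the left, the right-ear delay is about 0.66 ms with the default head radius.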
This third approach, introduced in , allows rendering spatially extended sound sources (see ). Fig. 11 assumes that an identical pairwise amplitude panning method is employed for both the Standard mix and the Diffuse mix, the latter subjected to a decorrelation processing filter bank prior to virtualization (see e.g. [47, 14]).
III-B Simulating Natural Acoustic Reverberation
To incorporate environmental acoustics in the interactive audio rendering engine, EAX and I3DL2 adopted a processing architecture analogous to music workstations, where a reverberator operates as an auxiliary effect that can be shared between multiple sound sources – except for one important difference: here, the reverberator input filter can be tuned individually for each source according to its diffuse-field transfer function (see Sections II-A2, II-B2 and Fig. 11).
Reverberation engine designs have been proposed wherein a per-source early reflections generator feeds into a late reverberation processor that can be shared between multiple sound sources located in the same room as the listener [43, 21]. Here, as shown in Fig. 11, the reflections generator is shared too (clustered reflections module, Section III-D) and the late reverberation is rendered by a reverb module which receives a separate mix of the source signals.
Parametric reverberation algorithms suitable for this purpose have been extensively studied, including methods for analyzing, simulating or “sculpting” recorded, simulated or calculated reverberation responses [43, 24, 83, 21, 17, 87, 22]. Many involve a recirculating delay network as illustrated in Fig. 16 – where, referring to the reverberation API properties introduced in Section II-A2 [43, 37, 24],
is a set of recirculating delay units, whose summed length may be scaled via the modal property
denotes a matrix of feedback coefficients, whose sparsity may be adjusted via the property
each denotes an absorptive filter realizing a frequency-dependent dB attenuation proportional to the length of the delay and to the reciprocal of the decay time.
These filters may be realized with proportional parametric equalizers as described in  and Section II-B1, such that the property (resp. ) sets the decay time at (resp. ) relative to the mid-frequency . Additionally, cascaded reverb and Virtualizer processing is normalized to match the  and mimic the low-frequency interaural coherence contour observed in natural diffuse fields (Fig. 17) [38, 52, 87].
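The absorptive filter specification can be sketched per delay line: for a decay time T (time to decay by 60 dB), a delay of duration tau must be attenuated by 60 tau / T dB on each pass. The 3-band variant below expresses the low- and high-frequency decay times as ratios of the mid-frequency decay time; the names `lf_ratio` and `hf_ratio` are assumptions for illustration.

```python
def absorptive_gains_db(delay_samples, sample_rate_hz, decay_time_mid_s,
                        lf_ratio=1.0, hf_ratio=1.0):
    """Per-pass (low, mid, high) gains in dB for one recirculating delay.

    The attenuation is proportional to the delay length and to the
    reciprocal of the decay time in each band.
    """
    delay_s = delay_samples / sample_rate_hz

    def gain_db(decay_time_s):
        return -60.0 * delay_s / decay_time_s

    return (
        gain_db(decay_time_mid_s * lf_ratio),
        gain_db(decay_time_mid_s),
        gain_db(decay_time_mid_s * hf_ratio),
    )
```

A 4800-sample delay at 48 kHz with a 2 s mid-frequency decay time loses 3 dB per pass; after 20 passes (2 s) the accumulated attenuation is 60 dB, as expected. The resulting triplet is the kind of 3-band specification that a proportional dual-shelving equalizer can realize.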
III-C Multi-Room Reverberation Rendering
As illustrated in Figures 6 and 11, several reverberators can be employed to simulate neighboring rooms in addition to an occupied room. Rendering multiple rooms requires both additional logic governing the amount of signal sent from audio sources to the different reverberation processing units and a more elaborate mechanism for handling the outputs of multiple reverberation units.
Any source-specific qualities desired in the reverberation output must be imparted when signal is sent from the source to a reverberation processing unit. Existing API solutions employ a reverb send module that sets gain, equalization, and delay for signal sent from a source to a specific reverberation unit [57, 59, 41]. A significant runtime complexity of a multi-room system stems from determining desired values for these source sends, which may simply be based on source position relative to the listener and room geometry but could also account for source orientation and radiation pattern, particularly relevant if a source is positioned near a boundary between rooms. Considerations such as these, that have been left to the caller by historical APIs, are discussed in Section IV.
In the case of a multi-room system that renders adjacent rooms, some method of spatially panning reverberation output becomes necessary, as illustrated in Fig. 18. As proposed in  and shown in Fig. 11, spatial reverberation panning may be achieved by routing reverberator output through a virtualization algorithm, requiring a number of sufficiently decorrelated reverberation output channels determined by the desired spatial resolution for reverberation output panning. Alternatively, a dedicated bus whose channels all undergo decorrelation processing prior to virtualization, labeled ‘Diffuse’ in  and Fig. 11, can be employed. In any case, the spatial resolution of reverb panning cannot exceed that provided by the number and placement of virtual loudspeakers used to render spatial content.
Some representation of a desired spatial panning result is required by the API, ideally one that minimizes its data footprint without compromising signal processing capability; as such it need not specify data beyond the spatial audio rendering system’s ability to translate into perceivable difference.
The EAX Reverb effect allows the setting of direction and dispersion (a measure of how wide or narrow the diffuse sound should be) to pan early reflections and late reverberation (see Fig. 7) . These two properties together describe a Gerzon vector as discussed in , and on which the MSA source and properties, used to pan clustered reflections for a source, are similarly based (as discussed in Section II-B3).
The panning representation must also accommodate the desired spatial resolution for reverberation panning, whether through a simple representation such as the one above or a more sophisticated one. Spherical harmonics are a potential candidate for higher resolutions and more complex patterns, with the benefits that they are agnostic to the virtual speaker layout, can vary in order, and can be rendered at lower orders at runtime with minimal audio disruption if resources are constrained. Direct representation of virtual speaker gains is another option which, while dependent on prior knowledge of the virtual speaker configuration in use, could leverage standardized positions for virtual speakers, or subsets of these.
Non-occupied rooms also require an additional equalization stage to apply attenuation as a result of distance and the manner of each room’s connection to the occupied room. For example, one may wish to render bleed of a neighboring room through a wall (i.e. occluded reverberation) as in Fig. 18. In , an equalization filter can be applied to reverberation output to simulate these effects; however, the spectral content does not vary across the spatial field.
One final consideration of a multi-room system is the issue of acoustic coupling between physical rooms the system seeks to model. Output signal from one or more reverberation units may be cross-fed to the input of one or more other reverberation units to simulate this effect, with the level and equalization of the cross-fed signal for each coupling set via API based on how the physical rooms are connected.
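The cross-feed described above can be sketched as a mixing step performed before each reverberation unit processes a block. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, each coupling is reduced to a single broadband gain (the per-coupling equalization is omitted), and the cross-fed signal is taken from the previous block's output so that no delay-free feedback loop is created.

```python
def cross_fed_inputs(sends, prev_outputs, coupling):
    """Mix each room's send with cross-fed output from the other rooms.

    sends[i]        -- input block (list of samples) sent to reverberator i
    prev_outputs[j] -- output block of reverberator j from the previous block
    coupling[j][i]  -- hypothetical broadband gain from room j into room i
    """
    n_rooms = len(sends)
    mixed = []
    for i in range(n_rooms):
        block = list(sends[i])  # copy so the caller's send buffer is untouched
        for j in range(n_rooms):
            if j == i or coupling[j][i] == 0.0:
                continue  # no self-coupling; skip unconnected rooms
            for k, sample in enumerate(prev_outputs[j]):
                block[k] += coupling[j][i] * sample
        mixed.append(block)
    return mixed


# Two rooms: room 0 leaks into room 1 through an opening, but not vice versa.
inputs = cross_fed_inputs(
    sends=[[1.0, 0.0], [0.0, 0.0]],
    prev_outputs=[[0.5, 0.5], [0.2, 0.2]],
    coupling=[[0.0, 0.1], [0.0, 0.0]],
)
```

Because the cross-fed paths form feedback loops between reverberation units, the coupling gains would presumably need to be kept small enough for the overall loop gain to remain below one.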
It has historically fallen to the caller of low-level rendering APIs (such as EAX or OpenAL) to determine the proper runtime values for reverb send and reverb output parameters in order to realize an intended multi-room environment, whether bespoke, observed, or some combination of the two. General considerations in this respect will be discussed in Section IV.
III-D Spatialization of Early Reflections
In order to realize differentiated early reflections control for each sound source (see Section II-B3), a computationally efficient approach is offered by the clustered reflections rendering method proposed in . As shown in Fig. 11, source signals are panned and mixed into a multichannel reflections send bus which feeds the aggregated contribution of all sound sources into a shared clustered reflections processing module.
In this section, we describe a realization in which this send bus employs a 3-channel first-order ambisonic format . As illustrated in Fig. 19, encoding is performed with the 2-dimensional ambisonic panning equations :
where is the triplet of panning gains, while and are respectively the per-source and properties defined in Section II-B3.
As shown in Fig. 20, the reflections send bus signal is decoded into 6 head-locked virtual locations spaced equally around the listener in the horizontal plane (Fig. 14), by use of the “in-phase” ambisonic decoding equations :
for = 1..6, where denotes a decoded output channel signal and denotes the corresponding virtual loudspeaker azimuth angle in the layout of Fig. 14. Each decoded signal is fed into an individual reflections generator realized by a nested all-pass filter  to produce a 6-channel reflections output signal that is forwarded via the Standard multichannel mix bus to the global Virtualizer (Fig. 11).
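The encode and decode stages described above can be sketched as follows. This is a minimal illustration, not the renderer's actual equations: it assumes a unit-gain W channel, a hypothetical `focus` factor in [0, 1] standing in for the second per-source property, and virtual loudspeakers at 0°, 60°, ..., 300° (the actual angles of Fig. 14 are not reproduced here). Under this normalization, the first-order in-phase decoder reduces to a cardioid pattern whose gains are never negative.

```python
import math


def encode_2d_first_order(azimuth_deg, focus=1.0):
    """Panning gains (W, X, Y) for a horizontal first-order ambisonic bus.

    `focus` is a hypothetical directivity factor: 1 pans fully toward
    `azimuth_deg`, 0 yields an omnidirectional (W-only) send.
    """
    az = math.radians(azimuth_deg)
    return (1.0, focus * math.cos(az), focus * math.sin(az))


def decode_in_phase(wxy, speaker_azimuths_deg):
    """In-phase decode of (W, X, Y) to equally spaced horizontal speakers.

    With a unit-gain W channel, each output is (W + X cos a + Y sin a) / N,
    i.e. a cardioid pointing at speaker azimuth a.
    """
    w, x, y = wxy
    n = len(speaker_azimuths_deg)
    return [
        (w + x * math.cos(math.radians(a)) + y * math.sin(math.radians(a))) / n
        for a in speaker_azimuths_deg
    ]


SPEAKERS = [0, 60, 120, 180, 240, 300]  # assumed layout, not taken from Fig. 14
gains = decode_in_phase(encode_2d_first_order(0.0), SPEAKERS)
```

Since both stages are linear, applying them to panning gains, as here, is equivalent to applying them to the bus signals. For a source panned to 0° with full focus, the decoded gains sum to one and vanish at the opposite loudspeaker, which is the behavior that motivates the in-phase choice when feeding decorrelating reflection generators.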
IV Considering Acoustic Propagation
When considering support of advanced propagation-related features, one finds existing rendering engines such as MSA and its predecessor OpenAL EFX provide the means to realize an intended perceptual result, such as specifying equalization to be applied to a spatial source to simulate an occlusion effect. However, the conditions under which the effect should be applied, and the equalization itself, are left to the programmer (see e.g. [57, 59, 58]). These APIs prioritize universality, leaving choices regarding how best to utilize the signal processing components exposed by the rendering API to users building a wide range of applications, from the music and media uses discussed in Section I-A to the immersive 6DoF VR or AR experiences encountered in the Metaverse.
Complete spatial audio solutions supporting propagation features, including commercially available products such as Audiokinetic’s Wwise  or Apple’s PHASE , are also faced with determining the desired momentary perceptual result based on position and orientation of a user or avatar within a dynamic acoustic environment. While it is reasonable for a lightweight interface prioritizing portability to pass these determinations on to the developer, the emergence and success of more dedicated and feature-rich solutions indicate that with increased familiarity, improved computational capabilities, and larger yet more focused use cases (e.g. the Metaverse), support of higher-level propagation features has become more desired and expected. This expanding prevalence prompts discussion of terms and principles, such as material properties relevant to different propagation features, for any solution seeking to support these features while maintaining universality.
Ultimately, we envision a similarly-portable propagation-focused software layer on top of the existing rendering interface described in Section II, encapsulating the related computation needed to drive the perceptual properties of the lower-level interface. Algorithmic design decisions involving this propagation layer may span a wide range and have implications on how the perceptual layer beneath is employed, adding to the challenge of a universal solution. A primary role of such a propagation layer, as it seeks to simulate propagation in an acoustic environment, is the management of the acoustic environment itself. The data required to represent the acoustic environment is to some extent determined by algorithmic choices, but vocabulary and representation of perceptually measured properties of individual acoustic objects or elements in the acoustic scene as they relate to propagation features can be more universally considered.
In considering obstruction of the direct path (Fig. 21), the acoustic transmission loss through an obstructing object is determined by variations in the acoustic impedance along the sound path as it travels through the object. It is convenient to lump the perceptual result of this multi-stage effect (assuming the surrounding medium is air near room temperature and atmospheric pressure) into a single set of equalization values that can be stored with the object’s acoustic data as a transmission property. This transmission equalization can then simply be applied to sound paths traveling through the object. The number of equalization bands specified in this property should correspond with that used universally throughout the API; in the case of MSA, the equalization values correspond to the three-band paradigm discussed in Section II-B1.
Relating to objects’ effects on reflections (Fig. 22), one considers the ‘acoustic transmittance’ of a reflecting surface in a similar manner as transmission, with a reflective impedance whose resulting effect we save with the object’s acoustic data as a transmittance property.
Put more simply, this transmittance property represents the gain and equalization resulting from a single reflection (or ‘bounce’) off the object, simplifying computation of higher-order reflections’ compounded effects. Specular and diffuse scattering due to reflection are not separately considered at the object level, although they could be affected by other acoustic object properties, such as geometry or a roughness trait. The transmittance may be applied at an equalization stage either on an additional source, modelling a discrete specular reflection of an original source, or on a clustered reflections send or output, modelling grouped reflections and having a diffuse effect imparted by the nested all-pass reflections cluster algorithm (see Section III-D). In this latter case, the transmittance of multiple nearby objects may be considered in setting a clustered reflections output equalization stage. It may be noted that transmission, transmittance, and absorption must sum to one, at least for physics-compliant objects. Because absorption is not quantified, a violation of this law can only be identified, if such conservation is a priority, when the transmission and transmittance values sum in excess of one. As with other such intention-dependent considerations, enforcement of physically realistic values is left to the caller of the API.
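As an illustration of the compounding of successive reflections and of the conservation check mentioned above, consider the following sketch. The function names are hypothetical, and the per-object transmission and transmittance values are assumed to be stored as three per-band energy fractions in [0, 1] (low, mid, high), following the three-band paradigm; a real implementation may instead store gains in decibels.

```python
def compound_reflection(transmittance, order):
    """Per-band coefficient after `order` successive bounces off one surface.

    Each bounce multiplies the signal by the same per-band coefficient,
    so the compounded effect is the coefficient raised to the bounce count.
    """
    return tuple(band ** order for band in transmittance)


def violates_conservation(transmission, transmittance, tol=1e-9):
    """True if any band's transmission + transmittance exceeds one.

    Absorption is not stored, so only an excess over one can be detected.
    """
    return any(t + r > 1.0 + tol for t, r in zip(transmission, transmittance))


# Hypothetical wall material, as (low, mid, high) energy fractions.
wall_transmittance = (0.9, 0.8, 0.5)
second_order = compound_reflection(wall_transmittance, 2)
plausible = not violates_conservation((0.05, 0.1, 0.2), wall_transmittance)
```

Consistent with the text, the check is advisory only: a caller seeking a deliberately non-physical effect can simply ignore it.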
Discussion of an acoustic object’s physical properties in the context of diffraction becomes less clear (Fig. 23). Acoustic transmittance may play a marginal role, but more important are aspects of geometry, including object shape, size, and orientation. Discretely computed diffraction images, e.g. placed based on shortest-path search around the object at runtime (as in Fig. 23), may wish to consider the object’s in addition to the original source’s distance attenuation properties when adjusting the image gain and equalization.
Diffraction models that depend on offline computation allow variation of the acoustic object’s material properties to have an audible effect only if that computation takes these properties into account, unless a runtime variable is expressly provided for this purpose. Machine learning based approaches have also been considered that address multiple propagation effects simultaneously.
IV-D Computational Efficiency and Hybrid Architecture
Any performance-conscious propagation layer, or audio system in general, should have as few variables to set or compute at runtime as possible. Where possible, psychoacoustic as well as signal processing principles can be exploited to identify and optionally eliminate processing or variables that result in little to no difference in the user experience, guiding design decisions on a sufficient yet succinct runtime data representation that weigh fidelity and realism against performance and computational cost. Existing portable or modular solutions generally rely on perceptual audio, acoustics, and psychoacoustics concepts to guide the form of runtime data, both for conciseness and universality, while top-to-bottom solutions can select the form of runtime data most efficiently consumed within their system.
Another technique for diminishing runtime demands is to remove variables and computation from runtime operations entirely, and replace them with a compressed representation or other more easily consumed result of some offline computation, whether performed as a dedicated prior step, or as part of a compile or build process, or in a concurrent side process, perhaps on the edge or cloud. For example, Rungta et al. propose pre-computing a diffraction kernel for individual acoustic objects that considers object properties including geometry . This kernel is then used at runtime to compute the momentary transfer function to apply to the direct path based on input and output incident angles. The kernel could be replaced by a lookup table containing values for perceptual properties such as binaural delay, pan direction, and equalization to be set in a lower level rendering engine such as the solution described in Sections II and III.
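The lookup-table variant mentioned above might be sketched as follows. All names here are hypothetical and the table contents are placeholders: the sketch only shows how a precomputed grid, indexed by quantized input and output incident angles, could return perceptual properties ready to be set in a lower-level rendering engine. Nearest-neighbor quantization is an assumption made for brevity; interpolation between entries would be a natural refinement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiffractionEntry:
    """Perceptual properties precomputed offline for one angle pair."""
    delay_ms: float          # binaural delay
    pan_azimuth_deg: float   # pan direction
    eq_db: tuple             # (low, mid, high) equalization gains in dB


class DiffractionTable:
    """Nearest-neighbor lookup over a regular grid of incident angles."""

    def __init__(self, step_deg, entries):
        self.step_deg = step_deg
        self.entries = entries  # dict keyed by (in_index, out_index)

    def _index(self, angle_deg):
        n = round(360 / self.step_deg)
        # Wrap into [0, 360) and snap to the nearest grid point.
        return round((angle_deg % 360) / self.step_deg) % n

    def lookup(self, in_angle_deg, out_angle_deg):
        return self.entries[(self._index(in_angle_deg), self._index(out_angle_deg))]


# Placeholder table on a coarse 90-degree grid (values are illustrative only).
table = DiffractionTable(
    step_deg=90,
    entries={
        (i, j): DiffractionEntry(0.1 * i, 90.0 * j, (0.0, -1.0 * i, -2.0 * j))
        for i in range(4)
        for j in range(4)
    },
)
entry = table.lookup(in_angle_deg=85.0, out_angle_deg=270.0)
```

The runtime cost is then a pair of index computations and one dictionary access per source and object, at the expense of table storage that grows with the square of the angular resolution.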
Larger-scale examples of this strategy that target room or environment simulation also exist. Microsoft’s Project Acoustics simulates room impulse responses using computationally intensive acoustic simulation run on Microsoft’s cloud services platform, but derives from that analysis a small number of parameters for runtime use [65, 18]. Magic Leap’s Soundfield Audio solution (MSA) uses a separate software component to analyze in-situ microphone-captured audio signals and compute reverberation properties, creating an efficient data representation of the acoustic environment (the reverberation fingerprint discussed in Section II-B5) [53, 23].
In combination with emerging wearable augmented reality hardware and with the growing cloud-based ecosystem that underpins the Metaverse, binaural immersive audio rendering technology is an essential enabler of future co-presence, remote collaboration and virtual entertainment experiences. We have presented an efficient object-based 6-degree-of-freedom spatial audio rendering solution for physically-based environment models and perceptually-based musical soundscapes.
In order to support navigable Metaverse experiences where sound source positions and acoustic environment parameters may be served at runtime from several independent applications, the proposed solution facilitates decoupling and deferring to rendering time the specification of the positional coordinates of sound sources and listeners, and of the geometric and acoustic properties of rooms or nearby obstacles and reflectors, real or virtual, that compose the acoustic environment. The design of the renderer and of its programming interface prioritizes plausibility (suspension of disbelief) with minimal computational footprint and application complexity. The renderer and its interface expose, in particular, these novel capabilities:
The characterization of rooms by their reverberation fingerprint enables the faithful matching of acoustic reverberation decays, based on a compact data representation of virtual or real environments.
An efficient method for per-object control and rendering of clustered reflections, facilitating perceptually-based distance simulation for each sound source, and avoiding the burden of reproducing early reflections individually in physically-based models.
The proposed solution builds upon well-established and extensively practiced interactive audio standards, and is implemented as a core software component in Magic Leap’s augmented reality operating system.
The authors wish to acknowledge the contribution of Sam Dicker to the development of the rendering engine presented in this paper, as its principal software architect from its beginnings at Creative Labs to its recent maturation at Magic Leap. Jean-Marc would like to acknowledge collaborators in previous work referenced in this paper: Jean-Michel Trivi, Daniel Peacock, Pete Harrison and Garin Hiebert for the development and deployment of the EAX and OpenAL APIs; Olivier Warusfel, Véronique Larcher and Martin Walsh for acoustic and rendering research; Antoine Chaigne for initiating and supporting the investigation as doctoral supervisor.
-  (2021-05) Perceptual analysis of directional late reverberation. J. Acoustical Soc. of America. Cited by: §II-C.
-  (Website) External Links: Cited by: §IV.
-  (2018) A Perceptual Evaluation of Individual and Non-Individual HRTFs: A Case Study of the SADIE II Database. Applied Sciences 8 (11). Cited by: Fig. 14.
-  (2019-08) Reverberation Loudness Model for Mixed-Reality Audio. In Audio Eng. Soc. Conf. Headphone Tech., Cited by: 3rd item, §II-B5, §III-B.
-  (2018-10) Audio Application Programming Interface for Mixed Reality. In Audio Eng. Soc. 145th Conv., Cited by: §II-B1, §II.
-  (2018-10) Practical Realization of Dual-Shelving Filter Using Proportional Parametric Equalizers. In Audio Eng. Soc. 145th Convention, Cited by: Fig. 8, §II-B1.
-  (Website) External Links: Cited by: §IV.
-  (2021-01) The Metaverse Primer. External Links: Cited by: §I-B.
-  (2021-01) Binaural Reproduction Based on Bilateral Ambisonics and Ear-Aligned HRTFs. In IEEE/ACM Trans. on Audio, Speech, and Language Processing, Cited by: §III-A1, §III-A2.
-  (2010-10) A New Method for B-Format to Binaural Transcoding. In Audio Eng. Soc. 40th International Conf., Cited by: §III-A1.
-  (2020-07) Sound Externalization: A Review of Recent Research. Trends in Hearing. Cited by: §I-C.
-  (2004-11) George Lucas - Technology and the Art of Filmmaking. Mix Online. External Links: Cited by: §I-A.
-  (1983) Spatial hearing: the psychophysics of human sound localization. MIT Press. Cited by: §III-A.
-  (2004-10) Audio Signal Decorrelation Based on a Critical Band Approach. In Audio Eng. Soc. 117th Conv., Cited by: §III-A3.
-  (2008-10) Phantom Materialization: A Novel Method to Enhance Stereo Audio Reproduction on Headphones. IEEE Trans. Audio, Speech, and Language Processing. Cited by: §III-A1.
-  (2015-09) Twenty Years of Ircam Spat: Looking Back, Looking Forward. In 41st International Computer Music Conference, Cited by: §II-B2.
-  (2013-06) Parametric Control of Convolution Based Room Simulators. In International Symposium on Room Acoustics, Cited by: §III-B.
-  (2020-08) Cloud-Enabled Interactive Sound Propagation for Untethered Mixed Reality. In Audio Eng. Soc. Conf. on Audio for Virtual and Augmented Reality, Cited by: §IV-D.
-  (1971-01) The Simulation of Moving Sound Sources. J. Audio Eng. Soc.. Cited by: §II.
-  (Website) External Links: Cited by: Rendering Spatial Sound for Interoperable Experiences in the Audio Metaverse.
-  (2015-06) Efficient Synthesis of Room Acoustics via Scattering Delay Networks. IEEE/ACM Trans. Audio, Speech, and Language Proc.. Cited by: §III-B, §III-B.
-  (2020-09) Velvet-Noise Feedback Delay Network. In 23rd International Conference on Digital Audio Effects, Cited by: §III-B.
-  Blind Reverberation Time Estimation using a Convolutional Neural Network. In International Workshop on Acoustic Signal Enhancement (IWAENC), Cited by: §II-C, §IV-D.
-  (2002) Reverberation Algorithms. In Applications of Digital Signal Processing to Audio and Acoustics, Cited by: §III-B, §III-D.
-  (1992-03) General Metatheory of Auditory Localisation. In Audio Eng. Soc. 92nd Convention, Cited by: 4th item.
-  (2007-08) Binaural 3-D Audio Rendering Based on Spatial Audio Scene Coding. In Audio Eng. Soc. 123rd Convention, Cited by: §III-A1.
-  (2004-06) Augmented Reality Audio for Mobile and Wearable Appliances. J. Audio Eng. Soc.. Cited by: 3rd item, §I-C.
-  (2014-10) MPEG-H Audio - The New Standard for Universal Spatial / 3D Audio Coding. In Audio Engineering Society 137th Convention, Cited by: Fig. 1, §I-A.
-  (2005-06) OpenAL 1.1 Specification and Reference. Standard, Creative Labs. External Links: Cited by: 2nd item, 1st item, §II-A1, §II-A.
-  (1998-06) 3D Audio Rendering and Evaluation Guidelines. Standard, MIDI Manufacturers Association. Cited by: 1st item, §II-A1.
-  (1999-09) Interactive 3D Audio Rendering Guidelines - Level 2.0. Standard, MIDI Manufacturers Association. Cited by: Fig. 5, §II-A2, §II-A.
-  (1999-03) A Comparative Study of 3-D Audio Encoding and Rendering Techniques. In Audio Eng. Soc. 16th International Conference, Cited by: Fig. 15, §III-A1, §III-A2, §III-D.
-  (2006-05) Scene Description Model and Rendering Engine for Interactive Virtual Acoustics. In Audio Eng. Soc. 120th Convention, Cited by: §II-C.
-  (1999) Environmental Audio Extensions: EAX 2.0. Creative Technology, Ltd.. External Links: Cited by: §II-A2, §II-A.
-  (1995) Spat Reference Manual. IRCAM. External Links: Cited by: Fig. 8, 2nd item, §II-B1, §II-B2, §II-C.
-  (1997-Sept) Analysis and Synthesis of Room Reverberation Based on a Statistical Time-Frequency Model. In Audio Eng. Soc. 103rd Convention, Cited by: §II-B2, §II-B5, Rendering Spatial Sound for Interoperable Experiences in the Audio Metaverse, Rendering Spatial Sound for Interoperable Experiences in the Audio Metaverse.
-  (1991-02) Digital Delay Networks for Designing Artificial Reverberators. In Audio Eng. Soc. 90th Convention, Cited by: §III-B.
-  (1995-02) Digital Signal Processing Issues in the Context of Binaural and Transaural Stereophony. In Audio Eng. Soc. 98th Convention, Cited by: §III-A, §III-B.
-  (2016-09) Augmented Reality Headphone Environment Rendering. In Audio Eng. Soc. Conference on Audio for Virtual and Augmented Reality, Cited by: 3rd item, §II-B5, §II-C, Rendering Spatial Sound for Interoperable Experiences in the Audio Metaverse.
-  (2017-10) Efficient Structures for Virtual Immersive Audio Processing. In Audio Eng. Soc. 143rd Convention, Cited by: §III-A.
-  (2006-10) Binaural Simulation of Complex Acoustic Scenes for Interactive Audio. In Audio Eng. Soc. 121st Convention, Cited by: Fig. 7, §II-B3, §II-B, §II-C, Fig. 11, §III-A2, §III-A3, §III-C, §III-C, §III-C, §III-D.
-  (1998-09) Approaches to Binaural Synthesis. In Audio Eng. Soc. 105th Convention, Cited by: §III-A2.
-  (1997) Efficient Models for Reverberation and Distance Rendering in Computer Music and Virtual Audio Reality. In International Computer Music Conference, Cited by: Fig. 16, §III-B, §III-B.
-  (1999-01) Real-Time Spatial Processing of Sounds for Music, Multimedia and Interactive Human-Computer Interfaces. ACM Multimedia Systems Journal. Cited by: 2nd item, §II-B3, §II-C, §II.
-  (2015-10) Proportional Parametric Equalizers — Application to Digital Reverberation and Environmental Audio Processing. In Audio Eng. Soc. 139th Convention, Cited by: Fig. 8, §II-B1, §III-B.
-  (1995-06) Structured Model for the Representation and the Control of Room Acoustical Quality. In International Congress on Acoustics, Cited by: 2nd item, §II-B3, §II-C.
-  (1995) The Decorrelation of Audio Signals and Its Impact on Spatial Imagery. Computer Music J. 19 (4), pp. 71–87. Cited by: §III-A3.
-  (1993-11) Auralization - An Overview. J. Audio Eng. Soc.. Cited by: §II.
-  (2001-05) Techniques de Spatialization des Sons pour la Réalité Virtuelle. Ph.D. Thesis, Université Pierre et Marie Curie, Paris. Cited by: Fig. 13, Fig. 15, §III-A2, §III-A.
-  (1998-09) Equalization Methods in Binaural Technology. In Audio Eng. Soc. 105th Convention, Cited by: Fig. 14.
-  (2020-04) Minecraft, Fortnite And Avatars; How Lockdown Is Changing The Future Of Live Music. Forbes. External Links: Cited by: 4th item.
-  (2010-09) Investigations on an Early-Reflection-Free Model for BRIRs. J. Audio Eng. Soc. Cited by: Fig. 17, §III-B.
-  (2017-10) Blind Estimation of the Reverberation Fingerprint of Unknown Acoustic Environments. In Audio Eng. Soc. 143rd Convention, Cited by: §II-C, §IV-D.
-  (2003-06) A 3D Ambisonic Based Binaural Sound Reproduction System. In Audio Eng. Soc. 24th Conference: Multichannel Audio, The New Reality, Cited by: §III-A1.
-  (2011-07) OpenSL ES Specification, Version 1.1. Standard, The Khronos Group Inc.. External Links: Cited by: Fig. 4, §II-A1, §II-A2.
-  (2019-03) Toward Six Degrees of Freedom Audio Recording and Playback Using Multiple Ambisonic Sound Fields. J. Audio Eng. Soc.. Cited by: §I-B.
-  (2006-07) OpenAL Effects Extension Guide. Standard, Creative Labs. Cited by: Fig. 5, §II-A2, §II-A3, §II, §III-C, §III-C, §III-C, §IV.
-  (2003) EAX 4.0 Introduction. Creative Technology, Ltd.. External Links: Cited by: Fig. 6, Fig. 18, §IV.
-  (2003) EAX 4.0 Programmer’s Guide. Creative Technology, Ltd.. External Links: Cited by: §II-A3, §III-C, §IV.
-  (1993-02) Playing Billiards in the Concert Hall: The Mathematical Foundations of Geometrical Room Acoustics. Applied Acou.. Cited by: §II-B5, Rendering Spatial Sound for Interoperable Experiences in the Audio Metaverse.
-  (2004-10) Decorrelation Techniques for the Rendering of Apparent Sound Source Width in 3D Audio Displays. In 7th International Conference on Digital Audio Effects, Cited by: §II-C, §III-A3.
-  (2007) The theory and technique of electronic music. World Scientific. External Links: Cited by: §II.
-  (1997-06) Virtual Sound Source Positioning Using Vector Base Amplitude Panning. J. Audio Eng. Soc.. Cited by: Fig. 1, §III-A1.
-  (2021-11) MPEG Standards for Compressed Representation of Immersive Audio. Proc. of the IEEE. Cited by: §I-B.
-  (2018-07) Parametric Directional Coding for Precomputed Sound Propagation. ACM Trans. on Graphics. Cited by: §IV-D.
-  (2019-10) Recommendation ITU-R BS.2076-2 - Audio Definition Model. International Telecommunication Union. Cited by: Fig. 1, §I-A.
-  (2012) Scalable Format and Tools to Extend the Possibilities of Cinema Audio. SMPTE Motion Imaging Journal 121 (8), pp. 63–69. External Links: Cited by: Fig. 1, §I-A.
-  (2017) Immersive sound: the art and science of binaural and multi-channel audio. Focal Press. Cited by: §I-A.
-  (2008-10) Near-Field Compensation for HRTF Processing. In Audio Eng. Soc. 125th Convention, Cited by: §III-A2.
-  (2019-11) Headphone Technology: Hear-Through, Bone Conduction, and Noise Canceling. J. Audio Eng. Soc.. Cited by: §I-C.
-  (2018-03) Diffraction Kernels for Interactive Sound Propagation in Dynamic Environments. IEEE Trans. Visual. and Comp. Graphics. Cited by: §IV-C, §IV-D.
-  (1999-09) Creating Interactive Virtual Acoustic Environments. J. Audio Eng. Soc. Cited by: 4th item.
-  (1999-Sept) AudioBIFS: Describing Audio Scenes with the MPEG-4 Multimedia Standard. IEEE Trans. Multimedia. Cited by: §I-B, 4th item.
-  (2015-04) Clean Audio for TV broadcast: An Object-Based Approach for Hearing-Impaired Viewers. J. Audio Eng. Soc.. Cited by: 3rd item.
-  (2021) Game audio programming 3: principles and practices. CRC Press. Cited by: 1st item, §I-B.
-  (2018-06) Soundfield Audio Component Reference. SDK v0.15.0 Documentation Magic Leap, Inc.. External Links: Cited by: Fig. 5, §II-B1, §II.
-  (1992) Snow crash. Bantam Books. Cited by: §I-B.
-  (2021-03) Learning Acoustic Scattering Fields for Dynamic Interactive Sound Propagation. In IEEE Conf. on Virtual Reality and 3D User Interfaces, Cited by: §IV-C.
-  (2021-06) The OpenXR Specification. The Khronos Group Inc.. External Links: Cited by: §I-B, §II-A.
-  (1997) The Virtual Reality Modeling Language. Standard, VRML Consortium Inc.. Cited by: §I-B, §II-A1.
-  (1996-11) Virtual Reality Perspective on Headphone Audio. In Audio Eng. Soc. 101st Convention, Cited by: §III-A1.
-  (2002-08) Rendering MPEG-4 AABIFS Content Through A Low-Level Cross-Platform 3D Audio API. In IEEE International Conf. on Multimedia and Expo, Cited by: §II-C.
-  (2012-07) Fifty Years of Artificial Reverberation. IEEE Trans. Audio, Speech, and Language Processing. Cited by: §III-B.
-  (2010-08) A 3-D Immersive Synthesizer for Environmental Sounds. IEEE Trans. Audio Speech and Language Processing. Cited by: §II-C.
-  (2020-07) WebXR Device API. W3C. External Links: Cited by: §I-B.
-  (2017-Sept) Analysis of Binaural Cue Matching using Ambisonics to Binaural Decoding Techniques. In 4th International Conference on Spatial Audio, Cited by: §III-A1.
-  (2019-04) Artificial Enveloping Reverberation for Binaural Auralization Using Reciprocal Maximum-Length Sequences. J. Acoust. Soc. Am.. Cited by: §III-B, §III-B.
-  (2019) Ambisonics: a practical 3D audio theory for recording, studio production, sound reinforcement and virtual reality. Springer. Cited by: 2nd item, §III-D.