Log In Sign Up

Upmixing via style transfer: a variational autoencoder for disentangling spatial images and musical content

by   Haici Yang, et al.

In the stereo-to-multichannel upmixing problem for music, one of the main tasks is to set the directionality of the instrument sources in the multichannel rendering results. In this paper, we propose a modified variational autoencoder model that learns a latent space to describe the spatial images in multichannel music. We seek to disentangle the spatial images and music content, so the learned latent variables are invariant to the music. At test time, we use the latent variables to control the panning of sources. We propose two upmixing use cases: transferring the spatial images from one song to another and blind panning based on the generative model. We report objective and subjective evaluation results to empirically show that our model captures spatial images separately from music content and achieves transfer-based interactive panning.


page 1

page 2

page 3

page 4


Multiple Style Transfer via Variational AutoEncoder

Modern works on style transfer focus on transferring style from a single...

Translating Visual Art into Music

The Synesthetic Variational Autoencoder (SynVAE) introduced in this rese...

Learning a Latent Space of Style-Aware Symbolic Music Representations by Adversarial Autoencoders

We address the challenging open problem of learning an effective latent ...

Explicitly disentangling image content from translation and rotation with spatial-VAE

Given an image dataset, we are often interested in finding data generati...

Learning Style-Aware Symbolic Music Representations by Adversarial Autoencoders

We address the challenging open problem of learning an effective latent ...

Factorising Meaning and Form for Intent-Preserving Paraphrasing

We propose a method for generating paraphrases of English questions that...

Learning Interpretable Representation for Controllable Polyphonic Music Generation

While deep generative models have become the leading methods for algorit...

1 Introduction

While home theater and surround sound have become more common and affordable, most music is still in stereo. Approaches to play stereo content on multichannel speakers can be categorized into two kinds. The first uses matrix [gerzon1991optimal] to map the left and right content channels into each of the surround speakers. These operate in different sound modes, e.g. direct, stereo, and all channel stereo mode, without sources decomposition. The second approach relies on the separation of sound components from the stereo mixture, e.g., into music content and other ambient sounds [6082279, faller2006multiple] or into individual musical instruments [jeon2010robust]

that are then placed in the desired panning locations. There are different techniques proposed to the primary-ambient separation, including frequency-domain inter-channel coherence index computation

[5745013, avendano2002frequency, merimaa2007correlation-based], time or frequency-domain adaptive-filtering [irwan2002two, usher2007enhancement], PCA-based primary-ambient decomposition [baek2012efficient, ibrahim2016primary]

, and deep neural network (DNN)-based methods

[uhle2008supervised, ibrahim2018primary, choi2021exploiting]. Perceived azimuth direction has also proven to be beneficial for direct and ambient separation [kraft2015stereo]. Jeon et al. proposed an upmixing algorithm based on robust source separation with a post-scaling algorithm to compensate for the interference [jeon2010robust]. Another DNN-based upmixing model, introduced by Park [park2016subband]

, bypasses any explicit decomposition and generates the center and surround channels from stereo channels directly based on non-linear transformations learned in the DNN.

In this paper, we develop a generative model for upmixing. The goal is to implicitly separate the instrumental components contained in the stereo signal and rearrange them into the five-channel setting that renders the desired perceptual source directions. This setting of the problem leads to the fact that there can be many legitimate upmixed versions that correspond to a stereo signal, thus making optimization ill-defined on this one-to-many mapping function .

To address this issue, we firstly make a basic assumption, that the spatial images of a multi-channel music signal can be represented independently of the musical content. This means that two different 5-channel music signals with the same instrumentation and panning will map to an identical spatial representation even though the content—melody, harmony, etc—differs. Then, we hope to extract the source-specific spatial representation from the multichannel audio and utilize the learned representation to guide the test-time upmixing.To this end, we propose a modified variational autoencoder (VAE) model [KingmaD2014vae] and train it in a supervised manner. Similar to the ordinary VAE training, we use a 5-channel signal as both input and target. However, we also provide a stereo version of the 5-channel signal as input to the decoder and train the VAE bottleneck layer to exclusively capture the spatial images of the sources rather than an entangled representation of music components and spatial information.

We empirically show that the learned latent representation reflects the spatial images of the multichannel input, i.e., it is correlated to the panning strategy used to render the 5-channel input, and is invariant to the music content. Hence, the decoder of the trained VAE can generate the upmixed version of the input stereo with the guidance provided in the form of the latent variables. Moreover, we also show that panning information represented in the latent space can be transferred in-between different songs, enabling spatial style transfer.

2 Upmixing: a probabilistic formulation

We firstly describe a general probabilistic upmixing model where spatial information and the stereo music signal are still entangled. Then, we showcase how the assumption that spatial information and stereo audio are independent alters the proposed formulation. We further introduce an alternative deterministic downmixing process, to achieve the desired disentanglement.

We define the upmixing process probabilistically, as the likelihood of observing the 5-channel sample given the spatial information and stereo audio : , where

stands for the model parameters and the bar notation is the sample of a random variable, e.g.,

. We use superscript to denote the number of channels. However, we cannot compute the likelihood due to the intractable marginal distribution and posterior distribution . VAE employs variational inference to resolve this issue by proposing an encoder function that best approximates the true posterior. Coupled with concept of evidence lower bound (ELBO), the optimization is eventually defined as follows:


where denotes the prior distribution of the latent variables, thus forming a regularization term. Meanwhile, the main reconstruction objective is to maximize the expectation of the log-likelihood w.r.t. the approximated posterior distribution.

The desired disentanglement of stereo music content and spatial information assumes that and are independent from each other. Therefore, the joint posterior distribution can be factorized, i.e., . To further constrain that only reflects stereo music content while represents 5-channel spatial information, we employ a hard regularization trick by replacing the random variable with a simple downmixed version . Then, the probabilistic model is simplified into a deterministic downmixing process, , where denotes the pre-defined downmix function. We treat as a constant and rewrite eq. (1) as follows:


The loss function represents our proposed disentanglement method. As a modified VAE, the spatial information is learned via

while the decoder takes in extra stereo input to generate the multi-channel output

. The second term regularizes each latent variable by a standard normal distribution, i.e.,

, where is a

dimensional identical matrix.

Figure 1: Model architecture

3 Model Description

3.1 DNN architecture

Fig. 1 summarizes the overall model architecture. We employ densely connected convolutional blocks introduced in DenseNet [HuangG2017densenet] as our building blocks in both the encoder and the decoder. In both modules, we alternate 5 dense blocks with 4 transition layers. A dense block contains 5 convolutional layers, each of which produces

output channels. We set the stride to be 2 for the transition layers in the encoder to reduce the data dimension. During training, the encoder takes as input a 5-channel audio in the form of stacked magnitude spectrograms,

, where and are the frequency subbands and time-domain frames. Here, we introduce a subscript to denote the particular panning strategy for creating a 5-channel input signal. The encoder outputs two

-dimensional vectors

, which are the parameters of the -dimensional multivariate latent normal distribution. Latent variables are then sampled using the reparameterization technique [KingmaD2014vae], i.e., where .

The stacked stereo spectrogram, , is a unique feature of the proposed model. It is input to the decoder to provide the music signal for upmixing. During training, we first prepare a 5-channel signal under a panning condition , from which the encoder learns the latent vector . Meanwhile, the same music stems are rearranged into another 5-channel version using a different panning condition , whose stereo downmix is fed to the decoder. We rearrange the source locations in rather than reusing , to ensure that the left- and right panning in the stereo is unrelated to the left- and right panning in the 5-channel signal.

We repeat and stack to at the concatenation step to condition the spatial information on the stereo input. The transition layer that receives the stereo

does not change the tensor size, which is

. After repeating the values of -dimensional embedding vector at all bins, the input to the decoder is of size .

3.2 The Test-Time Generation Processes

For test-time generation, the decoder operates as the upmix generator, which takes stereo signal and the spatial information as a seed. Our model can accommodate two use-case scenarios:

  • [noitemsep,topsep=0pt, leftmargin=0in, itemindent=.15in]

  • Style transfer-based upmixing: We posit that panning style transfer from one music piece to another is possible. Here we define the panning style as the set of apparent source directions for each instrument. In this case, the user provides two input signals: as the source 5-channel audio with the panning style and as the stereo music to upmix. Note that during the test time we do not specify the panning method of as it doesn’t influence the process. The actual music in and differs and the aim of the generator is to create a 5-channel audio which is upmixed from and has the same panning method as . The proposed style transfer is conducted by sharing the latent variables in style extraction (i.e. encoding) and upmixing (i.e. decoding) processes. The encoder takes in and learns which encodes the source’s spatial panning. The generator takes in stereo seed signal and , and synthesizes the multi-channel version of the target . This indirect way enables a user interface where the user handles spatial control by providing an example song.

  • Blind upmixing: The blind upmixing method literally “generates” spatial panning information from the learned latent variables and applies it to the seed stereo signal . In our VAE, the generation process begins with a random sample only, hence the spatial image of the generated 5-channel audio sample is out of the user’s control. In the blind upmixing scenario, the sampling process is completely random without involving any intuition about the latent space.

3.3 The Baseline

Although blind upmixing has been studied extensively in the literature, our approach to upmixing via style transfer is novel. Hence, we are unable to use any existing model as a baseline. Instead, we build a simple baseline model by spreading each channel of to the front and rear channels of the same side in the 5-channel output. This is the same as the “all channel stereo” upmixing mode commonly used by home receivers. It is a straightforward way to perform upmixing and can perfectly preserve the sound quality of the stereo input because the process does not generate any artifacts. By comparing with the baseline model, we mainly seek to prove the functionality of our model on style-transfer upmixing.

4 Experimental Setup

4.1 Vector-Based Amplitude Panning for Data Preparation

To train and evaluate the model, it is necessary for us to know the panning method that is used to create the 5-channel audio, so as to examine whether the information captured in the latent space is correlated to the correct spatial images that we want to learn. Also, when having stem sources, we can expose the model to various artificial upmixing configurations where source locations change freely. However, such ground-truth spatial maps are not readily available for most 5-channel music signals we have access to.

To that end, we choose to build our own 5-channel dataset from individual musical instruments via the vector-based amplitude panning (VBAP) method [pulkki1997virtual]. VBAP provides an efficient equation for virtual sound source positioning, enabling control of an unlimited number of loudspeakers in an arbitrary two- or three-dimensional placement around the listener. In this paper, we employ a two-dimensional rendering space with 5 speakers, following the ITU’s 5.1 channel standard [ITU5.1] without the subwoofer, which is not considered in our upmix algorithm. For each instrument, we specify a virtual source direction. Then, we pan each instrument independently using the two adjacent speakers to the desired coming direction, based on the vector base formulation [pulkki1997virtual].

4.2 Datasets

We build a synthesized 5-channel dataset using the MUSDB18 dataset [musdb18-hq]. MUSDB18 provides pop songs in stereo format and four separated instrumental stems: vocals, drums, bass, and other. We split data set into training, validation, and testing subsets, which amount to about 5, 0.5, and 1.5 hours, respectively. The 5-channel versions are created using the VBAP method described in Sec. 4.1 by controlling the stem tracks. Stereo input signals are rendered by applying passive downmixing on the synthesized 5-channel, i.e., , where and . The order of the five channels in are : front left, real left, center, front right, and real right.

We mainly run our experiments on the synthesized MUSDB18 dataset. However, when validating the blind upmixing model, we employ another internal real-world dataset of professionally mixed surround music with approximately 17 hours for training, 2.5 hours for validation, and 4 hours for testing, in order to test our model’s generalization ability on real music data.

4.3 Training Setup

The directions of sources are randomly sampled from the entire

circle. The models are trained on 2.2 second-long segments, which have at least one instrument not silent. Each segment is processed by short-time Fourier transform (STFT) on Hann-windowed frames of 1024 samples with a 75% overlap, resulting in a spectrogram of size

. We apply the phase of the stereo input to recover 5-channel magnitude spectrograms. The left channel’s phase in stereo is used for both front and rear left channels in the 5-channel audio, and similarly for the right channels; the mean of the left and right channels is used for the center channel recovery. We use the Adam optimizer with an initial learning rate of 0.005 [KingmaD2015adam].

4.4 Objective Evaluation Methods

We evaluate our style transfer-based upmixing model against the baseline using various objective metrics as follows:

  • [itemsep=1pt,topsep=0pt,leftmargin=*]

  • Overall reconstruction quality: Scale-dependent source-to-distortion ratio (SD-SDR) is employed to report the overall reconstruction quality by comparing the time-domain reconstruction and the ground-truth as SD-SDR is proven to better reflect the scale reconstruction compared to ordinary SDR [LeRouxJL2018sisdr].

  • Spatial reconstruction quality: To validate the spatial reconstruction quality, we propose a new metric using the Wasserstein distance on interchannel level differences (WILD). Given the 5-channel magnitude spectrogram , we compute an ILD matrix for each pair of channels , e.g., . The five channels result in ten such ILD matrices in total, i.e.,

    . The Wasserstein distance computes on the histograms of target and estimation ILD matrices, i.e.,

    . Note that the Wasserstein distance improves the robustness of the comparison when the two histograms are too different from each other, where KL-divergence fails to quantify the dissimilarity.

  • Average angle difference: We use the least square algorithm to approximately decompose each of the 5 channels into a linear combination of four instrumental sources. Based on the relative amount of each instrument spread in the 5 channels, we can estimate the virtual location of the instrument. We report the difference between the ground-truth and estimated direction of the upmixed instruments in terms of angles in degree.

4.5 Subjective Evaluations

We conduct an ABX test to subjectively evaluate the proposed spatial style transfer algorithm. The ground-truth 5-channel version is provided to the participants as the reference. Participants are asked to choose the one that is more similar to between the style-transferred reconstruction and the baseline upmix , in terms of the incoming direction of the four sources separately, as well as the overall spatial image. The test contains 5 trials and is conducted in a professional surround-sound listening room.

5 Experimental Results

5.1 Analysis on the Learnt Latent Space

Figure 2: Visualization of latent variable activations.

To validate the amount of information disentanglement, we analyze the statistical properties of the learned latent variables , and whether the latent variables learn spatial information independently from music content. In Fig. 1(a), we show the activation of the mean of the latent distribution from two different panning configurations of the same song. The two uncorrelated graphs indicate that the dimensions learned in differ from each other although the two versions share the exact same stem tracks. On the contrary, Fig. 1(b) shows highly correlated activation of two completely different songs, because their corresponding latent variables succeed to learn the same representation based on their same panning locations. These results suggest that the learned latent variables are influenced more by the spatial arrangement of the stem tracks and not by music content, achieving the desired disentanglement.

Another visualization of the latent variables further strengthens our claim. We prepare a set of data samples, which are created from the combination of five different songs, and five different spatial configurations. If the latent space reflects the spatial information more than the music content, there must be a latent structure that reflects the spatial characteristics more. In Fig. 1(c) and 1(d)

, we show a dimension-reduced latent space via principal component analysis. In

1(c), the coloring is based on the songs, while in 1(d) it is based on the spatial configurations. We can see that the segments from the same spatial configuration form a cluster in 1(d)

, while segments that belong to the same song are scattered everywhere. The average standard deviation for each color group in

1(c) is relatively higher than the ones in 1(d), i.e., . This observation again indicates that the encoder extracts music-invariant spatial features successfully.

5.2 Style Transfer-Based Upmixing Results

We conduct the style transfer-based upmixing and compare the model’s performance against the baseline model. To this end, we prepare 540 test examples that are 10 seconds long and are rendered in one of the 20 arbitrarily defined panning configurations.


SD-SDR (dB) Angle () WILD
Style-transfer 8.711.50 19.515 58.0213.78
Baseline 4.200.71 49.7519.75 70.8410.03


Table 1: Objective performance of the style transfer method.
  • [noitemsep,topsep=0pt, leftmargin=0in, itemindent=.15in]

  • Objective evaluations: Table 1

    summarizes the objective evaluation results. Our proposed style transfer-based upmixing method shows an obvious advantage over the baseline upmixing model with a better reconstruction score, less difference in angle, and WILD. It is noteworthy that the style-transferred results bear higher SD-SDR than the fixed-panning-based baseline, even in the presence of the machine learning model’s algorithmic artifact. The SD-SDR score indicates that our model can properly fulfill the mission of arbitrarily positioning instruments during 5-channel generation without significant harm to the sound quality.

  • Subjective evaluations: Fig. 3 reports the percentage of subjects preferring the proposed style transfer-based upmixing over the baseline, regarding the direction similarity. Ten listening experts participated in our subjective test. More than 80% of participants believe the proposed model is more similar to the reference, in terms of the overall perceptual panning. When it comes to the individual instruments, the reconstruction of vocals is the most favored, which might account for the overall performance due to its dominance in pop music. Our model gains preference from the listeners on drums and other

    , although with higher variance on the latter one, while the advantage doesn’t pass to

    bass. We believe that the different preference on different instruments results from the fact that the decoder implicitly performs instrument separation and relocation for estimated sources: when an instrument is difficult to separate from the mixture, its relative scale in each channel and the resulting perceived direction in the 5-channel may be negatively affected, which also explains the phenomenon that our preference scores align to the diverse instrument-specific separation performances on the MUSDB18 dataset in the literature [StollerD2018waveunet].

5.3 Blind upmixing and other experiments

To test our model’s potential as a generator, we apply blind upmixing to MUSDB18 signals and to real-world music signals from our internal dataset. Because all the models are initially trained on MUSDB18, to let the model better accommodate real-world music signals, we fine-tune the model with real-world 5-channel music before blindly generating 5-channel examples. Due to the fact that the blind upmixing process is absolutely random, we cannot validate its quality and the surrounding effect via a quantitative measure. However, we find that the upmixing results are convincing based on informal listening by our peers. We provide some blind upmixing samples, both from MUSDB18 and the real-world dataset at

We evaluated different latent dimensions ranging from to , while there is no strong enough trend to recommend a single number. We eventually chose for all the experiments.

Figure 3: Subjective tests on the rendered multichannel samples.

6 Conclusion

In this paper, we proposed to tackle the stereo upmixing problem in a generative way, where spatial images and music content in the 5-channel signal are disentangled to allow style transfer-based upmixing. We formulated the problem into a modified VAE model and train the latent space to capture music-invariant spatial information, e.g., panning locations. Our experiments showed that the learned latent variables successfully capture spatial information separately from the music contents. Both the objective and subjective evaluations demonstrate that our style transfer-based upmixing model achieved interactive upmixing.