1 Introduction
While home theater and surround sound have become more common and affordable, most music is still released in stereo. Approaches to playing stereo content on multichannel speakers can be categorized into two kinds. The first uses a matrixing scheme [gerzon1991optimal] to map the left and right content channels into each of the surround speakers. These operate in different sound modes, e.g., direct, stereo, and all-channel stereo mode, without any source decomposition. The second approach relies on the separation of sound components from the stereo mixture, e.g., into music content and other ambient sounds [6082279, faller2006multiple] or into individual musical instruments [jeon2010robust], which are then placed at the desired panning locations. Various techniques have been proposed for primary-ambient separation, including frequency-domain inter-channel coherence index computation [5745013, avendano2002frequency, merimaa2007correlationbased], time- or frequency-domain adaptive filtering [irwan2002two, usher2007enhancement], PCA-based primary-ambient decomposition [baek2012efficient, ibrahim2016primary], and deep neural network (DNN)-based methods [uhle2008supervised, ibrahim2018primary, choi2021exploiting]. Perceived azimuth direction has also proven beneficial for direct-ambient separation [kraft2015stereo]. Jeon et al. proposed an upmixing algorithm based on robust source separation with a post-scaling algorithm to compensate for the interference [jeon2010robust]. Another DNN-based upmixing model, introduced by Park et al. [park2016subband], bypasses any explicit decomposition and generates the center and surround channels directly from the stereo channels, based on nonlinear transformations learned in the DNN.
In this paper, we develop a generative model for upmixing. The goal is to implicitly separate the instrumental components contained in the stereo signal and rearrange them into a five-channel setting that renders the desired perceptual source directions. This formulation implies that many legitimate upmixed versions can correspond to a single stereo signal, making optimization of this one-to-many mapping ill-defined.
To address this issue, we first make a basic assumption: the spatial images of a multichannel music signal can be represented independently of the musical content. This means that two different 5-channel music signals with the same instrumentation and panning will map to an identical spatial representation even though the content (melody, harmony, etc.) differs. We then aim to extract the source-specific spatial representation from the multichannel audio and utilize the learned representation to guide test-time upmixing. To this end, we propose a modified variational autoencoder (VAE) model [KingmaD2014vae] and train it in a supervised manner. As in ordinary VAE training, we use a 5-channel signal as both input and target. However, we also provide a stereo version of the 5-channel signal as input to the decoder, and train the VAE bottleneck layer to exclusively capture the spatial images of the sources rather than an entangled representation of music components and spatial information.
We empirically show that the learned latent representation reflects the spatial images of the multichannel input, i.e., it is correlated with the panning strategy used to render the 5-channel input and is invariant to the music content. Hence, the decoder of the trained VAE can generate the upmixed version of the input stereo with guidance provided in the form of the latent variables. Moreover, we show that panning information represented in the latent space can be transferred between different songs, enabling spatial style transfer.
2 Upmixing: a probabilistic formulation
We first describe a general probabilistic upmixing model in which spatial information and the stereo music signal are still entangled. Then, we show how the assumption that spatial information and stereo audio are independent alters the proposed formulation. We further introduce an alternative deterministic downmixing process to achieve the desired disentanglement.
We define the upmixing process probabilistically, as the likelihood of observing the 5-channel sample given the spatial information and stereo audio: $p_\theta(\bar{x}^{(5)} \mid \bar{z}, \bar{x}^{(2)})$, where $\theta$ stands for the model parameters and the bar notation denotes a sample of a random variable, e.g., $\bar{x}^{(5)} \sim x^{(5)}$. We use the superscript to denote the number of channels. However, we cannot compute the likelihood directly due to the intractable marginal distribution $p(z, x^{(2)})$ and posterior distribution $p(z, x^{(2)} \mid x^{(5)})$. The VAE employs variational inference to resolve this issue by proposing an encoder function $q_\phi(z, x^{(2)} \mid x^{(5)})$ that best approximates the true posterior. Coupled with the concept of the evidence lower bound (ELBO), the optimization is eventually defined as follows:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z, x^{(2)} \mid x^{(5)})}\big[\log p_\theta(\bar{x}^{(5)} \mid \bar{z}, \bar{x}^{(2)})\big] - D_{\mathrm{KL}}\big(q_\phi(z, x^{(2)} \mid x^{(5)}) \,\|\, p(z, x^{(2)})\big) \quad (1)$$

where $p(z, x^{(2)})$ denotes the prior distribution of the latent variables, thus forming a regularization term. Meanwhile, the main reconstruction objective is to maximize the expectation of the log-likelihood w.r.t. the approximated posterior distribution.
The desired disentanglement of stereo music content and spatial information assumes that $z$ and $x^{(2)}$ are independent of each other. Therefore, the joint posterior distribution can be factorized, i.e., $q_\phi(z, x^{(2)} \mid x^{(5)}) = q_\phi(z \mid x^{(5)})\, q_\phi(x^{(2)} \mid x^{(5)})$. To further constrain that $x^{(2)}$ only reflects stereo music content while $z$ represents 5-channel spatial information, we employ a hard regularization trick by replacing the random variable $x^{(2)}$ with a simple downmixed version $\bar{x}^{(2)}$. Then, the probabilistic model is simplified into a deterministic downmixing process, $\bar{x}^{(2)} = g(\bar{x}^{(5)})$, where $g(\cdot)$ denotes the predefined downmix function. We treat $\bar{x}^{(2)}$ as a constant and rewrite Eq. (1) as follows:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid x^{(5)})}\big[\log p_\theta(\bar{x}^{(5)} \mid \bar{z}, \bar{x}^{(2)})\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x^{(5)}) \,\|\, p(z)\big) \quad (2)$$
The loss function in Eq. (2) represents our proposed disentanglement method. As a modified VAE, the spatial information is learned via the encoder $q_\phi(z \mid x^{(5)})$, while the decoder takes in the extra stereo input to generate the multichannel output. The second term regularizes each latent variable by a standard normal distribution, i.e., $p(z) = \mathcal{N}(0, I)$, where $I$ is a $d$-dimensional identity matrix.
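As a concrete illustration, the regularization term of Eq. (2) has the familiar closed form for a diagonal Gaussian posterior. The sketch below (in NumPy; the mean/log-variance parameterization and function name are our own assumptions, not taken from the paper) computes $D_{\mathrm{KL}}(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I))$:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL divergence D_KL(N(mu, diag(sigma^2)) || N(0, I)),
    summed over the d latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# When the posterior coincides with the prior, the penalty vanishes.
print(kl_to_standard_normal(np.zeros(8), np.zeros(8)))  # 0.0
```

Minimizing this term pulls every latent dimension toward the prior, which is what later allows blind generation by sampling $\bar{z} \sim \mathcal{N}(0, I)$.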
3 Model Description
3.1 DNN architecture
Fig. 1 summarizes the overall model architecture. We employ densely connected convolutional blocks introduced in DenseNet [HuangG2017densenet] as our building blocks in both the encoder and the decoder. In both modules, we alternate 5 dense blocks with 4 transition layers. A dense block contains 5 convolutional layers, each of which produces a fixed number of output channels (the DenseNet growth rate). We set the stride to 2 for the transition layers in the encoder to reduce the data dimension. During training, the encoder takes as input 5-channel audio in the form of stacked magnitude spectrograms, $X_p^{(5)} \in \mathbb{R}^{5 \times F \times T}$, where $F$ and $T$ are the numbers of frequency subbands and time-domain frames. Here, we introduce a subscript $p$ to denote the particular panning strategy used for creating a 5-channel input signal. The encoder outputs two $d$-dimensional vectors, $\mu$ and $\sigma$, which are the parameters of the $d$-dimensional multivariate latent normal distribution. Latent variables are then sampled using the reparameterization technique [KingmaD2014vae], i.e., $\bar{z} = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. The stacked stereo spectrogram, $X^{(2)}$, is a unique feature of the proposed model. It is input to the decoder to provide the music signal for upmixing. During training, we first prepare a 5-channel signal under a panning condition $p$, from which the encoder learns the latent vector $\bar{z}_p$. Meanwhile, the same music stems are rearranged into another 5-channel version using a different panning condition $q$, whose stereo downmix is fed to the decoder. We rearrange the source locations rather than reusing $p$, to ensure that the left and right panning in the stereo is unrelated to the left and right panning in the 5-channel signal.
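The sampling and conditioning steps above can be sketched as follows (NumPy; the tensor sizes and variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, F, T = 16, 513, 128  # latent size and spectrogram dimensions (illustrative)

# Encoder outputs: mean and log-variance of the d-dimensional latent posterior.
mu = rng.standard_normal(d)
log_var = rng.standard_normal(d)

# Reparameterization: z = mu + sigma * eps with eps ~ N(0, I), so the
# sampling step stays differentiable w.r.t. the encoder parameters.
eps = rng.standard_normal(d)
z = mu + np.exp(0.5 * log_var) * eps

# Conditioning: repeat z at every time-frequency bin and stack it onto the
# 2-channel stereo magnitude spectrogram before feeding the decoder.
stereo_mag = np.abs(rng.standard_normal((2, F, T)))
z_map = np.broadcast_to(z[:, None, None], (d, F, T))
decoder_input = np.concatenate([stereo_mag, z_map], axis=0)
print(decoder_input.shape)  # (18, 513, 128)
```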
We repeat $\bar{z}$ and stack it onto $X^{(2)}$ at the concatenation step to condition the spatial information on the stereo input. The transition layer that receives the stereo input does not change the tensor size, which remains $2 \times F \times T$. After repeating the values of the $d$-dimensional embedding vector at all time-frequency bins, the input to the decoder is of size $(2 + d) \times F \times T$.

3.2 The Test-Time Generation Processes
For test-time generation, the decoder operates as the upmix generator, which takes the stereo signal and the spatial information as a seed. Our model can accommodate two use-case scenarios:


Style transfer-based upmixing: We posit that panning style transfer from one music piece to another is possible. Here we define the panning style as the set of apparent source directions for each instrument. In this case, the user provides two input signals: a source 5-channel audio $X_p^{(5)}$ carrying the panning style $p$, and the stereo music $X^{(2)}$ to upmix. Note that at test time we do not specify the panning method of the stereo input, as it does not influence the process. The actual music in the two signals differs, and the aim of the generator is to create a 5-channel audio signal that is upmixed from the stereo input and has the same panning method as the source. The proposed style transfer is conducted by sharing the latent variables between the style extraction (i.e., encoding) and upmixing (i.e., decoding) processes. The encoder takes in $X_p^{(5)}$ and learns $\bar{z}_p$, which encodes the source's spatial panning. The generator takes in the stereo seed signal and $\bar{z}_p$, and synthesizes the multichannel version of the target. This indirect approach enables a user interface where the user handles spatial control by providing an example song.

Blind upmixing: The blind upmixing method literally “generates” spatial panning information from the learned latent variables and applies it to the seed stereo signal. In our VAE, the generation process begins with a random sample $\bar{z} \sim \mathcal{N}(0, I)$ only; hence the spatial image of the generated 5-channel audio sample is out of the user's control. In the blind upmixing scenario, the sampling process is completely random, without involving any intuition about the latent space.
3.3 The Baseline
Although blind upmixing has been studied extensively in the literature, our approach to upmixing via style transfer is novel. Hence, we are unable to use any existing model as a baseline. Instead, we build a simple baseline model by spreading each channel of the stereo input to the front and rear channels of the same side in the 5-channel output. This is the same as the “all channel stereo” upmixing mode commonly found in home receivers. It is a straightforward way to perform upmixing and perfectly preserves the sound quality of the stereo input, because the process does not generate any artifacts. By comparing with the baseline model, we mainly seek to prove the functionality of our model for style-transfer upmixing.
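A minimal sketch of this baseline (NumPy): each stereo channel is copied to the front and rear speakers on the same side. The text does not specify how the center channel is filled; feeding it the mid signal (L+R)/2 is our own assumption for illustration.

```python
import numpy as np

def all_channel_stereo_upmix(x2):
    """Baseline upmix: spread each stereo channel to the front and rear
    channels of the same side. Channel order: [FL, RL, C, FR, RR].
    Using the mid signal for the center is an assumption (see lead-in)."""
    left, right = x2
    center = 0.5 * (left + right)  # assumed; not specified in the text
    return np.stack([left, left, center, right, right])

x2 = np.vstack([np.ones(4), -np.ones(4)])  # toy stereo signal
x5 = all_channel_stereo_upmix(x2)
print(x5.shape)  # (5, 4)
```

Because the output channels are plain copies of the input, this baseline introduces no artifacts, matching the description above.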
4 Experimental Setup
4.1 VectorBased Amplitude Panning for Data Preparation
To train and evaluate the model, it is necessary for us to know the panning method used to create the 5-channel audio, so as to examine whether the information captured in the latent space is correlated with the correct spatial images that we want to learn. Also, with access to stem sources, we can expose the model to various artificial upmixing configurations where source locations change freely. However, such ground-truth spatial maps are not readily available for most 5-channel music signals we have access to.
To that end, we choose to build our own 5-channel dataset from individual musical instruments via the vector-based amplitude panning (VBAP) method [pulkki1997virtual]. VBAP provides an efficient equation for virtual sound source positioning, enabling control of an unlimited number of loudspeakers in an arbitrary two- or three-dimensional placement around the listener. In this paper, we employ a two-dimensional rendering space with 5 speakers, following the ITU's 5.1 channel standard [ITU5.1] without the subwoofer, which is not considered in our upmix algorithm. For each instrument, we specify a virtual source direction. Then, we pan each instrument independently using the two speakers adjacent to the desired incoming direction, based on the vector base formulation [pulkki1997virtual].
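As an illustration of the pairwise formulation, the following sketch computes 2-D VBAP gains for one source and one adjacent speaker pair (NumPy; the angle convention and constant-power normalization are our assumptions following [pulkki1997virtual]):

```python
import numpy as np

def vbap_2d(source_deg, spk_a_deg, spk_b_deg):
    """Pairwise 2-D VBAP: express the unit vector of the desired source
    direction as a linear combination of the two adjacent loudspeaker
    unit vectors, then normalize to constant total gain power."""
    def unit(deg):
        rad = np.deg2rad(deg)
        return np.array([np.cos(rad), np.sin(rad)])
    basis = np.column_stack([unit(spk_a_deg), unit(spk_b_deg)])
    g = np.linalg.solve(basis, unit(source_deg))  # p = basis @ g
    return g / np.linalg.norm(g)

# A source placed exactly at one speaker gets all of that speaker's gain;
# a source halfway between the pair gets equal gains.
print(vbap_2d(30, 30, 110))  # ~[1, 0]
print(vbap_2d(70, 30, 110))  # equal components
```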
4.2 Datasets
We build a synthesized 5-channel dataset using the MUSDB18 dataset [musdb18hq]. MUSDB18 provides pop songs in stereo format with four separated instrumental stems: vocals, drums, bass, and other. We split the dataset into training, validation, and testing subsets, which amount to about 5, 0.5, and 1.5 hours, respectively. The 5-channel versions are created using the VBAP method described in Sec. 4.1 by controlling the stem tracks. Stereo input signals are rendered by applying a passive downmix to the synthesized 5-channel signals, i.e., $\bar{x}^{(2)} = A\bar{x}^{(5)}$, where $A$ is a fixed $2 \times 5$ downmix matrix. The order of the five channels in $\bar{x}^{(5)}$ is: front left, rear left, center, front right, and rear right.
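The passive downmix is a fixed linear mapping. Since the exact coefficients are not recoverable from the text, the sketch below assumes the common ITU-R BS.775 convention (center and surround channels attenuated by 1/√2); the channel order follows the text:

```python
import numpy as np

SQ = 1.0 / np.sqrt(2.0)  # assumed ITU-style attenuation for C and surrounds
# Channel order: [front left, rear left, center, front right, rear right]
A = np.array([
    [1.0, SQ, SQ, 0.0, 0.0],   # left  = FL + SQ*RL + SQ*C
    [0.0, 0.0, SQ, 1.0, SQ],   # right = FR + SQ*RR + SQ*C
])

x5 = np.random.default_rng(1).standard_normal((5, 1000))
x2 = A @ x5  # passive stereo downmix
print(x2.shape)  # (2, 1000)
```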
We mainly run our experiments on the synthesized MUSDB18 dataset. However, when validating the blind upmixing model, we employ another internal real-world dataset of professionally mixed surround music, with approximately 17 hours for training, 2.5 hours for validation, and 4 hours for testing, in order to test our model's generalization ability on real music data.
4.3 Training Setup
The directions of the sources are randomly sampled from the entire circle. The models are trained on 2.2-second-long segments in which at least one instrument is not silent. Each segment is processed by a short-time Fourier transform (STFT) on Hann-windowed frames of 1024 samples with 75% overlap, resulting in spectrograms with 513 frequency bins. We apply the phase of the stereo input to recover the 5-channel magnitude spectrograms: the left channel's phase in the stereo is used for both the front and rear left channels in the 5-channel audio, and similarly for the right channels; the mean of the left and right channels is used for the center channel recovery. We use the Adam optimizer with an initial learning rate of 0.005 [KingmaD2015adam].

4.4 Objective Evaluation Methods
We evaluate our style transferbased upmixing model against the baseline using various objective metrics as follows:


Overall reconstruction quality: The scale-dependent source-to-distortion ratio (SD-SDR) is employed to report the overall reconstruction quality by comparing the time-domain reconstruction with the ground truth, as SD-SDR is proven to better reflect scale reconstruction than ordinary SDR [LeRouxJL2018sisdr].

Spatial reconstruction quality: To validate the spatial reconstruction quality, we propose a new metric using the Wasserstein distance on inter-channel level differences (WILD). Given the 5-channel magnitude spectrogram, we compute an ILD matrix for each pair of channels. The five channels result in ten such ILD matrices in total. The Wasserstein distance is then computed between the histograms of the target and estimated ILD matrices. Note that the Wasserstein distance improves the robustness of the comparison when the two histograms are very different from each other, a case where KL divergence fails to quantify the dissimilarity.
Average angle difference: We use the least-squares algorithm to approximately decompose each of the 5 channels into a linear combination of the four instrumental sources. Based on the relative amount of each instrument spread across the 5 channels, we can estimate the virtual location of the instrument. We report the difference between the ground-truth and estimated directions of the upmixed instruments in degrees.
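The first two metrics can be sketched as follows (NumPy/SciPy). The ILD definition as a log-magnitude ratio in dB, and the use of empirical samples instead of explicit histograms in the Wasserstein distance, are our own assumptions for illustration:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def sd_sdr(ref, est):
    """Scale-dependent SDR [LeRouxJL2018sisdr]: scale the target by the
    projection coefficient but keep the unscaled error term, so that
    gain mismatches in the estimate are penalized."""
    alpha = np.dot(est, ref) / np.dot(ref, ref)
    return 10 * np.log10(np.sum((alpha * ref) ** 2) / np.sum((est - ref) ** 2))

def wild(ref_mag, est_mag, eps=1e-8):
    """Wasserstein distance on Inter-channel Level Differences, averaged
    over the ten channel pairs of 5-channel magnitude spectrograms."""
    dists = []
    for i in range(5):
        for j in range(i + 1, 5):
            ild_ref = 20 * np.log10((ref_mag[i] + eps) / (ref_mag[j] + eps))
            ild_est = 20 * np.log10((est_mag[i] + eps) / (est_mag[j] + eps))
            dists.append(wasserstein_distance(ild_ref.ravel(), ild_est.ravel()))
    return float(np.mean(dists))

s = np.sin(np.linspace(0, 100, 4410))
print(sd_sdr(s, 0.5 * s))  # ~0.0 dB: a halved estimate scores 0 dB
mag = np.abs(np.random.default_rng(0).standard_normal((5, 16, 16)))
print(wild(mag, mag))  # 0.0 for identical spectrograms
```

The 0 dB result for a perfectly shaped but halved estimate is exactly the scale sensitivity that distinguishes SD-SDR from scale-invariant variants.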
4.5 Subjective Evaluations
We conduct an ABX test to subjectively evaluate the proposed spatial style transfer algorithm. The ground-truth 5-channel version is provided to the participants as the reference. Participants are asked to choose which of the style-transferred reconstruction and the baseline upmix is more similar to the reference, in terms of the incoming direction of each of the four sources separately, as well as the overall spatial image. The test contains 5 trials and is conducted in a professional surround-sound listening room.
5 Experimental Results
5.1 Analysis on the Learnt Latent Space


To validate the degree of disentanglement, we analyze the statistical properties of the learned latent variables and whether they capture spatial information independently from the music content. In Fig. 1(a), we show the activation of the mean of the latent distribution for two different panning configurations of the same song. The two uncorrelated graphs indicate that the learned latent dimensions differ from each other even though the two versions share the exact same stem tracks. On the contrary, Fig. 1(b) shows highly correlated activations for two completely different songs, because their corresponding latent variables succeed in learning the same representation based on their identical panning locations. These results suggest that the learned latent variables are influenced by the spatial arrangement of the stem tracks rather than by the music content, achieving the desired disentanglement.
Another visualization of the latent variables further strengthens our claim. We prepare a set of data samples created from the combinations of five different songs and five different spatial configurations. If the latent space reflects the spatial information more than the music content, there must be a latent structure that reflects the spatial characteristics. In Fig. 1(c) and 1(d), we show a dimension-reduced latent space obtained via principal component analysis. In 1(c), the coloring is based on the songs, while in 1(d) it is based on the spatial configurations. We can see that segments from the same spatial configuration form a cluster in 1(d), while segments that belong to the same song are scattered everywhere. The average standard deviation of each color group in 1(c) is relatively higher than that in 1(d). This observation again indicates that the encoder successfully extracts music-invariant spatial features.

5.2 Style Transfer-Based Upmixing Results
We conduct the style transferbased upmixing and compare the model’s performance against the baseline model. To this end, we prepare 540 test examples that are 10 seconds long and are rendered in one of the 20 arbitrarily defined panning configurations.


                 SD-SDR (dB)    Angle (°)       WILD
Style-transfer   8.71 ± 1.50    19.5 ± 15       58.02 ± 13.78
Baseline         4.20 ± 0.71    49.75 ± 19.75   70.84 ± 10.03



Objective evaluations: Table 1 summarizes the objective evaluation results. Our proposed style transfer-based upmixing method shows a clear advantage over the baseline upmixing model, with a better reconstruction score, a smaller angle difference, and a lower WILD. It is noteworthy that the style-transferred results attain higher SD-SDR than the fixed-panning baseline, even in the presence of the machine learning model's algorithmic artifacts. The SD-SDR score indicates that our model can properly fulfill the mission of arbitrarily positioning instruments during 5-channel generation without significant harm to the sound quality.

Subjective evaluations: Fig. 3 reports the percentage of subjects preferring the proposed style transfer-based upmixing over the baseline with regard to direction similarity. Ten listening experts participated in our subjective test. More than 80% of the participants found the proposed model more similar to the reference in terms of the overall perceptual panning. For the individual instruments, the reconstruction of vocals is the most favored, which might account for the overall performance due to its dominance in pop music. Our model also gains the listeners' preference on drums and other, although with higher variance for the latter, while the advantage does not extend to bass. We believe that the differing preferences across instruments result from the fact that the decoder implicitly performs instrument separation and relocation for the estimated sources: when an instrument is difficult to separate from the mixture, its relative scale in each channel, and hence its perceived direction in the 5-channel output, may be negatively affected. This also explains why our preference scores align with the diverse instrument-specific separation performances on the MUSDB18 dataset reported in the literature [StollerD2018waveunet].
5.3 Blind upmixing and other experiments
To test our model's potential as a generator, we apply blind upmixing to MUSDB18 signals and to real-world music signals from our internal dataset. Because all the models are initially trained on MUSDB18, we fine-tune the model with real-world 5-channel music before blindly generating 5-channel examples, to let it better accommodate real-world music signals. Because the blind upmixing process is entirely random, we cannot validate its quality and the surround effect via a quantitative measure. However, we find the upmixing results convincing based on informal listening by our peers. We provide blind upmixing samples, both from MUSDB18 and the real-world dataset, at https://saige.sice.indiana.edu/researchprojects/generativeupmixing.
We evaluated latent dimensions over a range of values, but found no trend strong enough to recommend a single number. We eventually fixed one value of $d$ for all the experiments.
6 Conclusion
In this paper, we proposed to tackle the stereo upmixing problem in a generative way, where the spatial images and music content of the 5-channel signal are disentangled to allow style transfer-based upmixing. We formulated the problem as a modified VAE model and trained the latent space to capture music-invariant spatial information, e.g., panning locations. Our experiments showed that the learned latent variables successfully capture spatial information separately from the music content. Both the objective and subjective evaluations demonstrate that our style transfer-based upmixing model achieves interactive upmixing.