Automatic DJ Transitions with Differentiable Audio Effects and Generative Adversarial Networks

by Bo-Yu Chen, et al.

A central task of a Disc Jockey (DJ) is to create a mixset of music with seamless transitions between adjacent tracks. In this paper, we explore a data-driven approach that uses a generative adversarial network to create the song transition by learning from real-world DJ mixes. In particular, the generator of the model uses two differentiable digital signal processing components, an equalizer (EQ) and a fader, to mix two tracks selected by a data generation pipeline. The generator has to set the parameters of the EQs and fader in such a way that the resulting mix resembles real mixes created by human DJs, as judged by the discriminator counterpart. Results of a listening test show that the model can achieve competitive results compared with a number of baselines.




1 Introduction

Recent years have witnessed growing interest in building automatic DJ systems. Research has been done not only to computationally analyze existing mixes made by human DJs [reverse_dj, computational_dj, reverse_dj_transition], but also to develop automatic models that mimic how DJs create a mix [djnet, drum&bass, spotify_transition, highlight_dj, reinforcement_dj]. We are particularly interested in the automation of DJ transition making, for it involves expert knowledge of DJ equipment and music mixing, and represents one of the most important factors shaping DJ styles.

As depicted in Figure 1, the DJ transition making process is composed of at least two parts. Given a pair of audio files, cue point selection decides the “cue out” time, at which the first track becomes inaudible, and the “cue in” time, at which the second track becomes audible. Then, mixer controlling applies audio effects such as EQ to the portion of the two tracks where they overlap (i.e., the transition region), so that the resulting mix contains a smooth transition between the tracks. While cue point selection has been more often studied in the literature [spotify_transition, cue_select], automatic mixer controlling for DJ transition generation remains much unexplored. Existing methods are mostly based on hand-crafted rules [drum&bass, highlight_dj], which may not work well for all musical materials. We aim to automate this process by exploring for the first time a data-driven approach to transition generation, learning to DJ directly from real-world data.

Figure 1: Illustration of the DJ transition making process.

To achieve this, a supervised approach would require the availability of the original song pairs as model input and the corresponding DJ-made mix as the target output. Preparing such a training set is time-consuming, for we need to collect the original songs involved in mixes, detect the cue points, and properly segment the original songs. To get around this, we resort to a generative adversarial network (GAN) approach [gan] that requires only a collection of real-world mixes (as real data) and a separate collection of song pairs (called “paired tracks” in Figure 1) that do not need to correspond to the real-world mixes (as input to our generator). We use the principle of GAN to learn how to mix the input paired tracks to create realistic transitions.

Moreover, instead of using arbitrary deep learning layers to build the generator, we build on differentiable digital signal processing (DDSP) [ddsp_synthesizer] and propose novel and light-weight differentiable EQ and fader layers as the core of our generator. This provides a strong inductive bias, as the job of the generator now boils down to determining the parameters of the EQ and fader to be applied respectively to the paired tracks. While DDSP-like components have been applied to audio synthesis [ddsp_synthesizer, ddsp_singing], audio effect modeling [ddsp_iir, ddsp_blackbox, ddsp_distortion], and automatic mixing [ddsp_mixing], they have not, to our best knowledge, been used for transition generation. We refer to our model as DJtransGAN.

For model training, we develop a data generation pipeline to prepare input paired tracks from the MTG-Jamendo dataset [mtg_jamendo], and collect real-world mixes from livetracklist [livetracklist]. We report a subjective evaluation of DJtransGAN by comparing it with baseline methods. Examples of the generated mixes can be found online at

2 Differentiable DJ Mixer

EQs and faders are two essential components for creating DJ mixing effects. To make an automatic DJ mixer that is both differentiable and capable of producing such mixing effects, we incorporate DDSP components that resemble EQs and faders in our network.

We consider audio segments of paired tracks whose lengths have been made identical by zero-padding the outgoing track at the back, after its cue-out point, and zero-padding the incoming track at the front, before its cue-in point. We compute their STFT spectrograms, which share the same numbers of time frames and frequency bins. Two time-frequency masks, one per track, represent the mixing effects to be applied; the effects are applied via element-wise products between each mask and the corresponding spectrogram. The two masks represent the DJ mixing effects of fade-out and fade-in, respectively. The goal is to generate the masks such that we obtain a proper DJ mix by reverting the two masked spectrograms back to the time domain and taking their summation as the output.

Figure 2: Differentiable DJ Mixer, illustrating a 2-band EQ.

2.1 Differentiable Fader

Inspired by [sigmoid_fader, mobilnet], we combine two clipped ReLU functions as the basic template of a fading curve. The curve is parameterized by its starting time and the slope of its transition region, applied to a linearly increasing sequence that ranges from 0 to 1 along the time axis. The fading curves for fade-in and fade-out are then the rising and falling forms of this template, respectively.


To ensure that the fading occurs only inside the transition region, we add two extra parameters that bound where the fading may begin and end; this is similar to the cue button function on an ordinary DJ mixer. Likewise, we impose a constraint that prevents the sound volume from reaching its maximum outside the transition region. As these bounds can be determined solely from the starting time and the slope, we refer to those two trainable parameters collectively in what follows.
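The clipped-ReLU fading template described above can be sketched as follows; the function names, the normalized time axis, and the exact parameterization are illustrative assumptions rather than the paper's definitions:

```python
def clipped_relu(x):
    """ReLU clipped to the range [0, 1]."""
    return min(max(x, 0.0), 1.0)

def fade_in_curve(num_points, start, slope):
    """Fade-in template from a clipped ReLU over normalized time.

    `start` is the (normalized) starting time of the fade and `slope`
    controls how quickly the curve rises; both names are illustrative.
    """
    t = [i / (num_points - 1) for i in range(num_points)]   # 0..1 sequence
    return [clipped_relu(slope * (ti - start)) for ti in t]

def fade_out_curve(num_points, start, slope):
    """Complementary fade-out: one minus the fade-in template."""
    return [1.0 - v for v in fade_in_curve(num_points, start, slope)]
```

With a start of 0 and a slope of 1, the fade-in rises linearly from 0 to 1 across the segment; larger slopes compress the transition region, mimicking a faster fader move.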

2.2 Differentiable Equalizer

EQs are used to amplify or attenuate the volume of specific frequency bands in a track. Our network achieves the effect of EQ by decomposing audio into sub-bands with several low-pass filters [subband_separation], and then using faders to adjust the volume of each sub-band.

There are several ways to implement a differentiable low-pass filter, such as via FIR filters [ddsp_synthesizer], IIR filters [ddsp_iir], or a frequency sampling approach [ddsp_eq, ddsp_distortion]. However, all of them have limitations. An FIR filter would require learning a long impulse response in this case. An IIR filter is recurrent, which leads to computational inefficiency. The frequency sampling approach is relatively efficient but requires an inverse FFT in every mini-batch during model training. To avoid these issues, we propose to calculate the loss in the time-frequency domain to reduce the number of inverse FFTs, and to apply fade-out curves along the frequency axis to achieve low-pass filtering.


The proposed low-pass filter can thus be thought of as “a fade-out curve in the frequency domain.” Its two parameters serve similar purposes as the cutoff frequency and the Q factor, which are typical parameters of time-domain filters, and they act on a linearly increasing sequence that ranges from 0 to 1 along the frequency axis. Moreover, to constrain the filtering within a proper frequency band, we introduce lower and upper frequency bounds and require the filter to stay within them. As the filter can be solely parameterized by its cutoff and slope, we refer to those two parameters collectively in what follows.
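A hedged sketch of such a frequency-domain low-pass filter, interpreting it as a fade-out curve over normalized frequency (the parameter names and the normalization are assumptions):

```python
def clipped_relu(x):
    """ReLU clipped to the range [0, 1]."""
    return min(max(x, 0.0), 1.0)

def lowpass_mask(num_bins, cutoff, slope):
    """Low-pass filter as a fade-out curve along normalized frequency.

    `cutoff` plays the role of the cutoff frequency and `slope` a role
    similar to the Q factor; names and normalization are illustrative.
    """
    f = [k / (num_bins - 1) for k in range(num_bins)]       # 0..1 sequence
    # 1 at low frequencies, rolling off toward 0 above the cutoff.
    return [1.0 - clipped_relu(slope * (fk - cutoff)) for fk in f]
```

A very steep slope approximates a brick-wall filter: bins at or below the cutoff pass at full gain while bins above it are zeroed out.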

In the same vein, we decompose an input track into a number of sub-bands by combining multiple low-pass filters to form a series of fading curves in the frequency domain. Each sub-band fading curve is obtained from the low-pass filters together with its band limitations, and the resulting per-band filters implement the EQ effect. See Figure 2 for an illustration.
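One way to realize this sub-band decomposition is to difference adjacent low-pass curves so that the bands sum to unity at every bin; this construction is an illustrative assumption, not necessarily the paper's exact formulation:

```python
def clipped_relu(x):
    """ReLU clipped to the range [0, 1]."""
    return min(max(x, 0.0), 1.0)

def lowpass(num_bins, cutoff, slope):
    """Low-pass filter as a fade-out curve over normalized frequency."""
    f = [k / (num_bins - 1) for k in range(num_bins)]
    return [1.0 - clipped_relu(slope * (fk - cutoff)) for fk in f]

def band_curves(num_bins, cutoffs, slope):
    """Split the frequency axis into len(cutoffs) + 1 sub-band curves by
    differencing adjacent low-pass curves (illustrative construction)."""
    lps = [lowpass(num_bins, c, slope) for c in sorted(cutoffs)]
    edges = [[0.0] * num_bins] + lps + [[1.0] * num_bins]
    bands = []
    for low, high in zip(edges[:-1], edges[1:]):
        bands.append([h - l for l, h in zip(low, high)])
    return bands
```

By construction the telescoping sum of the band curves is 1 at every bin, so the decomposition leaves the signal unchanged when all band gains are neutral.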


2.3 Differentiable Mixer

We now assemble the full differentiable mixer. It has multiple sub-bands, and each sub-band has an EQ with its own fader. This means we have two fading curves for each sub-band: one along the time axis and the other along the frequency axis. Each sub-band thus carries fade-in and fade-out curves along the time axis and corresponding fading curves along the frequency axis, parameterized by the starting times and slopes introduced above.

We construct the final fade-in and fade-out effects by summing, over all sub-bands, the outer products between the corresponding time-axis and frequency-axis curves.


In sum, the trainable parameters of the proposed differentiable mixer are the starting times and slopes of the fading curves along the time and frequency axes of every sub-band. The remaining challenge is to learn proper values for these parameters so that the mixer generates feasible DJ mixing effects.
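The outer-product construction of the final time-frequency masks can be sketched as follows, with random toy curves standing in for the learned fader and EQ curves:

```python
import numpy as np

T, F, n = 8, 16, 4   # time frames, frequency bins, sub-bands (toy sizes)
rng = np.random.default_rng(0)

# Stand-ins for the learned curves: n fading curves along the time axis
# and n fading curves along the frequency axis.
time_curves = rng.random((n, T))
freq_curves = rng.random((n, F))

# The full time-frequency mask is the sum over sub-bands of the outer
# product between each time-axis curve and its frequency-axis curve.
M = sum(np.outer(t, f) for t, f in zip(time_curves, freq_curves))
```

Summing per-band outer products is equivalent to a single matrix product between the stacked curves (`time_curves.T @ freq_curves`), which is how the mask would typically be computed in a batched implementation.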

3 Methodology

Figure 3: Schematic plot of the model architecture of the proposed DJtransGAN model for generating DJ transition.

3.1 Dataset

To train our model, we collect a dataset of real-world mixes, and another dataset of paired tracks with artificially generated cue points.

Real DJ transition data. We obtain a collection of DJ mixes by crawling livetracklist, a website hosting DJ mixes [livetracklist]. We select 284 mixes that use the mix tag and, based on human-annotated boundaries, segment the mixes into individual tracks, regarding every two consecutive tracks as a music section that contains a DJ transition. Sections shorter than one minute are discarded. We finally retrieve 7,064 music sections in total.

Data generation pipeline. To generate the paired tracks and the corresponding cue points, we follow a pipeline similar to that of [drum&bass], which draws on domain knowledge of DJ practices. We first compute structure boundaries using the constant-Q transform (CQT)-based “structure feature” algorithm from MSAF [msaf], and detect the beats, downbeats, tempo, and musical keys with Madmom [madmom]. Next, we compile a group of segments by extracting those that lie between two structure boundaries and are longer than 30 bars of music. For each such segment, we find a suitable partner by first selecting 100 candidates that satisfy 1) a BPM difference of no more than five, 2) a key difference of no more than two semitones, and 3) origination from a different track. Among the 100 candidates, we identify the best fit as the one with the highest mixability score, as measured by the Music Puzzle Game model [puzzle_game]. Moreover, we pitch-shift and time-stretch the second segment to match the pitch and tempo of the first. To fix the transition region to eight bars [highlight_dj, reverse_dj_transition], we set the cue-out and cue-in times to the first downbeat of the last eight bars of the first segment and the last downbeat of the first eight bars of the second segment, respectively.
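The candidate filtering rules above can be sketched as a simple predicate; the dictionary fields and the flat semitone distance are hypothetical simplifications of the pipeline:

```python
def eligible(candidate, reference):
    """Candidate filter for segment pairing (illustrative sketch; the
    field names and key-distance measure are assumptions)."""
    return (
        abs(candidate["bpm"] - reference["bpm"]) <= 5        # tempo rule
        and abs(candidate["key"] - reference["key"]) <= 2    # key rule (semitones)
        and candidate["track_id"] != reference["track_id"]   # different source track
    )

reference = {"bpm": 126.0, "key": 0, "track_id": "a"}
good = {"bpm": 128.0, "key": 2, "track_id": "b"}
too_fast = {"bpm": 133.0, "key": 0, "track_id": "b"}
```

In the actual pipeline, the 100 segments passing such a filter would then be ranked by the mixability score and the top candidate kept.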

We build training and testing corpora with 1,000 and 100 tracks, respectively, from the songs tagged electronic in the MTG-Jamendo dataset [mtg_jamendo]. In total, we generate 8,318 and 830 paired tracks from the two corpora, respectively.

Preprocessing. We fix all inputs to 60 seconds at a 44.1 kHz sampling rate. For the real mix dataset, the input is centered on the human-annotated boundary sample. For the data from the generation pipeline, the input is centered on the “cue-mid” point, midway between the two cue points. When needed, music sections are zero-padded symmetrically.

3.2 Model Architecture

The proposed DJtransGAN model consists of a generator G and a discriminator D, as shown in Figure 3. We use the spectrograms of the paired tracks as the input features for both G and D during training. The architecture of G follows the controller network of the Differentiable Mixing Console proposed by Steinmetz et al. [ddsp_mixing]. The goal is to learn the parameters of the differentiable DJ mixer.

Each controller network encoder learns a feature map from its respective input. We stack the resulting feature maps and feed them to each context block. The context blocks downscale the output channels from two to one and feed them directly to the post-processors, which predict the fader and EQ parameters. These parameters are fed to the differentiable DJ mixer to obtain the two masked spectrograms, and the input to the discriminator corresponds to their summation.

The encoder applies a log-scaled Mel-spectrogram layer followed by three residual convolutional blocks. Each residual convolutional block contains two convolutional layers with a filter size of 3x3 and strides of 2 and 1, respectively. The three blocks have 4, 8, and 16 filters, respectively. ReLU and batch normalization are applied after all convolutional layers.

The context block contains a convolutional layer with one filter of size 1x1. The purpose of this block is to provide sufficient cross-information to each post-processor when predicting the mixer parameters. The post-processor contains three MLP layers. Leaky ReLU and batch normalization are applied after all layers except the last MLP layer, where the sigmoid function is used. The output dimensions of the three MLP layers are 1,024, 512, and a final size determined by the number of bands in the differentiable DJ mixer. The encoder in the discriminator is identical to that of the controller network. Similarly, the post-processor in the discriminator is the same, except that the dimension of its last MLP layer is set to 2.

Figure 4: The STFT spectrograms of some random mixes generated by the DJtransGAN model.

3.3 Training and Inference

In our implementation, we use a 128-bin log-scaled Mel-spectrogram computed with a 2,048-point Hamming window and a 512-point hop size for the STFT. We set the number of bands of the differentiable DJ mixer to 4, with band limits of 20, 300, and 5,000 Hz on the lower side and 300, 5,000, and 20,000 Hz on the upper side, to cover the low, mid, high-mid, and high frequency ranges, respectively. Overall, the generator is trained to use the differentiable DJ mixer by controlling 24 parameters. We choose a min-max loss [gan] as our training objective and train with the Adam optimizer for 5,298 steps, a batch size of 4, and a learning rate of 1e-5 for both the generator and the discriminator.

As the final step, we apply the inverse STFT to the two masked spectrograms, using the phases of the respective input tracks, as shown in Figure 3. We then sum the two resulting waveforms to obtain the waveform of the generated transition. Figure 4 shows example spectrograms of generated mixes.

4 Subjective Evaluation

Figure 5: Results of the subjective evaluation for (left) all participants, (middle) participants experienced in DJing, and (right) inexperienced participants.

We compare our model, denoted GAN below, with the following four baselines.


  • Sum is a plain summation of the paired tracks without any effects.

  • Linear applies a linear cross-fade in the transition region, which is a common practice in automatic DJ systems [spotify_transition, highlight_dj].

  • Rule consists of a set of general-purpose decision rules that we devised after consulting with expert DJs. Depending on the presence or absence of vocals, Rule applies one of the following four transition types: vocal to vocal (V-V), non-vocal to vocal (NV-V), vocal to non-vocal (V-NV), and non-vocal to non-vocal (NV-NV). For each transition type, a specific fading pattern and EQ preset is applied.

  • Human: lastly, the second author of the paper, who is an amateur DJ, creates the transitions by hand to serve as a possible performance upper bound.

For a fair comparison, Rule and Human use a 4-band EQ. For all methods and for each pair of tracks, the same cue points are used.

We include eight groups of paired tracks from the testing set, two for each transition type (e.g., V-V) mentioned above. We conduct the listening test via an online questionnaire. Participants are directed to a listening page containing the mixes of the five methods for one random group of paired tracks. We first ask them to identify the actual timestamps at which they notice the transition, and then to assess each test sample on a five-point Likert scale (1–5) in terms of Preference and Smoothness, and on a three-point Likert scale (1–3) in terms of Melodious (to further test whether the transition contains conflicting sounds). Participants can also leave comments on each test sample. Finally, we ask them to specify the mix they like the most; we ask this last question mainly to test consistency with the Preference ratings given to the individual mixes.

Figure 5 shows the mean opinion scores (MOS). We discard participants whose answer to the first question is not within the transition region, and those whose answer to the last question shows inconsistency. 136 of the 188 participants met the requirements. They self-reported their gender (105 male), age (17–58), and experience of DJing (46 experienced). A Wilcoxon signed-rank test [wilcoxon] was performed to statistically assess the comparative ability of the five approaches. Overall, the responses indicated an acceptable level of reliability (Cronbach's α = 0.799 [cronbach]).

Experienced subjects tend to score the transitions lower than inexperienced subjects do. Among experienced subjects, Human receives the highest ratings as expected, although the differences with respect to Melodious are not statistically significant; Linear, Rule, and GAN are significantly better than Sum for Melodious. Inexperienced subjects also rate Human the highest, but Human is not significantly better than Linear in Preference and Melodious. The difference between Human and either Linear or GAN is also not significant in Smoothness. Sum receives the lowest ratings, as expected.

We find that some of the experienced subjects tended to give low scores to all transitions. Based on the comments, we infer that the music style of the selected paired tracks is unacceptable to them and therefore greatly influences their ratings. On the other hand, this was not observed in the inexperienced subjects. Overall, there is no significant difference among GAN, Linear and Rule, suggesting that our GAN approach can achieve competitive performance compared to the baselines except for Human.

DJ transitions can be considered an art [spotify_transition, reverse_dj_transition] and therefore a highly subjective task whose result cannot be objectively categorized as correct or incorrect. This is similar to the field of automatic music mixing [eval_mixing], where an exploration of this issue has led to an analysis of the participants’ listening test comments [peer_review].

Following this idea, we collect in total 150 comments from 58 subjects (24 experienced). According to the comments, a large part of the experienced subjects indicate that the paired tracks are not suitable for each other and that the selection of structure and cue points is erroneous, suggesting room for improving the data generation pipeline. In addition, several comments indicate that these subjects rate Linear, Rule, and Human higher because they can recognize the mixing technique being used. In contrast, they are unfamiliar with the possible techniques employed by our GAN, which may imply that GAN creates its own style by integrating multiple DJ styles from real-world data. Nevertheless, an in-depth analysis of the learned mixing style is needed in the future. Finally, some experienced subjects comment that GAN makes good use of the high-pass filter when mixing vocals and background music, especially in the V-NV transition type. The mixes from GAN may not be perfect, but they feel more organic than the others (except for the one by Human). Subjects also criticize GAN for making the transition too fast within the mix; such fast transitions are particularly unpleasant for the V-V transition type.

5 Conclusion and Future Work

In this paper, we have presented a data-driven, adversarial approach to generating DJ transitions by machine. We have developed a data generation pipeline and proposed a novel differentiable DJ mixer for EQs and loudness faders. The differentiable EQ is achieved in the time-frequency domain by using trainable fade-out curves along the frequency axis, which are based on the frequency response of low-pass filters. Our method is an alternative to differentiable FIR and IIR filters, although we do not have space to present an empirical performance comparison against these alternatives. We have also conducted a subjective listening test and shown that our model is competitive with baseline models. While not reaching human-level quality, our method shows the feasibility of GANs and differentiable audio effects for audio processing tasks. It has the potential to be applied in other tasks such as automatic music mixing and mastering, audio effect modeling, music synthesis, and radio broadcast generation.

As future work, the quality of the data generation pipeline can be improved, especially the structure segmentation and segment pairing parts. The cue point selection process could also be learned instead of relying on fixed cue points. Furthermore, the transitions learned by the model could be analyzed in depth, and more comprehensive evaluation metrics explored.