
High-Fidelity Audio Generation and Representation Learning with Guided Adversarial Autoencoder

06/01/2020
by   Kazi Nazmul Haque, et al.

Unsupervised disentangled representation learning from unlabelled audio data and high-fidelity audio generation have become two linchpins of machine learning research. However, a representation learned in an unsupervised setting does not guarantee its usability for a downstream task at hand, which can be a waste of resources if the training was conducted for that particular posterior job. Conversely, if the model is heavily biased towards the downstream task during representation learning, it loses its generalisation capability: the downstream job benefits directly, but the ability to scale to other related tasks is lost. To fill this gap, we propose a new autoencoder-based model named "Guided Adversarial Autoencoder (GAAE)", which can learn both post-task-specific representations and a general representation capturing the factors of variation in the training data, leveraging only a small percentage of labelled samples; this makes it suitable for future related tasks. Furthermore, the proposed model can generate audio of superior quality that is indistinguishable from real audio samples. With extensive experimental results, we demonstrate that, by harnessing the power of high-fidelity audio generation, the proposed GAAE model can learn powerful representations from an unlabelled dataset using a small percentage of labelled data as supervision/guidance.



I Introduction

Representation learning is an essential research field where the common belief is that any higher-dimensional data can be mapped into a lower-dimensional representation space in which the variational factors of the data are disentangled. This implies that the distinct and informative characteristics/attributes of the data are easily separable in the representation space [bengio2013representation]. Therefore, learning disentangled representations from unlabelled datasets opens a window of opportunity for researchers to utilise vastly available unlabelled data for downstream tasks [francesco_2019]; for example, a disentangled representation learned from freely available YouTube audio can be used to improve emotion recognition from audio, where large labelled datasets are unavailable.

Recently, the Generative Adversarial Network (GAN) [goodfellow:2014] has shown prodigious success in generating real-like samples by capturing the training data distribution [karras:2017, karras2019style, Andrew_biggan, donahue2019large]. A GAN comprises a Generator network and a Discriminator network trained to beat each other in a minimax game. During training, the Generator tries to fool the Discriminator by generating real-like samples from a random noise/latent distribution, and the Discriminator tries to defeat the Generator by distinguishing the generated samples from real samples [goodfellow:2014]. Through this game-play, the Generator disentangles some underlying attributes of the data in the given random latent distribution [radford2015]. Researchers have therefore achieved great success in learning powerful representations [chen2016infogan, chorowski_wavenet_autoencoder, radford2015, donahue2019large, zhao:2017, makhzani:2016, karras2019style] with GAN-based models in a completely unsupervised manner. Hence, GAN-based models can be used successfully in audio research, where limited or no labelled data is available.
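To make the minimax game concrete, the following PyTorch sketch trains a toy generator and discriminator on random vectors; the layer sizes, optimiser settings and binary cross-entropy objective are illustrative choices for this sketch and are not taken from any model discussed in this paper.

import torch
import torch.nn as nn

latent_dim, data_dim = 64, 256
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    n = real_batch.size(0)
    # Discriminator step: push real samples towards 1 and generated samples towards 0.
    fake = G(torch.randn(n, latent_dim)).detach()  # no gradient flows into G here
    d_loss = bce(D(real_batch), torch.ones(n, 1)) + bce(D(fake), torch.zeros(n, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: try to make the Discriminator label generated samples as real.
    g_loss = bce(D(G(torch.randn(n, latent_dim))), torch.ones(n, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

print(train_step(torch.randn(16, data_dim)))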

The representation learning capability of a GAN depends on its sample generation quality [donahue2019large]. Although GAN-based models are successful at generating high-fidelity images, they fail to perform likewise for complex audio waveform generation [engel2019gansynth]. Thus, to generate audio successfully with GANs, researchers have focused on working with the spectrogram (an image-like 2D representation) of the audio, which can be converted back to audio with minimal loss [chris_wspecgan, engel2019gansynth, marafioti2019adversarial]. However, recently proposed high-performing GAN architectures such as BigGAN [Andrew_biggan] and StyleGAN [karras2019style] are still not well explored in the audio field, leaving room for researchers to explore the compatibility of these models with audio data.

A representation learned with GANs in a completely unsupervised manner does not guarantee its usability for any particular downstream task, because the model can ignore characteristics of the data that are important for succeeding in that downstream job [haque2020guided]. So, some form of bias towards the downstream task is necessary during unsupervised training to succeed in the posterior task [francesco_2019].

Hence, learning meaningful representations from an unlabelled dataset using GAN models requires good generation as well as some guidance towards the downstream task. We therefore previously proposed a BigGAN-based architecture called "Guided Generative Adversarial Neural Network (GGAN)", which is capable of learning powerful representations from an unlabelled dataset with guidance from a few labelled data samples, by harnessing the power of its high-fidelity spectrogram generation. However, the focus of the GGAN was on learning a representation for one particular downstream task, which makes the learned representation useless for other, unrelated tasks [haque2020guided].

Nonetheless, in many cases it is desirable to learn a representation that can be used both for a particular downstream task and for future tasks independent of the job at hand [bengio:2013]. To address this shortcoming of the GGAN model and the gap in superior audio generation research, in this paper we propose a novel autoencoder-based model named "Guided Adversarial Autoencoder (GAAE)". Our GAAE model can generate diverse and high-fidelity audio samples, and using this superior generation quality it can learn two types of useful representations from an unlabelled audio dataset with a minimal amount of labelled data as guidance. Among these two types of representations, one is the guided/post-task-specific representation, capturing the attributes/characteristics required for the downstream task, and the other is the general/style representation, capturing other general attributes of the data independent of the task at hand. Our primary contributions can be summarised as follows.

  • We propose a novel autoencoder-based model named GAAE, which learns to generate high-fidelity audio samples capturing the diverse modes of the training data distribution, leveraging guidance from a small percentage of labelled samples from the same or a related dataset. We evaluate the sample generation quality of the proposed model on two audio datasets from different domains: the Speech Commands dataset (S09) and the musical instrument sounds dataset (Nsynth). Comparing the model's performance with the literature, we demonstrate that the GAAE model performs significantly better than state-of-the-art (SOTA) models.

  • We show that the GAAE model learns to disentangle general attributes/characteristics of the data in the representation space, which can be beneficial for other potential future tasks. Furthermore, we demonstrate that the GAAE model can learn a post-task-specific representation from an unlabelled dataset according to the given guidance, which directly benefits the downstream task at hand, where the guidance comes from a small percentage of labelled samples, either from the same dataset or from other related datasets. The representation learning of the GAAE model is evaluated on three different datasets: the Speech Commands dataset (S09), the audiobook speech dataset (Librispeech) and the musical instrument sounds dataset (Nsynth). We demonstrate that the GAAE model performs better than SOTA models.

II Background and Related Work

II-A Audio Representation Learning

II-A1 Supervised Representation Learning

Neural networks can learn powerful representations from supervised training on large datasets, and these learned representations can be used for similar tasks or scenarios where limited labelled data is available [goodfellow_book:2016]. This supervised representation learning is prevalent in the field of computer vision, where researchers have shown many successful implementations owing to the availability of enormous amounts of labelled data [zhu2011heterogeneous, shao2014transfer, huh2016makes, shin2016deep, Rana7486123, latif2019direct, rana2019multi, li_PAMI]. Likewise, in the audio domain some large labelled datasets are available, so researchers have utilised this opportunity to train neural networks to learn representations in a supervised manner and then transfer that learning to further audio processing tasks where labelled data is limited. In [Vesely], the authors conducted supervised training with an artificial neural network on the multilingual speech database GlobalPhone [schultz2002globalphone] to obtain language-independent bottleneck features. In another speech-related work [elshaer2019transfer], the authors learned a supervised representation from the SoundNet dataset [Aytar_soundnet] and used it to improve anger detection in speech audio. Beyond speech, supervised representation learning has been successful for acoustic scene classification [Rakotomamonjy, Aytar_soundnet]; for instance, researchers have pretrained convolutional neural networks on AudioSet [gemmeke2017audio], a dataset of weakly labelled sound events from YouTube videos, to learn representations used for audio classification scenarios where labelled data is limited [Pons_ICASSP]. Kumar et al. [Kumar_IEEAS] used supervised representations learned from AudioSet to improve environmental sound classification, tested on the ESC-50 dataset [piczak2015esc]. In the musical domain, the Million Song Dataset [bertin2011million] has been used to learn supervised representations that ameliorate other musical audio classification tasks [van2014transfer, Wang]. Researchers have also successfully transferred the Inception-v4 [Inception_v4] model, trained on image classification, to the acoustic domain for the classification of bird sounds [sevilla2017audio]. Supervised representation learning is thus very rewarding when a large amount of labelled data is available, but in many cases we have unlabelled datasets and labelling them is very expensive, which makes supervised representation learning impractical in those scenarios. Unsupervised representation learning solves this problem by learning meaningful representations from unlabelled datasets, and the learned representations can then improve other tasks on related datasets where labels are limited [Schuller_2018, latif2020deep].

II-A2 Unsupervised Representation Learning

In the context of unsupervised representation learning, self-supervised learning has recently become very popular owing to its unprecedented success in computer vision [zhang_colorful:2016, larsson:2016, doersch_un:2015, haque2018image, liu2019selfsupervised, Zhan_2019, feng2019self] and natural language processing [devlin2018bert, wu2019self, su2019vl, wang2019self]. Self-supervised learning methods use information present in unlabelled datasets to provide a supervision signal for feature/representation learning [spyros:2018]. Likewise, in the audio field, researchers have achieved noteworthy performance using self-supervised representation learning. In work from DeepMind [van:2018], the authors proposed a model that learns useful representations from unlabelled speech data by predicting future observations in the latent space. In another work from Google [de2019learning], the representation is learned by predicting the instantaneous frequency from the magnitude of the Fourier transform. Furthermore, Arsha et al. (2020) [A9054057] proposed a cross-modal self-supervised learning method that learns speech representations from the co-relationship between the face and the audio in videos. Other efforts learn general representations by predicting the contextual frames of a particular audio frame, such as wav2vec [schneider2019wav2vec], speech2vec [chung2018speech2vec] and audio word2vec [chung2016audio]. There are other successful applications [kawakami2020learning, riviere2020unsupervised, baevski2019vqwav2vec, baevski2019effectiveness] of self-supervised representation learning in the audio field as well.

Though self-supervised learning is very effective at learning representations from unlabelled datasets, it requires manual effort to design the supervision signal [latif2020deep]. Hence, autoencoders are commonly used to learn representations from unlabelled datasets [amiriparian2017sequence, lee2009unsupervised, Xu_IEE] in a fully unsupervised manner. In [Neumann_2019], the authors learned representations with an autoencoder from a large unlabelled dataset, which improved emotion recognition from speech audio. Similarly, in another work the authors used a denoising autoencoder to improve affect recognition from speech data [ghosh2015learning]. Several works [W8268911, Chorowski_wavenet, hsu2019disentangling] have utilised Variational Autoencoders (VAEs) [kingma:2013] to learn efficient speech representations from unlabelled datasets. More recently, given the popularity of adversarial training, different works have learned robust representations with GANs [J7952656, yu2017adversarial] and Adversarial Autoencoders [sahu2018adversarial, E7966273].

Though learning representations from prodigiously available unlabelled datasets is very intriguing, recent work from Google AI has shown that completely unsupervised representation learning is not possible without any form of supervision [francesco_2019]. Also, a representation learned with an unsupervised method does not guarantee its usability for any post-use-case scenario. Thus, we proposed the Guided Generative Adversarial Neural Network (GGAN) [haque2020guided], which can learn powerful representations from an unlabelled audio dataset according to supervision given by a small amount of labelled data. In the learned representation space, the GGAN disentangles attributes of the data according to the given categories from the labelled dataset, which benefits the related post-use-case scenario; however, generalisation is lost, so the representation cannot be used for unrelated tasks. For example, if the GGAN is guided with a small amount of data with emotion labels and trained on a large number of speech recordings from different people, the GGAN will learn an emotion-related representation while ignoring other attributes such as the gender of the speaker, background noise, pitch, intensity, etc. This helps to improve the emotion recognition task but cannot be used for other tasks such as speaker gender identification [haque2020guided]. Hence, we overcome this shortcoming by proposing the Guided Adversarial Autoencoder (GAAE) model, which learns the general attributes of the unlabelled dataset in the representation space as well as the characteristics specified by the guidance from a few labelled data samples.

II-B Audio Generation

Most audio is periodic, and high-fidelity audio generation requires modelling audio at temporal scales spanning several orders of magnitude, which makes it a challenging problem [engel2019gansynth]. Most research on audio generation is based on audio synthesis; for example, Aaron et al. (2016) proposed a powerful autoregressive model named "WaveNet", which works well for text-to-speech (TTS) synthesis in both English and Mandarin. The authors later improved this work by proposing "Parallel WaveNet", which is 20 times faster than WaveNet. Other research has utilised seq2seq models for TTS, such as Char2Wav [sotelo2017char2wav] and Tacotron [wang2017tacotron]. However, these audio generation methods are conditioned on text data and focus mainly on speech generation. They therefore cannot be generalised to other audio domains, or even to speech data where transcripts are not available.

In the context of generating audio without conditioning on text data, GANs are very promising due to their massive success in the field of computer vision [donahue2019large, dumoulin2016adversarially, DonahueKD16, karras2019analyzing, karras2019style]. However, porting these image GAN architectures directly to the audio domain does not yield similar performance, as the audio waveform is far more complex than an image [chris_wspecgan, engel2019gansynth]. Therefore, researchers have focused on generating spectrograms (2D image-like representations of audio) rather than raw waveforms; the generated spectrogram is then converted back to audio. Chris et al. (2019) [chris_wspecgan] trained a GAN-based model to generate spectrograms and successfully converted them back to audio with the Griffin-Lim algorithm [griffin1984signal]. Furthermore, in the TiFGAN paper [marafioti2019adversarial], the authors proposed the phase-gradient heap integration (PGHI) [zden_25] algorithm for better reconstruction of audio from spectrograms with minimal loss. As the PGHI algorithm is good at reconstructing audio from spectrograms, the challenge becomes generating realistic spectrograms. Since the spectrogram is an image-like representation of the audio, any GAN-based framework from the image domain should be compatible. The BigGAN architecture [Andrew_biggan] has shown promising performance at high-fidelity image generation, but it had not been explored for audio generation. To fill this gap, we proposed the Guided GAN (GGAN) architecture [haque2020guided], which can generate superior audio with a small labelled dataset as guidance. However, the GGAN model suffers from severe mode collapse, which is solved to some extent by a feature loss. Though the GGAN achieved SOTA performance in audio generation, it does not guarantee superiority in terms of the diversity of modes within the generated samples. We improve on this work by proposing the Guided Adversarial Autoencoder (GAAE), which ensures high-fidelity audio generation as well as mode diversity.

II-C Closely Related Architectures

The proposed GAAE model is a semi-supervised model, as we leverage a small amount of labelled data during training. In [Spurr2017], the authors proposed a semi-supervised version of the InfoGAN model [chen2016infogan] to capture specific representations and generation according to supervision from a small number of labelled samples, but the success of this model on complex data distributions is not evident. Other researchers have explored semi-supervision in GAN architectures [springenberg2015unsupervised, sricharan2017semisupervised, lucic2019highfidelity] to improve conditional generation, but most of these works have not been explored in the audio domain, which leaves a major gap to address. The GAAE model is based on the Adversarial Autoencoder (AAE) [makhzani:2016], which we extend to learn task-specific and generalised representations from an unlabelled dataset in a semi-supervised fashion. Furthermore, in the GAAE model we implement a unique way of leveraging a small amount of labelled data for high-fidelity audio generation, and we also propose a way to utilise the generated samples to improve representation learning. Moreover, the building blocks of our GAAE model come from the BigGAN architecture; thus, we further contribute by exploring the use of BigGAN in an autoencoder-based model for audio data.

III Proposed Research Methods

Fig. 1: This figure illustrates the overall architecture of the GAAE model. The different networks of the GAAE model are shown along with the connections between them. The arrows are coloured to indicate the flow of each input/output of the model. For the Discriminator, red boxes show fake samples and green boxes indicate real samples; the unlabelled data sample, the labelled data sample, the reconstructed data sample, the random condition and the known latent distribution are also marked in the figure.

III-A Architecture of the GAAE

The GAAE consists of five neural networks: an Encoder, a Decoder, a Classifier, a Latent Discriminator and a Sample Discriminator, each with its own set of parameters. Figure 1 shows the overall architecture of the model, and the individual networks are described below.

III-A1 Encoder

The Encoder takes an unlabelled data sample drawn from the true data distribution and outputs two latent vectors, drawn from two different continuous distributions learned by the Encoder. We want the first latent, which we call the guided latent, to capture the post-task-specific attributes/characteristics of the data, and the second latent, which we call the style latent, to capture the general/style attributes of the data.
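A minimal PyTorch sketch of such a two-headed Encoder is given below; the convolutional trunk, layer sizes and 128-dimensional latents are illustrative assumptions rather than the BigGAN-style blocks used in our implementation.

import torch
import torch.nn as nn

class TwoHeadEncoder(nn.Module):
    """Maps a spectrogram to a guided latent and a style latent."""
    def __init__(self, guided_dim=128, style_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(                      # shared convolutional trunk
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.guided_head = nn.Linear(64, guided_dim)        # post-task-specific factors
        self.style_head = nn.Linear(64, style_dim)          # general/style factors

    def forward(self, x):
        h = self.backbone(x)
        return self.guided_head(h), self.style_head(h)

encoder = TwoHeadEncoder()
spectrograms = torch.randn(8, 1, 256, 128)                  # a batch of spectrograms
z_guided, z_style = encoder(spectrograms)
print(z_guided.shape, z_style.shape)                        # two [8, 128] tensors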

III-A2 Classifier

We have a Classifier network which is trained with a limited amount of labelled data drawn from a labelled data distribution that is not necessarily the same as the unlabelled one. As the whole model gets its guidance through this data, we call it the "guidance data". The Classifier takes a guided latent vector and predicts the category/class for that latent sample. To train the Classifier, we pass a labelled sample through the Encoder to get its two latent vectors, forward only the guided latent through the Classifier to get the predicted label, and train the Classifier against the true label of the sample, drawn from a categorical distribution with a fixed number of categories/labels. These labels are used as one-hot vectors. For now, let us assume that the Classifier can classify the label of any sample correctly.

III-A3 Decoder

The Decoder maps a style latent and a categorical class/label variable to a data sample. To reconstruct an unlabelled sample, we pass its style latent and its label through the Decoder; as the sample is unlabelled, its label is obtained by passing the guided latent through the Classifier, and the Decoder then outputs the reconstructed sample. We also want to use the Decoder for generating samples according to a given condition, in addition to reconstruction. Therefore, the same style latent is used with a random categorical variable (one-hot vector) sampled from a categorical distribution with n categories/labels, where each category is sampled with equal probability. The Decoder then outputs a generated sample, and the generated data distribution is trained to match the true data distribution. Here, n is the same as the number of categories in the guidance data, and we want the Decoder to generate data according to the categories of the guidance data. We ensure this with the Sample Discriminator, which obtains the labels of the real data from the Classifier. As we use only a small number of labelled samples, the Classifier is hard to train due to the problem of overfitting. So we also use the generated samples to train the Classifier, treating the random condition used for generation as the true label/category of the corresponding generated sample.

The correctness of the conditional generation depends on the Decoder, and the correctness of the labels depends on the Classifier. During training, the Classifier starts to predict the categories of some samples from the given labelled data correctly. The Sample Discriminator therefore learns to identify the correct category for those samples and forces the Decoder to generate samples with attributes related to these correctly classified samples. These generated samples bring with them further characteristics that are not present in the given labelled data but belong to the data distribution. As we feed these generated samples back to the Classifier with their associated conditional categories as correct labels, it learns to predict the correct category for more samples related to those generated samples. These newly correctly classified samples in turn improve the conditional generation of the Decoder. Hence, throughout training, the Classifier and the Decoder continuously improve each other. Meanwhile, the representation learning (latent generation) capability of the Encoder is also ameliorated through the process of reconstructing the input samples, which eventually improves the performance of the Classifier and Decoder as well. A sketch of this feedback loop is given below.
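The sketch assumes the two-headed encoder above together with hypothetical decoder and classifier modules; it only shows how the labelled guidance data and the generated samples (with their random conditions treated as labels) both provide classification targets, with the generated samples detached from the graph as described in Section III-B.

import torch
import torch.nn.functional as F

def classifier_losses(encoder, decoder, classifier, x_labelled, y_labelled,
                      z_style, n_classes):
    # (a) Supervised loss on the small labelled "guidance" set.
    z_guided_labelled, _ = encoder(x_labelled)
    loss_labelled = F.cross_entropy(classifier(z_guided_labelled), y_labelled)
    # (b) Loss on generated samples: the random condition acts as the true label.
    c = torch.randint(0, n_classes, (z_style.size(0),))
    x_generated = decoder(z_style, F.one_hot(c, n_classes).float())
    z_guided_generated, _ = encoder(x_generated.detach())  # generated sample treated as a constant
    loss_generated = F.cross_entropy(classifier(z_guided_generated), c)
    return loss_labelled, loss_generated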

III-A4 Discriminators

The GAAE model has two discriminators: the Sample Discriminator and the Latent Discriminator. The Sample Discriminator ensures that the generated samples and the reconstructed samples match samples from the true data distribution. We train the Sample Discriminator on sample-label pairs. For the reconstructed and generated samples, the labels are the Classifier's prediction and the random condition respectively, and both pairs are treated as fake by the Sample Discriminator. For the real data, both the unlabelled and the labelled samples are used together: the label of an unlabelled sample comes from the Classifier, while for a labelled sample we use the available true label. In terms of distributions, this amounts to mixing the unlabelled and labelled data distributions. So the Sample Discriminator is trained with real samples together with their associated true labels when available, and otherwise with the labels predicted by the Classifier.

The Encoder learns to map the general characteristics of the data into the style latent distribution, excluding the categories covered by the guidance data. If we can draw samples from this style distribution then, by using the categorical distribution as the condition, we can generate diverse data for the different categories of the guidance data from the Decoder. We can only sample from the style distribution if it is known to us. Therefore, we use the Latent Discriminator to force the Encoder to match the style latent distribution to a known distribution, which can be any known continuous random distribution (e.g. a continuous normal distribution or a continuous uniform distribution). The Latent Discriminator is trained to differentiate between true latents drawn from the known distribution and the fake latents produced by the Encoder.
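A minimal sketch of this adversarial latent matching, in the style of an adversarial autoencoder with a standard normal prior and a hinge objective, is shown below; the discriminator architecture is an illustrative placeholder.

import torch
import torch.nn as nn

latent_discriminator = nn.Sequential(          # distinguishes prior samples from Encoder latents
    nn.Linear(128, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

def latent_adversarial_losses(z_style):
    """z_style: style latents produced by the Encoder for one minibatch."""
    z_prior = torch.randn_like(z_style)         # sample from the known prior p(z)
    d_real = latent_discriminator(z_prior)
    d_fake = latent_discriminator(z_style.detach())
    # Hinge loss for the Latent Discriminator.
    d_loss = torch.relu(1.0 - d_real).mean() + torch.relu(1.0 + d_fake).mean()
    # The Encoder is trained to make its style latents look like prior samples.
    e_loss = -latent_discriminator(z_style).mean()
    return d_loss, e_loss

d_loss, e_loss = latent_adversarial_losses(torch.randn(8, 128))
print(d_loss.item(), e_loss.item())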

III-B Losses and Training

III-B1 Encoder, Classifier and Decoder

For the Encoder and Decoder networks, we have a sample generation loss $\mathcal{L}_{gen}$, a sample reconstruction loss $\mathcal{L}_{rec}$ and a latent generation loss $\mathcal{L}_{lat}$. For the generation and discrimination losses we use the hinge loss, and for the reconstruction loss the Mean Squared Error (MSE) is used. For $\mathcal{L}_{gen}$, we take the average of the generation loss for the generated sample and the reconstructed sample. Writing $E_g(\cdot)$ and $E_s(\cdot)$ for the guided and style latents produced by the Encoder, $De$ for the Decoder, $C$ for the Classifier, $D_x$ and $D_z$ for the Sample and Latent Discriminators, $x_u$ for an unlabelled sample, $c$ for a random condition, $x_g = De(E_s(x_u), c)$ for the generated sample and $\hat{x} = De(E_s(x_u), C(E_g(x_u)))$ for the reconstructed sample, we have

$\mathcal{L}_{gen} = -\tfrac{1}{2}\big(\mathbb{E}\,[D_x(x_g, c)] + \mathbb{E}\,[D_x(\hat{x}, C(E_g(x_u)))]\big)$   (1)
$\mathcal{L}_{rec} = \mathbb{E}\,[\lVert x_u - \hat{x} \rVert_2^2]$   (2)
$\mathcal{L}_{lat} = -\mathbb{E}\,[D_z(E_s(x_u))]$   (3)

For the Classifier network, we calculate the classification losses $\mathcal{L}^{l}_{cls}$ and $\mathcal{L}^{g}_{cls}$ for the labelled data samples and the generated samples, respectively. The generated sample $x_g$ is used as a constant here, so it is treated like a real data sample: we only forward-propagate through the Decoder, and no gradient is calculated for generating $x_g$ when it is used for this loss. The model is implemented with PyTorch [paszke2019pytorch], and we detach the gradient of $x_g$ when $\mathcal{L}^{g}_{cls}$ is calculated. With $x_l$ a labelled sample, $y_l$ its true label and $\mathrm{CE}$ the cross-entropy, we have

$\mathcal{L}^{l}_{cls} = \mathbb{E}\,[\mathrm{CE}(C(E_g(x_l)),\, y_l)]$   (4)
$\mathcal{L}^{g}_{cls} = \mathbb{E}\,[\mathrm{CE}(C(E_g(x_g)),\, c)]$   (5)

We combine these losses into a single loss $\mathcal{L}_{ECDe}$ for the Encoder, Classifier and Decoder, calculated as

$\mathcal{L}_{ECDe} = \lambda_{1}\,\mathcal{L}_{gen} + \lambda_{2}\,\mathcal{L}_{rec} + \lambda_{3}\,\mathcal{L}_{lat} + \lambda_{4}\,\mathcal{L}^{l}_{cls} + \lambda_{5}\,\mathcal{L}^{g}_{cls}$   (6)

Here, the weights of the Encoder, Classifier and Decoder networks are updated to minimise $\mathcal{L}_{ECDe}$, where the coefficients are hyperparameters, and the successful training of our GAAE model depends on them. At the beginning of training, we noticed that one of the losses falls rapidly compared with the others and results in very small gradient values; to mitigate this problem, we scale that loss with an additional hyperparameter, and after hyperparameter tuning we found 20 to be an optimal value. The Decoder network is tuned for both the reconstruction loss and the generation loss; to balance these two losses, their weights are constrained to be non-negative and to sum to one, and we can force the model to focus more on either loss by increasing the weight for that particular loss. Likewise, a further group of three weights is constrained to be non-negative and to sum to one. Among the terms of the combined loss, some are responsible for sample generation quality and others for latent generation quality, so to balance sample generation against latent generation we use two additional hyperparameters that are also constrained to sum to one.
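The function below sketches one possible way of expressing such a constrained weighting in code; the grouping of the terms, the default weight values and the assignment of the 20x rescaling to the classification terms are illustrative placeholders rather than the exact scheme used in our experiments.

import torch

def combined_loss(l_gen, l_rec, l_lat, l_cls_labelled, l_cls_generated,
                  w_rec=0.5, w_sample=0.5, cls_scale=20.0):
    # Balance generation against reconstruction (the two weights sum to one).
    sample_term = (1.0 - w_rec) * l_gen + w_rec * l_rec
    # Balance sample quality against latent matching (again summing to one).
    quality_term = w_sample * sample_term + (1.0 - w_sample) * l_lat
    # Rescale the classification terms, which otherwise shrink quickly.
    return quality_term + cls_scale * (l_cls_labelled + l_cls_generated)

print(combined_loss(*[torch.tensor(1.0) for _ in range(5)]))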

III-B2 Discriminator Losses

For the Sample Discriminator $D_x$ and the Latent Discriminator $D_z$, we use the hinge loss. The discrimination loss for the fake samples is averaged, as we calculate it for both the generated sample $x_g$ and the reconstructed sample $\hat{x}$. Let the discrimination losses for $D_x$ and $D_z$ be $\mathcal{L}_{D_x}$ and $\mathcal{L}_{D_z}$, respectively. Therefore,

$\mathcal{L}_{D_x} = \mathbb{E}\,[\min(0, -1 + D_x(x, y))] + \tfrac{1}{2}\big(\mathbb{E}\,[\min(0, -1 - D_x(x_g, c))] + \mathbb{E}\,[\min(0, -1 - D_x(\hat{x}, C(E_g(x_u))))]\big)$   (7)
$\mathcal{L}_{D_z} = \mathbb{E}_{z \sim p(z)}\,[\min(0, -1 + D_z(z))] + \mathbb{E}\,[\min(0, -1 - D_z(E_s(x_u)))]$   (8)

where $(x, y)$ is a real sample with its true label if available and otherwise the label predicted by the Classifier, and $p(z)$ is the known latent prior.

Here, we update the parameters of $D_x$ and $D_z$ to maximise the losses $\mathcal{L}_{D_x}$ and $\mathcal{L}_{D_z}$, respectively. Algorithm 1 shows the training mechanism of the GAAE model.

1: for the number of training iterations do
2:    for k steps do
3:       Sample a minibatch of latent/noise samples from the known prior, conditions (labels) from the categorical distribution, unlabelled data samples from the unlabelled dataset and labelled data samples from the guidance dataset.
4:       Update the Sample Discriminator by ascending its stochastic gradient of $\mathcal{L}_{D_x}$.
5:       Update the Latent Discriminator by ascending its stochastic gradient of $\mathcal{L}_{D_z}$.
6:    end for
7:    Repeat step [3].
8:    Update the Encoder, Decoder and Classifier by descending the stochastic gradient of $\mathcal{L}_{ECDe}$.
9: end for
Algorithm 1: Minibatch stochastic gradient descent training of the proposed GAAE model. The discriminators are updated k times in each iteration; for our experiments, the value of k was chosen for better convergence.
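A schematic PyTorch-style rendering of Algorithm 1 is given below; the module, optimiser and loss-function objects are placeholders, and only the alternation between the k discriminator steps and the single Encoder/Decoder/Classifier step is meant to be read literally.

from itertools import cycle

def train_gaae(unlabelled_loader, labelled_loader, models, optimisers, losses,
               iterations=60_000, k=1):
    """`losses` is assumed to expose one callable per objective; k is a placeholder."""
    encoder, decoder, classifier, d_sample, d_latent = models
    opt_main, opt_d_sample, opt_d_latent = optimisers
    unlab, lab = cycle(unlabelled_loader), cycle(labelled_loader)   # endless minibatches
    for _ in range(iterations):
        for _ in range(k):                                          # step 2: k discriminator updates
            x_u, (x_l, y_l) = next(unlab), next(lab)                # step 3: sample minibatches
            loss_ds = losses.sample_discriminator(x_u, x_l, y_l)    # step 4
            opt_d_sample.zero_grad(); loss_ds.backward(); opt_d_sample.step()
            loss_dl = losses.latent_discriminator(x_u)              # step 5
            opt_d_latent.zero_grad(); loss_dl.backward(); opt_d_latent.step()
        x_u, (x_l, y_l) = next(unlab), next(lab)                    # step 7: resample
        loss_main = losses.encoder_decoder_classifier(x_u, x_l, y_l)  # step 8
        opt_main.zero_grad(); loss_main.backward(); opt_main.step()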

IV Data and Implementation Detail

IV-A Datasets

For training the GAAE model, we used three audio datasets: the S09 dataset [Pete_03209], the Librispeech dataset [panayotov2015librispeech] and the Nsynth dataset [nsynth2017]. The S09 dataset consists of audio recordings of the spoken digits zero to nine. The dataset is very noisy and comprises 23,000 one-second audio samples uttered by 2618 speakers, and the samples are labelled poorly; the S09 dataset only contains labels for the audio digits [Pete_03209]. The Librispeech dataset is an English speech dataset with 1000 hours of audio recordings, available in three subsets containing approximately 100, 300 and 500 hours of recordings, respectively. We used the subset with 100 hours of clean recordings, as we do not need a large number of audio clips from this dataset. In this subset, the recordings are uttered by 251 speakers, of whom 125 are female and 126 are male [panayotov2015librispeech]; for our experiments we used only the audio along with the gender labels of the speakers. The Nsynth dataset contains 305,979 four-second musical notes from ten different instruments, where the sources are either acoustic, electronic or synthetic [nsynth2017]. For this research, we used only three instruments with acoustic sources: guitar, string and mallet.

IV-B Data Preprocessing

To evaluate the GAAE model, we used audio of length one second, where the exact sample size was 16384 and the sampling rate was 16 kHz. For the S09 dataset, we zero-padded the one-second audio clips (16000 samples) to reach a sample size of 16384, whereas for the Librispeech dataset a one-second segment (16384 samples) was taken at a random position from each audio clip. For the Nsynth dataset, the first second (16384 samples) of each audio sample was taken, as it holds the majority of the instrument sound.

The audio data is converted to log-magnitude spectrograms with the short-time Fourier transform, and the log-magnitude spectrograms generated by the GAAE model are converted back to audio using the PGHI algorithm [zden_25]. From now on, we refer to the log-magnitude spectrogram simply as the spectrogram.

To obtain the spectrogram representation of the audio, the short-time Fourier transform was calculated with an overlapping Hamming window of length 512 samples and a hop length of 128 samples, so the size of the resulting spectrogram is 256 x 128. We then standardise the spectrogram as (X - mu) / sigma, where X is the spectrogram, mu is its mean and sigma is its standard deviation. Next, we clip the dynamic range of the spectrogram at a threshold c, where the suitable value of c was 10 for the S09 and Librispeech datasets and 15 for the Nsynth dataset. After clipping, we normalise the spectrogram values between -1 and 1. This spectrogram representation of the audio is used as the input to the GAAE model. The GAAE model likewise generates spectrograms with values between -1 and 1, which we convert to audio with the PGHI algorithm. For ease, we will refer to the audio obtained from generated spectrograms as generated audio throughout the rest of the paper.
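A sketch of this preprocessing pipeline using librosa is shown below; the Hamming window, the 256 x 128 crop and the rescaling of the clipped values to [-1, 1] follow the description above, while the epsilon and the exact cropping strategy are assumptions of the sketch.

import numpy as np
import librosa

def audio_to_spectrogram(wav, n_fft=512, hop=128, clip=10.0, eps=1e-6):
    stft = librosa.stft(wav, n_fft=n_fft, hop_length=hop, window="hamming")
    log_mag = np.log(np.abs(stft) + eps)[:256, :128]               # keep a 256 x 128 patch
    log_mag = (log_mag - log_mag.mean()) / (log_mag.std() + eps)   # standardise
    log_mag = np.clip(log_mag, -clip, clip)                        # clip the dynamic range at c
    return log_mag / clip                                          # scale to [-1, 1]

wav = np.random.randn(16384).astype(np.float32)                    # stand-in one-second clip
spectrogram = audio_to_spectrogram(wav)
print(spectrogram.shape, spectrogram.min(), spectrogram.max())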

IV-C Measurement Metrics

We measure the performance of the GAAE model based on the generated samples and the learned representations. The generated samples are evaluated with the Inception Score (IS) [salimans:2016] and the Fréchet Inception Distance (FID) [heusel2017gans, barratt2018note], as these scores have become a de facto standard for measuring the performance of GAN-based models [shmelkov]. To evaluate the representation/latent learning, we consider classification accuracy, latent space visualisation and latent interpolation.

IV-C1 Inception Score (IS)

The IS score is calculated with the pretrained Inception Network [szegedy2014going], trained on the ImageNet dataset [imagenet_cvpr09]. First, the class predictions are computed for the generated images with the Inception Network. Then the score is calculated as

$\mathrm{IS} = \exp\big(\mathbb{E}_{x}\,[D_{KL}(p(y\,|\,x)\,\|\,p(y))]\big)$   (9)

where $x$ is an image sample, $D_{KL}$ is the Kullback-Leibler divergence (KL divergence) [kl], $p(y\,|\,x)$ is the conditional class distribution for sample $x$ predicted by the Inception Network and $p(y)$ is the marginal class distribution. The IS score thus computes the KL divergence between the conditional label distribution and the marginal label distribution, where a higher value indicates better generation quality.
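Given the class probabilities predicted by a (domain-specific) classifier, the IS score of Eq. (9) can be computed as in the following sketch, where the Dirichlet-sampled probabilities are only stand-ins for real predictions.

import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: array of shape (n_samples, n_classes) holding softmax outputs."""
    p_y = probs.mean(axis=0, keepdims=True)                    # marginal distribution p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))                            # higher is better

probs = np.random.dirichlet(np.ones(10), size=1000)            # stand-in predictions
print(inception_score(probs))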

IV-C2 Fréchet Inception Distance (FID)

The IS score is computed solely on the generated samples; no comparison is made with real samples, so it is not a good measure of the sample diversity (modes) of the generated samples. The FID score solves this problem by comparing real samples with generated samples during the score calculation [shmelkov]. The FID computes the Fréchet distance [dowson1982frechet] between two multivariate Gaussian distributions fitted to the generated and real samples, parameterised by the mean and covariance of the features extracted from an intermediate layer of the pretrained Inception Network. The FID score is calculated as

$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)$   (10)

where $\mu_r$ and $\mu_g$ are the means of the features of the real and generated samples respectively, and $\Sigma_r$ and $\Sigma_g$ are the corresponding covariances. A lower FID score indicates better generation quality.

The Inception Network is trained on the ImageNet dataset and thus offers reliable IS and FID scores for related image datasets, but the spectrograms of audio are entirely different from ImageNet samples, so the Inception Network does not offer trustworthy scores for audio spectrograms. Hence, instead of using the Inception model, we train a classifier network on the audio dataset and use this trained classifier to calculate the IS and FID scores for that particular dataset. For the S09 dataset, we used the pretrained classifier released by the authors of "Adversarial Audio Synthesis" [chris_wspecgan] for a fair comparison, and for the Nsynth dataset we trained a simple convolutional neural network (CNN) as the classifier.
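Given feature vectors extracted from such a domain-specific classifier for real and generated samples, the FID of Eq. (10) can be computed as in the following sketch; the random feature matrices are stand-ins.

import numpy as np
from scipy.linalg import sqrtm

def fid(features_real, features_generated):
    mu_r, mu_g = features_real.mean(0), features_generated.mean(0)
    cov_r = np.cov(features_real, rowvar=False)
    cov_g = np.cov(features_generated, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):                               # drop numerical imaginary parts
        covmean = covmean.real
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(cov_r + cov_g - 2.0 * covmean))

real = np.random.randn(500, 64)                                # stand-in feature vectors
generated = np.random.randn(500, 64) + 0.5
print(fid(real, generated))                                    # lower is better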

IV-D Experimental Setup

First, we evaluate the overall sample generation quality of the GAAE model with the IS and FID scores, calculated on 50,000 samples generated from random style latents and random conditions. The spectrograms of the samples are generated from the Decoder network and converted to audio, and these generated audios are used to calculate the scores. For all datasets, we use a continuous normal distribution of size 128 for the style latent; the ten digit categories (0-9) are used as the conditions for the S09 dataset, and the three instrument categories are used as the conditions for the Nsynth dataset.

The GAAE model is trained with different percentages (from 1% to 5%) of the data as guidance for both the S09 and Nsynth datasets. For any particular percentage of data used as guidance, we trained the GAAE model three times (in each run the guidance data was sampled randomly), and the results are reported as the mean with the standard deviation. Due to the high wall time (approximately 21 hours on two Nvidia P100 GPUs) for each run, we evaluate the model on only three runs per evaluation. The total wall time for the S09 and Nsynth datasets is therefore approximately 21 hours x 3 runs x 5 (percentages of data) x 2 (datasets) = 630 hours, or 26.25 days. Each run takes approximately 60,000 iterations with mixed-precision training [micikevicius2017mixed] and a batch size of 128.

The results of the GAAE model are compared with the existing literature. For comparing the GAAE model with the supervised BigGAN [brock_20118_bigGan] and the unsupervised BigGAN [brock_20118_bigGan], we took the results on the S09 dataset from the GGAN paper [haque2020guided]. For the Nsynth dataset, we trained these models with the same code and settings used in the GGAN paper. To calculate the IS and FID scores for the Nsynth dataset, we used our pretrained simple supervised CNN classifier, which is trained on the three classes (guitar, string and mallet) and achieved 92.01% ± 0.94 accuracy using the augmentation technique mentioned in the recent paper from Google [park2019specaugment].

To evaluate the effectiveness of the guidance in the GAAE model in terms of generating correct samples for different categories/conditions, we manually checked the audio samples generated for different categories on both the S09 and Nsynth datasets. However, it is not possible to check all the generated samples manually. So, we trained a simple convolutional neural network (CNN) classifier on the samples generated for different random conditions/categories, using the random categories associated with the generated samples as the true labels, and then evaluated this CNN classifier on the test dataset in terms of classification accuracy. If the GAAE model does not learn to generate correct samples for a given category, or the generated samples do not match the training data distribution, the CNN model will never achieve good accuracy on the test dataset. In [shmelkov], the authors suggested using this method to evaluate the overall generation quality of a model, alongside the IS and FID scores. For this evaluation, two CNN models with the same architecture are trained on the training data and on the generated data, respectively, where the size of the generated data is equal to that of the training data and the class/category distribution is kept the same. We conducted this experiment for both the S09 and Nsynth datasets. For the sake of comparison, we also trained another two CNN models on the generated samples from the supervised BigGAN and the GGAN model.

For both the S09 and Nsynth datasets, we used a small amount of labelled data from the same dataset as guidance, so we wanted to investigate whether guidance from a completely different dataset works similarly. The S09 dataset has no labels for the gender of the speakers, and we want to generate samples conditioned on the gender category. So we collected audio data from ten random male and ten random female speakers from the completely different Librispeech dataset to use as guidance during training on S09. Here we used the S09 training data as the unlabelled dataset and Librispeech as the labelled dataset providing guidance on the gender labels. As in the above experiments, we used a continuous normal distribution of size 128 for the style latent and the two gender categories as conditions.

Learning a better classifier with fewer labels is another prime goal of the GAAE model. The Classifier network of the GAAE model learns to classify the training data according to the categories of the guidance data: for the S09 dataset it learns to classify the digit categories, and for the Nsynth dataset it learns the instrument classes. We designed the GAAE model to achieve accuracy close to that of a supervised classifier. After training the GAAE model on a particular dataset, we evaluated it on the test-data classification accuracy of that dataset. For comparison, we trained a simple CNN classifier on 1% to 5% of the training data, heavily augmented with the techniques from Google's paper [park2019specaugment] (e.g. adding random noise, rotating the spectrogram, multiplying with random zero patches, etc.). For further comparison, we trained a BiGAN [DonahueKD16] model on top of the unsupervised BigGAN and extracted the feature network after training; we then trained a feed-forward classifier on top of this feature network with 1% to 5% of the labelled data, keeping the weights of the feature network fixed, and evaluated this classifier on the test dataset. As the Classifier of the GAAE model is trained with few labelled samples along with the generated samples from the Decoder, it will only perform well if the quality of the generated samples is close to that of real samples and the generation is accurate for the different categories. If the GAAE model does not learn the categorical distribution of the dataset according to the guidance, it will barely achieve a good result on the test dataset. We conducted this experiment for both the S09 and Nsynth datasets.

In the GAAE model, the Classifier is built on top of the guided latent, so the Encoder should learn this latent in a way that disentangles the class categories of the guidance data. For the S09 dataset, for example, the digit class is used as guidance, so the digit categories should be disentangled in this latent (representation) space. To explore this disentanglement, we visualise the 128-dimensional latent space generated for the S09 test data in the 2D plane with t-SNE (t-distributed stochastic neighbour embedding) [maaten2008visualizing].
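A sketch of this visualisation with scikit-learn's t-SNE is shown below; the random latents and labels are stand-ins for the Encoder outputs and the digit labels of the test set, and the perplexity value is an illustrative choice.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

latents = np.random.randn(1000, 128)               # stand-in for the 128-d guided latents
labels = np.random.randint(0, 10, size=1000)       # stand-in digit labels

embedded = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(latents)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10", s=5)
plt.title("t-SNE of the guided latent space")
plt.savefig("tsne_guided_latents.png")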

The Encoder of the GAAE model is trained to match the style latent distribution to the known prior distribution, so we can sample from that distribution. To explore the learned style representation, we generate audio samples for different categories/conditions while keeping the style latent fixed, and then listen to the audio manually to investigate this scenario.

It is expected that the Decoder of the GAAE model learns to map the latent space to the data distribution, so we can explore the latent space by generating samples for particular latents. To investigate the latent space further, we conduct linear interpolation between two latent points, as in the DCGAN paper [radford2015]: a point between two latent points $z_a$ and $z_b$ is calculated as $z = (1 - \alpha)\,z_a + \alpha\,z_b$, where $\alpha \in [0, 1]$ is the step from $z_a$ to $z_b$. With this equation we obtain latent points between $z_a$ and $z_b$, and from the Decoder we obtain the generated samples for these latent points while keeping the condition fixed.
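The interpolation itself can be sketched as follows; the 128-dimensional latents and the eight steps are illustrative, and the commented line indicates where a hypothetical decoder would turn the interpolated latents into spectrograms.

import torch

def interpolate(z_a, z_b, steps=8):
    """Returns `steps` latents on the line segment from z_a to z_b."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    return (1.0 - alphas) * z_a + alphas * z_b

z_a, z_b = torch.randn(1, 128), torch.randn(1, 128)
z_path = interpolate(z_a, z_b)
print(z_path.shape)                                 # torch.Size([8, 128])
# spectrograms = decoder(z_path, condition.expand(len(z_path), -1))  # hypothetical decoder call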

For implementing our GAAE model, we followed the network implementations, optimisation and hyperparameters from the BigGAN paper [Andrew_biggan]. For optimisation we used the Adam optimiser [Kingma2015AdamAM], with one learning rate for the Encoder, Decoder and Classifier and another for the two Discriminators. The detailed architectures of the networks are given in the supplementary document.

V Results

Fig. 2: The figure illustrates the difference between the generated spectrograms and the real spectrograms of the data for the S09 dataset. The top two rows show randomly generated samples from the GAAE model, and the bottom two rows show real samples from the training data. Note the visual similarity between the generated and the real samples.

Model Name IS Score FID Score
Real (Train Data) [chris_wspecgan] 9.18 ± 0.04 -
Real (Test Data) [chris_wspecgan] 8.01 ± 0.24 -
TiFGAN [andrTIFGAN] 5.97 26.7
WaveGAN [chris_wspecgan] 4.67 ± 0.01 -
SpecGAN [chris_wspecgan] 6.03 ± 0.04 -
Supervised BigGAN 7.33 ± 0.01 24.40 ± 0.50
Unsupervised BigGAN 6.17 ± 0.20 24.72 ± 0.05
GGAN [haque2020guided]
GAAE 7.28 ± 0.01 22.60 ± 0.25
TABLE I: Comparison between the sample generation quality of the GAAE model and the other models for the S09 dataset. The generation quality is measured with the IS score and the FID score.

V-A Sample Generation

Using only 5% of the labelled training data as guidance, the GAAE model achieved an IS score of 7.28 ± 0.01 and an FID score of 22.60 ± 0.25. The IS score is close to that of the supervised BigGAN and better than the other works mentioned in Table I. In terms of the FID score, our GAAE model performed better than all other models in Table I, which indicates greater diversity (more modes). The GAAE model outperformed the supervised BigGAN in terms of diverse generation, even though the GAAE used only 5% labelled data while the supervised BigGAN was trained with 100% labelled training data. As our Decoder is responsible for reconstructing all the training data as well as for generation, it is forced to learn more modes of the data distribution than the supervised BigGAN. Figure 2 displays spectrograms of generated and real samples, and Figure 4 shows samples for different conditions and latents. From these figures we observe that, due to the superior generation quality of the GAAE model, the generated samples are visually indistinguishable from the real samples; this also holds when the spectrograms are converted to audio.

Fig. 3: The figure demonstrates the difference between the generated spectrograms of the GAAE model and the real spectrograms of the data for the Nsynth dataset. The top row shows the generated samples, and the bottom row shows the real samples. The first block shows the spectrograms of the guitar, and the other two illustrate the spectrograms of the string and the mallet.

To further validate the generation capability of the GAAE model, we used the musical instrument dataset Nsynth with three acoustic classes: guitar, string and mallet. The GAAE model achieved an IS score of 2.58 ± 0.01 and an FID score of 141.71 ± 0.50 with 5% of the labelled training data as guidance. In terms of the IS score, the performance is very close to the supervised BigGAN (2.64 ± 0.08) and better than the unsupervised BigGAN (2.21 ± 0.11). In terms of the FID score, the performance is even better than the supervised BigGAN (148.30 ± 0.23). Table II shows the comparison.

So, on both the S09 and Nsynth datasets, the GAAE model achieved generation quality on par with the supervised BigGAN, and in terms of sample diversity it performed better than all other models mentioned in Tables I and II. The audio samples can be found at: https://bit.ly/3coz5qO.

Model Name IS Score FID Score
Real (Train Data) 2.83 ± 0.02 -
Real (Test Data) 2.81 ± 0.12 -
Supervised BigGAN 2.64 ± 0.08 148.30 ± 0.23
Unsupervised BigGAN 2.21 ± 0.11 172.01 ± 0.15
GGAN 2.52 ± 0.06 149.23 ± 0.09
GAAE 2.58 ± 0.01 141.71 ± 0.50
TABLE II: Comparison between the sample generation quality of the GAAE model and the other models for the Nsynth dataset. The generation quality is measured with the IS score and the FID score.

V-B Guided Sample Generation

V-B1 Guidance for learning the categorical distribution

The generated samples for the S09 dataset for different categories are shown in Fig. 4, and the generated samples for the Nsynth dataset are shown in Fig. 3. It is not visually evident from these spectrograms that the model generates correct samples according to the given conditions/categories. Nevertheless, when we converted these spectrograms to audio, it was clear that the model generates audio correctly according to the categories, demonstrating the effectiveness of the guidance for learning the specific categorical distribution of the training dataset. The audio samples can be found at: https://bit.ly/3coz5qO

V-B2 Classification accuracy based on the generated samples

For the S09 dataset, the test data classification accuracy is 95.52% ± 0.50 for the CNN model trained with all the available labelled data and 91.14% ± 0.17 for the CNN model trained on the generated samples from the GAAE model. Table III shows the comparison between the different models. With the generated samples from the GAAE model, the CNN achieved higher classification accuracy than with those from the supervised BigGAN (86.58% ± 0.56) or the GGAN model (86.72% ± 0.47). This result demonstrates the superiority of the GAAE model in terms of sample generation for different categories. The small amount of labelled data used as guidance during training thus helps the GAAE model learn a better conditional distribution of the training data, and shows that the GAAE model performs better than the other models in terms of sample diversity by capturing more modes of the data distribution.

When we trained the CNN model on a mixture of the training data and the generated samples from the GAAE model, the accuracy of the CNN model increased from 95.52% ± 0.50 to 97.33% ± 0.19. Along with the accuracy, the stability of the CNN model also improved significantly in terms of the standard deviation of the results. We conducted the same evaluation on the Nsynth dataset and obtained similar results, which can be found in Table IV. We therefore also propose our GAAE model as a data augmentation model: the generated samples from the GAAE model can be used to augment any related dataset.

V-B3 Effect of the size of the guidance data

The percentage of labelled training data used as guidance has a significant impact on the IS and FID scores, as can be seen from Table V. It is evident from the results that the more labelled data we feed during training, the more we boost the performance of the GAAE model in terms of sample generation and diversity. It is also noticeable that with only 1% labelled guidance data the GAAE model already achieves acceptable performance.

V-B4 Guidance from a different dataset

After calculating the scores for the gender-based training, we noticed a severe collapse in performance, as the GAAE model achieved an IS score of 5.31 ± 1.8 and an FID score of 35.87 ± 3.2. Because two different datasets are mixed during training, the generated samples belong to both data distributions, resulting in poor IS and FID scores, since these scores are calculated with a CNN model trained on the digit classification task for the S09 dataset rather than on a gender classification task. To eradicate this problem, we trained a simple CNN model for gender classification to calculate the IS and FID scores: we randomly selected 15 male and 15 female speakers from the Librispeech dataset, used ten males and ten females for training (split into train and validation at 80%:20%) and the rest for testing, achieved an accuracy of 98.3% ± 0.50, and used this model to calculate the IS and FID scores for the generated samples from the different models. The scores are given in Table VI. There are two GAAE models in the table: one trained with guidance from the S09 dataset with digit labels, and one guided with the gender labels from the Librispeech dataset. Comparing these two GAAE models, the gender-guided GAAE achieved better IS and FID scores than the digit-guided one, which indicates the effectiveness of the guidance in the GAAE model. It is also discernible from the table that the GAAE model achieved better IS and FID scores than the other models even when it was trained with digit-category guidance, which indicates that the GAAE model learned the gender distribution well even though it was guided with digit classes.

Sample for Training Test Accuracy
Train Data 95.52% ± 0.50
Supervised BigGAN 86.58% ± 0.56
GGAN 86.72% ± 0.47
GAAE 91.14% ± 0.17
GAAE + Train Data 97.33% ± 0.19
TABLE III: The comparison between different CNN classifiers based on the test data classification accuracy from the S09 dataset. The CNN models are trained with the generated samples from different models.
Sample for Training Test Accuracy
Train Data 92.01% ± 0.94
Supervised BigGAN 83.50% ± 0.62
GGAN 81.40% ± 0.48
GAAE 86.80% ± 0.23
GAAE + Train Data 94.56% ± 0.09
TABLE IV: The comparison between different CNN classifiers based on the test data classification accuracy from the Nsynth dataset. The CNN models are trained with the generated samples from different models.
Labelled Data IS Score (S09) FID Score (S09) IS Score (Nsynth) FID Score (Nsynth)
1% 6.94 ± 0.04 24.21 ± 0.16 2.48 ± 0.08 145.89 ± 1.32
2% 7.06 ± 0.03 23.89 ± 0.11 2.53 ± 0.07 144.21 ± 0.65
3% 7.12 ± 0.04 23.15 ± 0.10 2.56 ± 0.05 143.01 ± 0.43
4% 7.19 ± 0.02 22.91 ± 0.08 2.57 ± 0.04 142.46 ± 0.38
5% 7.28 ± 0.01 22.60 ± 0.07 2.58 ± 0.03 141.71 ± 0.32
100% 7.45 ± 0.03 19.31 ± 0.01 2.67 ± 0.02 137.65 ± 0.02
TABLE V: The relationship between the percentage of the data used as guidance during the training and the sample generation quality of the GAAE model, measured with the IS and the FID score. The scores are calculated for the S09 and the Nsynth dataset.
Model Name IS Score FID Score
Train Data 1.92 ± 0.04 -
Test Data 1.91 ± 0.05 -
Unsupervised BigGAN 1.13 ± 0.89 56.01 ± 0.85
Supervised BigGAN 1.48 ± 0.56 35.22 ± 0.50
GGAN (Digit Guided)
GAAE
GAAE (Gender Guided)
TABLE VI: Comparison between the performance of the GAAE model trained with gender guidance and the other models on the S09 dataset, in terms of the quality of the generated samples based on the gender attributes of the speaker, measured with the IS and the FID score.

Fig. 4: This figure shows the generated spectrograms of the S09 dataset from the GAAE model according to different digit categories. Each row represents the samples generated for a fixed latent variable where the digit condition is changed from 0 to 9. Furthermore, each column shows the generated spectrogram for a particular digit category.

V-C Sample Classification

With 5% labelled data as guidance, the GAAE model achieved a digit classification accuracy of 94.6% ± 0.03 on the S09 test dataset, where the classification accuracy of the fully supervised CNN classifier is 95.52% ± 0.50. For the Nsynth dataset, the GAAE model achieved an accuracy of 94.89% ± 0.01, which is better than the accuracy of the supervised CNN (92.01% ± 0.94). The relationship between the percentage of data used as guidance and the test data classification accuracy is shown in Tables VII and VIII for the S09 and Nsynth datasets, respectively. The results in both tables demonstrate that the classification accuracy on the test data increases, along with the stability (standard deviation of the accuracy), as we increase the percentage of data used as guidance. The GAAE model also outperformed the other models in terms of achieving better classification accuracy while leveraging a minimal amount of labelled data. From the tables, we observe that the GAAE model performs better than the supervised classifier when it is trained with 100% labelled data, because its Classifier takes advantage of the generated samples as well as the labelled data. So for any classification task, our GAAE model can be used instead of a supervised classifier. The GAAE model achieved this classification accuracy thanks to the generation of samples of superior quality for the different categories.

V-D Representation Learning

The GAAE model learns two types of representations/latent spaces: one capturing the guidance-specific characteristics of the data (the guided or post-task-specific representation) and one capturing the general characteristics of the data (the general or style representation).

V-D1 Guided Representation Learning

To investigate the impact of the guidance on the representation, we have visualised the latent space in the 2D plane. Figure 5 shows the representation space for the S09 test dataset, and Figure 6 shows the visualisation for the Nsynth dataset. In both figures, it is noticeable that the guided categories are clustered together and well separated in the representation space. Hence, the Encoder has successfully learned to map a data sample to the representation (latent) space in such a way that the data categories used as guidance are easily separable in the representation space.
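A minimal sketch of how such a t-SNE plot can be produced with scikit-learn is given below; `encoder` and `test_loader` are placeholders for the trained Encoder network and the test-set pipeline, and the assumption that the Encoder returns the guided and style latents as a pair is ours.

```python
import numpy as np
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

@torch.no_grad()
def plot_guided_latent(encoder, test_loader, device="cuda"):
    """Project the guided latents of the test set to 2-D with t-SNE and plot them."""
    encoder = encoder.to(device).eval()
    latents, labels = [], []
    for x, y in test_loader:
        z_guided, _ = encoder(x.to(device))      # assumed: encoder returns (guided, style)
        latents.append(z_guided.cpu().numpy())
        labels.append(y.numpy())
    z2d = TSNE(n_components=2, perplexity=30).fit_transform(np.concatenate(latents))
    labels = np.concatenate(labels)
    plt.scatter(z2d[:, 0], z2d[:, 1], c=labels, s=4, cmap="tab10")
    plt.colorbar(label="guided category")
    plt.show()
```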

V-D2 General Representation/Style Representation

After investigating the generated audios, we noticed that the speaker's voice, the audio pitch and the background noise remain the same for any fixed general latent sample. Therefore, for the S09 dataset, the audio samples generated from a given general latent under different digit categories share a similar speaker voice and background noise. We noticed a similar pattern for the Nsynth dataset.

Fig. 5: t-SNE visualisation of the learnt representation of the test data of the S09 dataset. Here, different colours of points represent different digit categories. In the representation space, the different digit categories are clustered together and easily separable.

Fig. 6: t-SNE visualisation of the learnt representation of the test data of the Nsynth dataset. Here, different colours of points represent different instrument categories. In the representation space, the different instrument categories are clustered together and easily separable.

The digit category of a generated audio changes according to the given condition, while the general characteristics of the audio change with the general latent, which implies that the Encoder has learned to capture the general attributes of the data. If this is true, then the pretrained Encoder should be able to extract the general attributes into the latent space from any related dataset that was not used during training. To explore this scenario, we passed the test data of the S09 dataset through the pretrained Encoder to obtain the general representation. Then, for a fixed general latent and different conditions (digit categories), we generated samples from the pretrained Decoder network. After converting the generated samples to audio, we noticed that the generated audios preserved characteristics such as speaker gender, voice, pitch, tone and background noise from the input data sample (S09 test data). We noticed similar behaviour for the Nsynth dataset. The audios can be found at the link: https://bit.ly/36Oz9z9. Here, the first second of each audio is the input audio, and the rest are the generated audios.
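The probe described above can be sketched as follows: a test spectrogram is encoded to obtain its general/style latent, and the pretrained Decoder then generates one spectrogram per digit condition from that fixed latent. The function and argument names are illustrative assumptions, not the paper's API.

```python
import torch

@torch.no_grad()
def regenerate_all_digits(encoder, decoder, x, n_classes=10, device="cuda"):
    """Keep the style latent of one input spectrogram and sweep the digit condition 0..9."""
    x = x.to(device).unsqueeze(0)                # one spectrogram, e.g. shape (1, 1, 256, 128)
    _, z_style = encoder(x)                      # assumed: encoder returns (guided, style)
    z_style = z_style.repeat(n_classes, 1)       # reuse the same style latent for every digit
    y = torch.arange(n_classes, device=device)   # digit conditions 0..9
    return decoder(z_style, y)                   # generations sharing the input's voice/style
```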

Since the GAAE model learns the general/style attributes of the S09 dataset in the latent space, we can expect that it has also disentangled the gender of the speaker. To evaluate this, we used the trained Encoder of the GAAE model to extract latents for the entirely different LibriSpeech dataset, where gender labels are available. For 5000 samples randomly drawn from the LibriSpeech dataset, we extracted the features/latents with the Encoder and visualised them in the 2D plane using t-SNE. Figure 7 shows the visualisation, where we can observe that the latents of speakers of the same gender are clustered together and easily separable in the latent space. This exploration shows that the GAAE model successfully learned the gender attribute of the speaker from the S09 dataset, even though gender information was never used during training.

Fig. 7: t-SNE visualisation of the learnt representation of the LibriSpeech dataset. Here, different colours of points represent the gender of the speakers. The representations of the different gender categories are clustered together.

To further examine the general latent space, we linearly interpolated between two fixed latent samples. Figure 8 shows the samples generated for both the S09 and Nsynth datasets based on these interpolated points. From the figure, we can observe that the transition between the two spectrograms generated from the two fixed latents is very smooth. Moreover, when we converted the spectrograms to audio, we observed the same smooth transition, which indicates the disentanglement of the general attributes in the latent space. The audios can be found at the link: https://bit.ly/36Oz9z9
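A sketch of the interpolation behind Figure 8, under the same naming assumptions as above, is given below; `decoder` is the pretrained conditional Decoder and `class_id` the fixed digit/instrument condition.

```python
import torch

@torch.no_grad()
def interpolate_latents(decoder, z1, z2, class_id, steps=8):
    """Generate spectrograms along the straight line between two style latents z1 and z2."""
    alphas = torch.linspace(0.0, 1.0, steps, device=z1.device).view(-1, 1)
    z = (1.0 - alphas) * z1 + alphas * z2                        # linear interpolation
    y = torch.full((steps,), class_id, dtype=torch.long, device=z1.device)
    return decoder(z, y)                                         # fixed condition along the path
```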

Therefore, it is evident from these explorations that the GAAE model was able to learn a pre-specified representation (the guided representation) as well as a representation of the general attributes/characteristics of the dataset, leveraging a very small amount of labelled data as guidance.

Fig. 8: This figure shows the generated spectrograms based on the linear interpolation between two latent samples. The first two rows show the generated spectrograms for the S09 dataset, and the bottom two rows show the spectrograms for the Nsynth dataset. In each row, the first and last spectrograms are generated from the two fixed latent points, and the in-between spectrograms are generated from the interpolations between these two fixed points.

Fig. 9: The figure shows the relationship between the hyperparameters of the GAAE model (Equation 6) and its evaluation metrics. The top two plots show the relationship between the loss weights and the IS and FID scores, while the bottom two plots illustrate the impact of the loss weights on the classification accuracy.
Training Data Size    CNN Network    BiGAN    GGAN    GAAE
1%    82.21 ± 1.2    73.01 ± 1.02    84.21 ± 2.24    90.21 ± 0.16
2%    83.04 ± 0.34    75.56 ± 0.41    85.39 ± 1.24    91.45 ± 0.12
3%    83.78 ± 0.23    78.33 ± 0.07    88.25 ± 0.10    92.67 ± 0.06
4%    84.11 ± 0.34    80.03 ± 0.01    91.02 ± 0.50    93.70 ± 0.05
5%    84.50 ± 1.02    80.84 ± 1.72    92.00 ± 0.87    94.59 ± 0.03
100%    95.52 ± 0.50    86.77 ± 2.61    96.51 ± 0.07    97.68 ± 0.01
TABLE VII: The relationship between the percentage of the data used as the guidance during the training and the S09 test dataset classification accuracy of the GAAE model.
Training Data Size    CNN Network    BiGAN    GGAN    GAAE
1%    85.76 ± 1.10    82.21 ± 0.84    88.52 ± 0.32    90.26 ± 0.09
2%    89.79 ± 0.51    86.65 ± 0.57    91.69 ± 0.24    92.96 ± 0.07
3%    89.83 ± 0.49    87.21 ± 0.46    91.95 ± 0.20    93.12 ± 0.05
4%    90.52 ± 0.25    87.59 ± 0.41    92.16 ± 0.19    93.73 ± 0.02
5%    91.07 ± 0.31    87.95 ± 0.39    92.45 ± 0.14    94.23 ± 0.02
100%    92.01 ± 0.94    88.09 ± 0.24    93.56 ± 0.09    94.89 ± 0.01
TABLE VIII: The relationship between the percentage of the data used as the guidance during the training and the Nsynth test dataset classification accuracy of the GAAE model.

VI Impact of the Hyperparameters

We tuned the hyperparameters on the S09 dataset only, because tuning requires an extensive amount of resources and time; we then reused those hyperparameters for the other datasets. In Equation 6, the generation weight and the reconstruction weight are two important hyperparameters for training the GAAE model, and they are complementary (the reconstruction weight equals one minus the generation weight). When we increase the generation weight, the model focuses more on generation and less on reconstruction; when we reduce it, the model increases its focus on reconstruction and reduces its focus on generation. The relationship between these weights and the IS score, the FID score and the classification accuracy can be found in Figure 9. The optimal value is 0.6 for the generation weight and 0.4 for the reconstruction weight. Equation 6 also contains two main balancing hyperparameters: one determines how much the model focuses on the generation and reconstruction losses, and the other determines the focus on the classification and latent losses. From Figure 9, we can observe that 0.5 is the optimal value for both of these hyperparameters. Equation 6 contains three more hyperparameters: two of them determine the focus on the classification loss for the labelled data and the generated data, respectively, and the third determines the focus on the latent loss. Here, an equal balance between the classification and latent losses is optimal, so we used 0.25 for each of the classification weights and 0.50 for the latent weight.
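The weighting scheme described above can be summarised by the sketch below. The term and weight names are descriptive placeholders for the components of Equation 6 (whose symbols are defined earlier in the paper), with the reported optimal values used as defaults; it illustrates how the losses are balanced rather than reproducing the exact training code.

```python
def gaae_objective(gen_loss, rec_loss, cls_labelled, cls_generated, latent_loss,
                   w_gen=0.6, w_genrec_block=0.5, w_guidance_block=0.5,
                   w_cls_labelled=0.25, w_cls_generated=0.25, w_latent=0.5):
    """Sketch of the weighted GAAE objective: the generation and reconstruction weights
    are complementary (w_rec = 1 - w_gen), and the two block weights (0.5 each) balance
    the generation/reconstruction block against the classification/latent block."""
    w_rec = 1.0 - w_gen
    gen_rec = w_gen * gen_loss + w_rec * rec_loss
    guidance = (w_cls_labelled * cls_labelled
                + w_cls_generated * cls_generated
                + w_latent * latent_loss)
    return w_genrec_block * gen_rec + w_guidance_block * guidance
```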

VII Conclusion and Lessons Learnt

We have proposed the Guided Adversarial Autoencoder (GAAE), a model that learns conditional audio generation from the labelled data samples used as guidance/supervision during training. Evaluating the GAAE model on one-second audio clips, we have shown that it can outperform the existing literature in terms of generation quality and mode diversity while using only 5% of the labelled samples as guidance. Furthermore, owing to its superior sample generation ability, we have also proposed the GAAE model as a data augmentation model.

Along with high-fidelity audio generation, our GAAE model was able to disentangle the post-task-specific characteristics/attributes of the data in the learned latent space with few labelled samples as guidance. We have demonstrated that the guidance strategy during training helps the model focus on specific attributes of the dataset during representation learning. Moreover, we have shown that our GAAE model can outperform a supervised classifier even when the latter is trained with all the available labelled data. Along with the post-task-specific representation, the GAAE model is capable of learning the other factors of variation of the training data in a separate latent space. Hence, the GAAE model learns a guided representation for the specific posterior task at hand and a generalised representation for future, related tasks.

The GAAE model was evaluated on audio of one-second duration; thus, it remains a challenge to make this model work for generating longer audio samples. In the context of representation learning, the GAAE model can still be used efficiently for long audio samples by dividing them into one-second chunks. As we have achieved successful generation and representation learning with as little as 1% labelled data as guidance, we believe that our work will encourage other researchers to explore the GAAE model further for few-shot learning, where the model may perform similarly with a very small number of labelled examples. We built the GAAE model on the BigGAN architecture, which leaves a great opportunity for researchers to study the Progressive GAN or StyleGAN architectures within the GAAE framework.

References

Appendix A Supplement Material

This section presents the details of the neural networks used in this paper. We have followed the abbreviations and description style of Lučić et al. [lucic2019highfidelity].

Full Name    Abbreviation
Resample    RS
Batch normalisation    BN
Conditional batch normalisation    cBN
Downscale    D
Upscale    U
Spectral normalisation    SN
Input height    h
Input width    w
True label    y
Input channels    c_i
Output channels    c_o
Number of channels    ch
TABLE IX: Abbreviations for defining the architectures.

A-A Supervised BigGAN

We have taken the exact implementation of the supervised BigGAN from our GGAN paper [haque2020guided]. For the implementation of both the Generator and the Discriminator, we used the ResNet architecture from the BigGAN paper [Andrew_biggan]. The ResBlock layers are shown in Tables X and XI, and the Generator and Discriminator architectures are shown in Tables XII and XIII, respectively. We use a learning rate of and for the Generator and the Discriminator, respectively. We set the number of channels (ch) to 16 to minimise the computational expense, as higher numbers of channels such as 32 and 64 offer only negligible improvements.

Layer Name    Kernel Size    RS    Output Size
Shortcut    [1,1,1]    U    2h × 2w × c_o
cBN, ReLU    -    -    h × w × c_i
Convolution    [3,3,1]    U    2h × 2w × c_o
cBN, ReLU    -    -    2h × 2w × c_o
Convolution    [3,3,1]    U    2h × 2w × c_o
Addition    -    -    2h × 2w × c_o
TABLE X: Architecture of the ResBlock generator with upsampling for the supervised BigGAN.
Layer Name    Kernel Size    RS    Output Size
Shortcut    [1,1,1]    D    h/2 × w/2 × c_o
ReLU    -    -    h × w × c_i
Convolution    [3,3,1]    -    h × w × c_o
ReLU    -    -    h × w × c_o
Convolution    [3,3,1]    D    h/2 × w/2 × c_o
Addition    -    -    h/2 × w/2 × c_o
TABLE XI: Architecture of the ResBlock discriminator with downsampling for the supervised BigGAN.
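As an illustration of the ResBlock structure in Tables X and XI, the sketch below implements a BigGAN-style upsampling ResBlock with conditional batch normalisation in PyTorch. It mirrors the layer order of Table X (a 1×1 shortcut convolution with upsampling, and 3×3 convolutions preceded by cBN and ReLU), but it is a generic re-implementation rather than the code used in the paper; spectral normalisation is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalBatchNorm2d(nn.Module):
    """Batch norm whose scale and shift are predicted from a class embedding."""
    def __init__(self, num_features, num_classes):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.embed = nn.Embedding(num_classes, num_features * 2)

    def forward(self, x, y):
        gamma, beta = self.embed(y).chunk(2, dim=1)
        gamma = gamma.view(-1, gamma.size(1), 1, 1)
        beta = beta.view(-1, beta.size(1), 1, 1)
        return self.bn(x) * (1 + gamma) + beta

class UpResBlock(nn.Module):
    """Upsampling ResBlock: 1x1 shortcut conv + (cBN, ReLU, upsample, 3x3 conv) path."""
    def __init__(self, c_in, c_out, num_classes):
        super().__init__()
        self.bn1 = ConditionalBatchNorm2d(c_in, num_classes)
        self.conv1 = nn.Conv2d(c_in, c_out, 3, padding=1)
        self.bn2 = ConditionalBatchNorm2d(c_out, num_classes)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1)
        self.shortcut = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x, y):
        s = self.shortcut(F.interpolate(x, scale_factor=2))      # upsampled shortcut path
        h = F.interpolate(F.relu(self.bn1(x, y)), scale_factor=2)
        h = self.conv1(h)
        h = self.conv2(F.relu(self.bn2(h, y)))
        return h + s
```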
Layer Name    RS    SN    Output Size
Input z    -    -    128
Dense    -    -    4 × 2 × 16·ch
ResBlock    U    SN    8 × 4 × 16·ch
ResBlock    U    SN    16 × 8 × 16·ch
ResBlock    U    SN    32 × 16 × 16·ch
ResBlock    U    SN    64 × 32 × 16·ch
ResBlock    U    SN    128 × 64 × 16·ch
Non-local block    -    -    128 × 64 × 16·ch
ResBlock    U    SN    256 × 128 × 1·ch
BN, ReLU    -    -    256 × 128 × 1
Conv [3, 3, 1]    -    -    256 × 128 × 1
Tanh    -    -    256 × 128 × 1
TABLE XII: Architecture of the generator for the supervised BigGAN.
Layer Name    RS    Output Size
Input Spectrogram    -    256 × 128 × 1
ResBlock    D    128 × 64 × 1·ch
Non-local block    -    128 × 64 × 1·ch
ResBlock    -    64 × 32 × 1·ch
ResBlock    D    32 × 16 × 2·ch
ResBlock    D    16 × 8 × 4·ch
ResBlock    D    8 × 4 × 8·ch
ResBlock    D    4 × 2 × 16·ch
ResBlock (No Shortcut)    -    4 × 2 × 16·ch
ReLU    -    4 × 2 × 16·ch
Global sum pooling    -    1 × 1 × 16·ch
Sum(embed(y)·h) + (dense → 1)    -    1
TABLE XIII: Architecture of the discriminator for the supervised BigGAN.

A-B Unsupervised BigGAN

Similarly, for the unsupervised BigGAN, we have followed the same implementation from the GGAN paper [haque2020guided]. Tables XIV and XV show the upsampling and downsampling layers, respectively. The architectures of the Generator and the Discriminator are shown in Tables XVI and XVII, respectively. The learning rates and the number of channels are the same as for the supervised BigGAN.

Layer Name    Kernel Size    RS    Output Size
Shortcut    [1,1,1]    U    2h × 2w × c_o
BN, ReLU    -    -    h × w × c_i
Convolution    [3,3,1]    U    2h × 2w × c_o
BN, ReLU    -    -    2h × 2w × c_o
Convolution    [3,3,1]    U    2h × 2w × c_o
Addition    -    -    2h × 2w × c_o
TABLE XIV: Architecture of the ResBlock generator with upsampling for the unsupervised BigGAN.
Layer Name    Kernel Size    RS    Output Size
Shortcut    [1,1,1]    D    h/2 × w/2 × c_o
ReLU    -    -    h × w × c_i
Convolution    [3,3,1]    -    h × w × c_o
ReLU    -    -    h × w × c_o
Convolution    [3,3,1]    D    h/2 × w/2 × c_o
Addition    -    -    h/2 × w/2 × c_o
TABLE XV: Architecture of the ResBlock discriminator with downsampling for the unsupervised BigGAN.
Layer Name    RS    SN    Output Size
Input z    -    -    128
Dense    -    -    4 × 2 × 16·ch
ResBlock    U    SN    8 × 4 × 16·ch
ResBlock    U    SN    16 × 8 × 16·ch
ResBlock    U    SN    32 × 16 × 16·ch
ResBlock    U    SN    64 × 32 × 16·ch
ResBlock    U    SN    128 × 64 × 16·ch
Non-local block    -    -    128 × 64 × 16·ch
ResBlock    U    SN    256 × 128 × 1·ch
BN, ReLU    -    -    256 × 128 × 1
Conv [3, 3, 1]    -    -    256 × 128 × 1
Tanh    -    -    256 × 128 × 1
TABLE XVI: Architecture of the generator for the unsupervised BigGAN.
Layer Name    RS    Output Size
Input Spectrogram    -    256 × 128 × 1
ResBlock    D    128 × 64 × 1·ch
Non-local block    -    128 × 64 × 1·ch
ResBlock    -    64 × 32 × 1·ch
ResBlock    D    32 × 16 × 2·ch
ResBlock    D    16 × 8 × 4·ch
ResBlock    D    8 × 4 × 8·ch
ResBlock    D    4 × 2 × 16·ch
ResBlock (No Shortcut)    -    4 × 2 × 16·ch
ReLU    -    4 × 2 × 16·ch
Global sum pooling    -    1 × 1 × 16·ch
Dense    -    1
TABLE XVII: Architecture of the discriminator for the unsupervised BigGAN.

A-C BiGAN

For the BiGAN model, we have trained a Feature Extractor and a Discriminator network on top of the unsupervised BigGAN. The Feature Extractor network creates the features for the real samples, and the Discriminator tries to differentiate between the generated features and the random noise. The details follow the BiGAN paper [DonahueKD16] exactly. The downsampling layer is the same as for the unsupervised BigGAN and can be found in Table XV. The architecture of the Feature Extractor network is shown in Table XVIII, and the architecture of the Discriminator is given in Table XIX.
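A sketch of the joint Discriminator of Table XIX is given below: the spectrogram is processed by the ResBlock trunk, globally sum-pooled, concatenated with the 128-dimensional feature vector, and passed through a small dense head to produce a single logit. `conv_trunk` is a placeholder for the ResBlock stack of Table XIX, and the dimensions assume ch = 16 (so the pooled feature is 256-dimensional).

```python
import torch
import torch.nn as nn

class JointDiscriminator(nn.Module):
    """Scores (spectrogram, feature) pairs, in the spirit of Table XIX."""
    def __init__(self, conv_trunk, pooled_dim=256, feat_dim=128):
        super().__init__()
        self.conv_trunk = conv_trunk               # ResBlock stack ending in 16·ch feature maps
        self.head = nn.Sequential(
            nn.Linear(pooled_dim + feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, spectrogram, feature):
        h = self.conv_trunk(spectrogram)           # (N, C, H, W) feature maps
        h = torch.relu(h).sum(dim=(2, 3))          # global sum pooling -> (N, C)
        return self.head(torch.cat([h, feature], dim=1))   # single real/fake logit
```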

Layer Name    RS    Output Size
Input Spectrogram    -    256 × 128 × 1
ResBlock    D    128 × 64 × 1·ch
Non-local block    -    128 × 64 × 1·ch
ResBlock    -    64 × 32 × 1·ch
ResBlock    D    32 × 16 × 2·ch
ResBlock    D    16 × 8 × 4·ch
ResBlock    D    8 × 4 × 8·ch
ResBlock    D    4 × 2 × 16·ch
ResBlock (No Shortcut)    -    4 × 2 × 16·ch
ReLU    -    4 × 2 × 16·ch
Global sum pooling    -    1 × 1 × 16·ch
Dense    -    128
TABLE XVIII: Architecture of the Feature Extractor Network for the BiGAN.
Layer Name    RS    Output Size
Input Spectrogram    -    256 × 128 × 1
ResBlock    D    128 × 64 × 1·ch
Non-local block    -    128 × 64 × 1·ch
ResBlock    -    64 × 32 × 1·ch
ResBlock    D    32 × 16 × 2·ch
ResBlock    D    16 × 8 × 4·ch
ResBlock    D    8 × 4 × 8·ch
ResBlock    D    4 × 2 × 16·ch
ResBlock (No Shortcut)    -    4 × 2 × 16·ch
ReLU    -    4 × 2 × 16·ch
Global sum pooling    -    1 × 1 × 16·ch
Concat with input feature    -    256 + 128 = 384
Dense    -    128
ReLU    -    128
Dense    -    1
TABLE XIX: Architecture of the Discriminator for the BiGAN.

A-D GAAE

In the GAAE model, the upsampling and downsampling layers are the same as those shown in Tables X and XI, respectively.

The Encoder architecture is given in Table XX, where we use two Dense layers on top of the Global sum pooling layer to obtain the guided latent and the general/style latent. For the Decoder, the conditional vector is given through the conditional Batch Normalisation (cBN) of the upsampling layers. The Classifier network is built from Dense layers, and its architecture is given in Table XXII. For the Sample Discriminator, we have exactly followed the implementation of Table XIII; here, y is the conditional vector and h is the output of the Global sum pooling layer. For the Latent Discriminator, we have used multiple Dense layers, and its architecture is given in Table XXIII.
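The two-headed Encoder output of Table XX can be sketched as follows; `conv_trunk` again stands for the ResBlock stack, and the names of the two latent heads are illustrative.

```python
import torch
import torch.nn as nn

class GaaeEncoderHead(nn.Module):
    """Two dense heads on top of the global-sum-pooled features: one for the guided
    latent and one for the general/style latent, as in Table XX."""
    def __init__(self, conv_trunk, pooled_dim=256, latent_dim=128):
        super().__init__()
        self.conv_trunk = conv_trunk
        self.to_guided = nn.Linear(pooled_dim, latent_dim)
        self.to_style = nn.Linear(pooled_dim, latent_dim)

    def forward(self, spectrogram):
        h = self.conv_trunk(spectrogram)
        h = torch.relu(h).sum(dim=(2, 3))           # global sum pooling
        return self.to_guided(h), self.to_style(h)  # (guided latent, style latent)
```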

The learning rates for both Discriminators are , and for the other networks the learning rate was . We set the number of channels to for all the experiments carried out with the GAAE.

A-E Simple Classifier

For many classification tasks throughout the paper, we have referred to a Simple Classifier. The architecture of these classifiers follows Table XXIV, where c is the number of outputs according to the classification categories. The learning rate used for this Classifier network is .

Layer Name    RS    Output Size
Input Spectrogram    -    256 × 128 × 1
ResBlock    D    128 × 64 × 1·ch
Non-local block    -    128 × 64 × 1·ch
ResBlock    -    64 × 32 × 1·ch
ResBlock    D    32 × 16 × 2·ch
ResBlock    D    16 × 8 × 4·ch
ResBlock    D    8 × 4 × 8·ch
ResBlock    D    4 × 2 × 16·ch
ResBlock (No Shortcut)    -    4 × 2 × 16·ch
ReLU    -    4 × 2 × 16·ch
Global sum pooling    -    1 × 1 × 16·ch
Dense, Dense    -    128, 128
TABLE XX: Architecture of the Encoder for the GAAE.
Layer Name    RS    SN    Output Size
Input latent vector    -    -    128
Dense    -    -    4 × 2 × 16·ch
ResBlock    U    SN    8 × 4 × 16·ch
ResBlock    U    SN    16 × 8 × 16·ch
ResBlock    U    SN    32 × 16 × 16·ch
ResBlock    U    SN    64 × 32 × 16·ch
ResBlock    U    SN    128 × 64 × 16·ch
Non-local block    -    -    128 × 64 × 16·ch
ResBlock    U    SN    256 × 128 × 1·ch
BN, ReLU    -    -    256 × 128 × 1
Conv [3, 3, 1]    -    -    256 × 128 × 1
Tanh    -    -    256 × 128 × 1
TABLE XXI: Architecture of the Decoder for the GAAE.
Layer Name    Output Size
Input latent vector    128
Dense    128
ReLU    128
Dense    10
TABLE XXII: Architecture of the Classifier for the GAAE.
Layer Name    Output Size
Input latent vector    128
Dense    128
ReLU    128
Dense    128
ReLU    128
Dense    1
TABLE XXIII: Architecture of the Latent Discriminator for the GAAE.
Layer Name    Output Size
Input Spectrogram    256 × 128 × 1
Convolution [3, 3, 32]    256 × 128 × 32
Maxpool [2, 2]    128 × 64 × 32
Convolution [3, 3, 64]    128 × 64 × 64
Maxpool [2, 2]    64 × 32 × 64
Convolution [3, 3, 128]    64 × 32 × 128
Maxpool [2, 2]    32 × 16 × 128
Convolution [3, 3, 256]    32 × 16 × 256
Maxpool [2, 2]    16 × 8 × 256
Dense    c
TABLE XXIV: Architecture of the Simple Spectrogram Classifier.
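A sketch of the Simple Classifier of Table XXIV in PyTorch is shown below; the ReLU activations between the convolution and max-pooling stages are an assumption, as the table lists only the convolution, pooling and dense layers.

```python
import torch.nn as nn

def simple_spectrogram_classifier(n_classes):
    """Four convolution + max-pooling stages followed by a dense output layer,
    following Table XXIV (ReLU activations assumed). Input: (N, 1, 256, 128)."""
    return nn.Sequential(
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(16 * 8 * 256, n_classes),   # 256x128 input -> 16x8 map after four pools
    )
```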

Kazi Nazmul Haque is a PhD student at the University of Southern Queensland, Australia. He has been working professionally in the field of machine learning for more than five years. Kazi’s research focuses on building machine learning models to solve diverse real-world problems, with a current focus on unsupervised representation learning for audio data. He completed his Master of Information Technology at Jahangirnagar University, Bangladesh.

Rajib Rana is an experimental computer scientist, Advance Queensland Research Fellow and a Senior Lecturer in the University of Southern Queensland. He is also the Director of IoT Health research program at the University of Southern Queensland. He is the recipient of the prestigious Young Tall Poppy QLD Award 2018 as one of Queensland’s most outstanding scientists for achievements in the area of scientific research and communication. Rana’s research work aims to capitalise on advancements in technology along with sophisticated information and data processing to better understand disease progression in chronic health conditions and develop predictive algorithms for chronic diseases, such as mental illness and cancer. His current research focus is on Unsupervised Representation Learning. He received his B.Sc. degree in Computer Science and Engineering from Khulna University, Bangladesh with Prime Minister and President’s Gold medal for outstanding achievements and Ph.D. in Computer Science and Engineering from the University of New South Wales, Sydney, Australia in 2011. He received his postdoctoral training at Autonomous Systems Laboratory, CSIRO before joining the University of Southern Queensland as Faculty in 2015.

Björn W. Schuller (M’05-SM’15-F’18) received his diploma in 1999, his doctoral degree for his study on Automatic Speech and Emotion Recognition in 2006, and his habilitation and Adjunct Teaching Professorship in the subject area of Signal Processing and Machine Intelligence in 2012, all in electrical engineering and information technology from TUM in Munich/Germany. He is Professor of Artificial Intelligence in the Department of Computing at the Imperial College London/UK, where he heads GLAM, the Group on Language, Audio & Music; Full Professor and head of the ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing at the University of Augsburg/Germany; and CEO of audEERING. He was previously full professor and head of the Chair of Complex and Intelligent Systems at the University of Passau/Germany. Professor Schuller is Fellow of the IEEE, Golden Core Member of the IEEE Computer Society, Senior Member of the ACM, President-emeritus of the Association for the Advancement of Affective Computing (AAAC), and was elected member of the IEEE Speech and Language Processing Technical Committee. He (co-)authored 5 books and more than 800 publications in peer-reviewed books, journals, and conference proceedings, leading to more than 25,000 citations overall (h-index = 73). Schuller is general chair of ACII 2019, co-Program Chair of Interspeech 2019 and ICMI 2019, repeated Area Chair of ICASSP, and former Editor in Chief of the IEEE Transactions on Affective Computing, next to a multitude of further Associate and Guest Editor roles and functions in Technical and Organisational Committees.