Neural Architecture Search on Acoustic Scene Classification

12/30/2019 ∙ by Jixiang Li, et al. ∙ UNSW ∙ Xiaomi

Convolutional neural networks are widely adopted in Acoustic Scene Classification (ASC) tasks, but they generally carry a heavy computational burden. In this work, we propose a lightweight yet high-performing baseline network inspired by MobileNetV2, which replaces square convolutional kernels with unidirectional ones to extract features alternately in the temporal and frequency dimensions. Furthermore, we explore a dynamic architecture space built on the basis of the proposed baseline with the recent Neural Architecture Search (NAS) paradigm, which first trains a supernet that incorporates all candidate networks and then applies the well-known evolutionary algorithm NSGA-II to discover more efficient networks with higher accuracy and lower computational cost. Experimental results demonstrate that our searched network is competent in ASC tasks: it achieves a 90.3% F1-score on the DCASE 2018 task 5 evaluation set, marking a new state-of-the-art performance while saving 25% of FLOPs compared to our baseline network.




1 Introduction

Acoustic Scene Classification (ASC) is an important task in the field of audio understanding and analysis, which classifies an audio stream into one of a set of predefined acoustic scenes. Recently, ASC has drawn increased attention from both academia and industry due to its great potential in many applications like context-aware devices [eronen2005audio], acoustic monitoring [ntalampiras2009acoustic], and assistive technologies [bugalho2009detecting]. In the early stages, many traditional machine learning methods such as GMM [mesaros2016tut], HMM [eronen2005audio], and SVM [mesaros2018detection] were applied. Now, with the great success of deep learning in computer vision and the availability of larger audio datasets, methods based on DNNs [mun2017deep], CNNs [han2016acoustic], and RNNs [vu2016acoustic] have gradually become dominant in ASC. Among them, CNN-based methods have obtained the most state-of-the-art results because of their excellent capability of learning high-level features by exploiting the time-frequency patterns of a signal [mesaros2018acoustic]. For example, Inoue et al. [inoue2018domestic] ensembled four CNN models and won DCASE 2018 task 5 [Dekkers2018_DCASE]. Hershey et al. [hershey2017cnn] compared different CNN architectures (VGG [simonyan2014very], Xception [chollet2017xception], ResNet [he2016deep], etc.) on an audio classification task, and all of them showed promising results. However, on vision tasks, most of the architectures above have been outperformed by advanced networks such as MobileNetV2 [sandler2018mobilenetv2] in terms of the number of parameters and computational cost.

To automatically obtain higher-performance networks than those designed by human experts, neural architecture search (NAS) has been adopted in many vision and NLP tasks, and a few works have applied NAS to speech tasks such as keyword spotting [veniat2019stochastic, mazzawi2019improving]. We are thus motivated to validate the applicability of NAS to the ASC task. Early NAS methods based on reinforcement learning [zoph2018nasnet] and evolutionary algorithms [real2019AmoebaNet] consume immense GPU resources to evaluate thousands of neural networks. Recent gradient-based methods [liu2019darts] require only a small number of GPUs, but they are less robust and poorly reproducible. By contrast, the one-shot [bender2018oneshot] paradigm is more promising, mainly for three reasons: robustness in generating powerful architectures, moderate consumption of GPU resources, and convenience in solving multi-objective problems (MOPs), e.g. searching for a model that trades off accuracy against computational cost. Generally, a supernet that comprises all candidate architectures is constructed and fully trained on a target dataset. Next, by inheriting weights directly from the supernet, candidates achieve good performance, from which we can separate competitive architectures from the rest. Beyond [bender2018oneshot], single-path one-shot [guo2019single] and FairNAS [chu2019fairnas] also exhibit improved training stability and more accurate ranking.

In this paper, we propose a lightweight, high-performing network inspired by MobileNetV2's inverted bottleneck block. On this basis, we also search for better architectures under the one-shot NAS doctrine. Many previous works in vision used CNNs as feature extractors, mostly with square k×k kernels, since the information has the same nature across the spatial resolution. For ASC's mel-spectrogram, however, horizontal and vertical information have different implications, namely temporal relation and frequency distribution, so square kernels are less appropriate. Based on this consideration, we adapt MobileNetV2 with unidirectional (1×k and k×1) kernels to handle each dimension's information separately. Experiments in Section 3.3 demonstrate that the proposed unidirectional convolution network performs outstandingly on DCASE 2018 task 5 [Dekkers2018_DCASE]. Further, we leverage the fair supernet training strategy [chu2019fairnas] and the NSGA-II algorithm [deb2002fast] to search for network architectures with higher accuracy and less computation. In Section 3.4 we illustrate the details of the searched network, which achieves state-of-the-art results on DCASE 2018 task 5.

2 Acoustic Scene Classification with Neural Architecture Search

2.1 Proposed Baseline Network

In view of MobileNetV2's recent success in speech applications [luo2019conv, kriman2019quartznet], we incorporate MobileNetV2's bottleneck blocks [sandler2018mobilenetv2] as the core components of our baseline network. The baseline mainly consists of three parts: a feature extractor (FE) module, a two-level gated recurrent unit (GRU) layer, and two fully-connected (FC) layers. Specifically, the FE module comprises one stem convolution layer, 20 bottleneck blocks with a 1×1-depthwise-1×1 convolution structure, and one global frequency-dimensional convolution layer. We use unidirectional (1×k and k×1) kernels to replace the original 3×3 kernels in the depthwise convolutions of [sandler2018mobilenetv2]. The configuration of this architecture is shown in Figure 1. This FE module design is based on the following considerations. First, the residual property [he2016deep] of the bottleneck block helps backpropagation and prevents vanishing gradients. Second, unidirectional kernels extract features more attentively in the temporal and frequency dimensions, and they reduce the number of parameters, which helps prevent overfitting. Last, extracting features alternately in the temporal and frequency dimensions is conducive to information flow and fusion.

Stem Conv K7×7 S1×1 F24
MB K3×1 S2×1 F32
MB K3×1 S1×1 F32
MB K1×3 S1×2 F48
MB K1×3 S1×1 F48
MB K3×1 S2×1 F64
MB K3×1 S1×1 F64
MB K1×3 S1×2 F80
MB K1×3 S1×1 F80
MB K3×1 S2×1 F96
MB K3×1 S1×1 F96
MB K1×5 S1×4 F112
MB K1×3 S1×1 F112
Global Conv K5×1 S1×1 F128
GRU (256)
MaxPooling K1×4 S1×4 (64)
Flatten & FC (512)
FC (9)
Figure 1: The proposed baseline, adapted from the MobileNetV2 bottleneck (MB), for ASC. K, S, and F denote kernel size, stride, and number of filters; the expansion rate of each MB is 6.

In the forward phase, a log mel-spectrogram transformed from the original audio is fed to the FE module; the GRU layer then receives the high-dimensional features from the extractor and outputs embeddings with rich temporal information. Finally, the FC layers process the embeddings and output scene predictions activated by a softmax function.

2.2 Neural Architecture Search

Based on our baseline network, we harness NAS methodology to search for architectures with better performance. Concretely, we let our search targets be the kernel sizes and expansion rates in each inverted bottleneck block, which sums up to an enormous search space. Our supernet thus consists of this searchable feature extractor along with the original GRU and FCs. We then train the supernet on the target dataset to obtain an evaluator for a good ranking among candidate models. Next, we use NSGA-II [deb2002fast], an advanced evolutionary algorithm that features multi-objective optimization, to seek promising architectures with higher accuracy and less computation. Final competent architectures are selected from the Pareto-front obtained by NSGA-II and are trained from scratch. The overall NAS pipeline is illustrated in Figure 2.

Figure 2: The NAS pipeline consists of supernet training [chu2019fairnas] and NSGA-II Searching [deb2002fast].

2.2.1 Search Space

We now describe the details of the search space for the FE module, where one of several bottleneck blocks with different settings can be chosen at each layer. We allow an expansion rate in {3, 6} for the first 1×1 convolution and a kernel size in {3, 5, 7} for the depthwise convolution in each block, as shown in Figure 3. For layer 17 we make an exception due to its large downsampling rate: its expansion rate ranges in {3, 6, 8} and its kernel size in {5, 7}. The output filters of each block and the downsampling positions are fixed as in the baseline. In total, we have 20 searchable layers in the FE module, each with 6 choices of MB blocks. Hence, our search space contains 6^20 ≈ 3.7×10^15 possible architectures. It is very hard to find better architectures in such an enormous search space simply by trial and error. For convenience, we refer to the c-th choice block in layer l as B_l^c, where l ∈ {1, …, 20} and c ∈ {1, …, 6}.
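As a small illustration of the chromosome encoding and the size of this space, the following Python sketch uses our own naming (not the paper's code):

```python
import random

# Each searchable layer offers 6 candidate blocks: expansion rate in {3, 6}
# x kernel size in {3, 5, 7} (layer 17 instead uses {3, 6, 8} x {5, 7},
# which is still 6 choices). A chromosome holds one choice index per layer.
NUM_LAYERS = 20
NUM_CHOICES = 6

search_space_size = NUM_CHOICES ** NUM_LAYERS  # 6^20 = 3,656,158,440,062,976

def random_chromosome(rng=random):
    """Sample one architecture as a list of 20 choice indices in [1, 6]."""
    return [rng.randint(1, NUM_CHOICES) for _ in range(NUM_LAYERS)]

arch = random_chromosome()
assert len(arch) == NUM_LAYERS and all(1 <= g <= NUM_CHOICES for g in arch)
```

Enumerating ~3.7×10^15 architectures exhaustively is clearly infeasible, which motivates the supernet-plus-evolution pipeline below.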

Each choice block stacks a 1×1 Conv-BN-ReLU6 (with a searchable expansion rate), a unidirectional depthwise convolution (with a searchable 1×n or n×1 kernel), and a projecting 1×1 Conv-BN.

Figure 3: The searchable MobileNetV2’s bottleneck blocks.
Figure 4: Training the supernet with strict fairness as in FairNAS [chu2019fairnas]. Each colored circle (6 per layer) is a choice block.

2.2.2 Supernet Training Strategy

We construct a supernet by combining the searchable FE module with the original GRU and FC layers. To train the supernet, we use the same fairness strategy proposed in our previous work [chu2019fairnas], as illustrated in Figure 4. Specifically, given a mini-batch of training data, instead of training the supernet as a whole, we uniformly sample without replacement to obtain 6 models with no shared blocks and train them separately. That is, in the first step we randomly select one of the 6 choices in each searchable layer to construct a model, and the gradients of all parameters in this model are calculated by backpropagation. In the second step, we randomly pick one choice from the remaining unselected choices in each layer to get another model and calculate its gradients, and so on. After the sixth step, all choices in each layer have been selected and their gradients obtained. Finally, we update the trainable parameters of the supernet with all the accumulated gradients together. The training procedure is detailed in Algorithm 1. Through this method, each choice has the same opportunity to update itself on every mini-batch of data, which stabilizes the training process.

The trained supernet is then used to evaluate candidate models. In particular, we sample a candidate model from the supernet with its trained weights and evaluate it on the validation set. In this way, we can quickly obtain the approximate performance of each candidate model in the supernet without extra training. As shown in our previous work [chu2019fairnas], the performance ranking among models obtained through this fair supernet training strategy is highly consistent with that of the same models trained from scratch.

  Input: training data loader D; the number of search layers L = 20; the number of choice blocks per layer C = 6; the number of training epochs E; choice set S_l = {1, …, C} for each search layer l
  Output: the supernet with trained parameters
  for epoch = 1 to E do
     for data x, ground-truth y in D do
        clear gradients of all supernet parameters
        for c = 1 to C do
           initialize S_l = {1, …, C} for each search layer l if c == 1
           for l = 1 to L do
              randomly select an element e_l from S_l
              delete e_l from S_l
              get the choice as B_l^{e_l}
           end for
           construct model_c from (B_1^{e_1}, B_2^{e_2}, …, B_L^{e_L})
           calculate gradients of model_c's parameters using (x, y)
        end for
        update all trainable parameters of the supernet with the accumulated gradients
     end for
  end for
Algorithm 1 Supernet Training Strategy
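The sampling-without-replacement step of the fairness strategy can be sketched as follows; the function and variable names are ours, and the sketch only covers model assembly, not the actual gradient computation:

```python
import random

NUM_LAYERS, NUM_CHOICES = 20, 6  # searchable layers x choice blocks per layer

def fair_model_batch(rng=random):
    """Sketch of FairNAS-style strict fairness: for one mini-batch, build
    6 disjoint models so that every choice block in every layer is trained
    exactly once before the single supernet update."""
    # Draw one independent permutation of the 6 choices per layer;
    # taking choices in permutation order is sampling without replacement.
    columns = []
    for _ in range(NUM_LAYERS):
        perm = list(range(NUM_CHOICES))
        rng.shuffle(perm)
        columns.append(perm)
    # Model k is assembled from the k-th entry of each layer's permutation;
    # in training, gradients of all 6 models are accumulated and the
    # supernet parameters are updated once afterwards.
    return [[columns[l][k] for l in range(NUM_LAYERS)]
            for k in range(NUM_CHOICES)]

models = fair_model_batch()
# Fairness check: across the 6 models, each layer uses each choice once.
for l in range(NUM_LAYERS):
    assert sorted(m[l] for m in models) == list(range(NUM_CHOICES))
```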

2.2.3 Search Strategy

There are many ways to search, such as random search, reinforcement learning (RL), and evolutionary algorithms (EA). Here we utilize NSGA-II [deb2002fast], an efficient evolutionary algorithm, to search for promising architectures in the enormous search space. We adapt NSGA-II to our needs and describe only the differences in this section; please refer to the original paper for the remaining details (non-dominated sorting, crowding distance, tournament selection, Pareto front, etc.). An architecture is regarded as an individual and is encoded uniquely by one chromosome. Following Section 2.2.1, the chromosome of an architecture is a list of 20 elements corresponding to the 20 searchable layers, and each gene is an integer ranging from 1 to 6 corresponding to the 6 choices. Since we aim for architectures with higher accuracy and lower computation, we set the accuracy metric and the computational cost as the two objectives. A population of P = 64 architectures is evolved for I = 70 generations. For crossover, we pick two champions (architectures that defeat each opponent on the objectives) via tournament selection, then select two gene spots and swap the genes at the corresponding spots of the two champion chromosomes. For mutation, we select one to four gene spots at random and change the gene values at those spots. In the early stage of the search, the exploration ratio is 100%, i.e. the whole population is created randomly to explore the enormous search space. As the search progresses, the exploitation ratio α, i.e. the fraction of the population created by crossover and mutation, gradually increases to 80%, while the exploration ratio (1-α) decreases to 20%; this focuses the later search on better candidates near the promising architectures already found. The exploitation ratio α grows monotonically with the evolutionary iteration i from 0 to 0.8 (Eq. 1). The whole search algorithm is shown in Algorithm 2.
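The evolutionary operators can be sketched in Python. The two-spot crossover reading, the linear ramp standing in for Eq. 1, and all names are our assumptions, not the paper's exact implementation:

```python
import random

NUM_LAYERS, NUM_CHOICES = 20, 6

def exploitation_ratio(i, max_iter, cap=0.8):
    # Assumed linear ramp from 0 to the 80% cap over the run;
    # the paper's exact Eq. 1 schedule may differ.
    return min(cap, cap * i / max_iter)

def crossover(parent_a, parent_b, rng=random):
    # Swap the genes at two randomly chosen spots between two
    # tournament champions (one plausible reading of the paper).
    child_a, child_b = parent_a[:], parent_b[:]
    for spot in rng.sample(range(NUM_LAYERS), 2):
        child_a[spot], child_b[spot] = child_b[spot], child_a[spot]
    return child_a, child_b

def mutate(chromosome, rng=random):
    # Re-draw the gene value (possibly to the same value) at
    # one to four randomly selected spots.
    mutant = chromosome[:]
    for spot in rng.sample(range(NUM_LAYERS), rng.randint(1, 4)):
        mutant[spot] = rng.randint(1, NUM_CHOICES)
    return mutant
```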

  Input: population size P; max iteration I; trained supernet weights W; mutation m; crossover c; tournament selection ts [deb2002fast]; objective obj1, the accuracy on the validation set; objective obj2, the FLOPs; non-dominated sorting and crowding-distance sorting nds-cds [deb2002fast]
  Output: the best population list bp
  initialize best population list bp = ∅
  for i = 1 to I do
     chromosome list cl = ∅, population list p = ∅
     update exploitation ratio α by Eq. 1
     cl ← (1-α)·P chromosomes created randomly
     cl ← α·P chromosomes created by m(c(bp, ts))
     for each chromosome ch in cl do
        construct model from ch and W
        accuracy = obj1(model)
        flops = obj2(model)
        store (ch, accuracy, flops) into p
     end for
     p ← bp ∪ p
     bp ← top-P Pareto-optimal individuals by nds-cds on p
  end for
Algorithm 2 Search Strategy
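The Pareto-dominance filter at the heart of the selection step can be sketched as a minimal Python function. This is a plain O(n²) non-dominated filter under our own naming; NSGA-II's full nds-cds additionally ranks fronts and diversifies by crowding distance, which is not shown:

```python
def pareto_front(population):
    """Return the non-dominated individuals from a list of
    (chromosome, accuracy, flops) tuples: maximize accuracy,
    minimize FLOPs."""
    def dominates(u, v):
        # u dominates v if it is no worse on both objectives
        # and strictly better on at least one.
        return (u[1] >= v[1] and u[2] <= v[2]) and (u[1] > v[1] or u[2] < v[2])
    return [v for v in population
            if not any(dominates(u, v) for u in population)]

# Toy example: "c" is dominated by "b" (lower accuracy AND more FLOPs),
# so only "a" and "b" remain on the front.
pop = [("a", 0.90, 2.03), ("b", 0.89, 1.53), ("c", 0.88, 1.80)]
assert [x[0] for x in pareto_front(pop)] == ["a", "b"]
```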

3 Experiments and Analysis

3.1 Dataset and Data Augmentation

We evaluate our models on the task 5 dataset of the DCASE 2018 Challenge [Dekkers2018_DCASE], which contains audio recordings of 9 domestic activities. The whole dataset is divided into a development set (72984 segments) and an evaluation set (72972 segments). Each segment has four acoustic channels. We extract 40 log-mel band energies for each channel signal with a frame size of 40 ms and a hop size of 20 ms, which gives a 40×501 data matrix for each sample.
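As a quick sanity check of the stated input shape: the segments are 10 s long, and with a 20 ms hop plus center padding (an assumption on our part, as in common mel-spectrogram implementations) the frame count works out to 501:

```python
# Frame count for a framed signal: with center padding the number of
# frames is 1 + round(duration / hop); without it, edge frames are lost.
def num_frames(duration_s, hop_s, centered=True):
    n = int(round(duration_s / hop_s))
    return n + 1 if centered else n

# 10 s segments, 20 ms hop -> 501 frames, i.e. a 40 x 501 log-mel matrix.
assert num_frames(10.0, 0.020) == 501
```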

For data augmentation, because of the highly imbalanced class distribution of the dataset, we adopt the offline shuffling-and-mixing scheme of [inoue2018domestic]. In particular, we increase the number of segments of the minority classes (cooking, dishwashing, eating, other, social activity, and vacuum cleaning) to 14K and of the rest to 28K. Besides, Cutout [devries2017cutout] is applied online during training to improve regularization.

3.2 Training and Evaluation Setup

We train our models using the Adam optimizer [kingma2014adam] with a weight decay of 1e-6 on 2 GPUs. We warm the learning rate up from 0 to 0.003 over the first three epochs and keep it constant afterwards. An exponential moving average of the weights with decay 0.9986 is used. The batch size is 192.

We evaluate our models on the entire evaluation set in two ways: single-mode and ensemble-mode. In single-mode, we train one model on the augmented development set for 10 epochs and evaluate it with the last checkpoint's weights. In ensemble-mode, we use the official cross-validation folds [Dekkers2018_DCASE] and train four models, one per fold. For scoring, we calculate the macro-F1 score [Dekkers2018_DCASE] of each channel signal and then average them for the final score.
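The macro-F1 metric is the unweighted mean of per-class F1 scores; channel scores are then averaged the same way. A minimal sketch (our own helper names, toy counts):

```python
def f1(tp, fp, fn):
    """F1 score from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def macro_f1(per_class_counts):
    """Unweighted mean of per-class F1 scores, as in the DCASE metric."""
    scores = [f1(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(scores) / len(scores)

# Toy example with two classes: F1 = 0.8 and 2/3, macro-F1 = their mean.
assert abs(macro_f1([(8, 2, 2), (5, 0, 5)]) - (0.8 + 2 / 3) / 2) < 1e-9
```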

3.3 Experiments on Baseline Network

The VGG [simonyan2014very] architecture is widely used as a high-performing feature extractor by previous works such as [Han2017vgglike, tanabe2018multichannel, Weimin2019attention] that achieve good performance, but its size and depth also bring high computational cost. As a comparison, we replace our FE module with the original VGG16 while keeping the rest of the experimental settings identical to our baseline, to explore the relationship between model capacity and the ability to extract features. In this section, we train our models in single-mode.

Model                           Input (H, W)   Params (M)   FLOPs (G)   F1 Score (%)
[Weimin2019attention]-ensemble  (64, 1250)     26.65        17.98       89.1
BaselineVGG                     (40, 501)      17.24        5.99        89.4
Baseline                        (40, 501)      3.31         2.03        89.8
Table 1: Performance comparison with a different feature extractor and input size.

We can see from Table 1 that our baseline model scores 0.4% F1 higher than the comparison model, which simply replaces our FE with the original VGG16, while using about 4G fewer FLOPs and 14M fewer parameters. This suggests that a model's capacity and depth do not straightforwardly determine its ability to extract features, and it demonstrates that our proposed architecture, with unidirectional kernels applied alternately in the temporal and frequency dimensions, is efficient for the ASC task. Moreover, our baseline, a single model with a smaller input size, also outperforms [Weimin2019attention], which is based on the VGG architecture and ensembles four models. The experiments in this section show that our proposed FE architecture can serve as a lightweight, high-performance feature extractor.

3.4 Experiments on NAS

3.4.1 Search Result

Figure 5: (a) Searched architectures by NSGA-II. (b) Searched architectures by Random Search. (c) Pareto-optimal architectures from the searched architectures.

In this section, we split each category of the development set into a training set and a validation set in a 70%:30% ratio and augment the training set with the strategy in Section 3.1. We train the supernet for 10 epochs on the augmented training set using the training strategy of Section 2.2.2, then evaluate more than 4.4K candidate models on the validation set with the method of Section 2.2.3. As a comparison, we also evaluate the same number of candidates found by Random Search instead of NSGA-II.

From Figure 5(a) we can see that in the early stage some promising models and many mediocre models are explored because of the large exploration rate. As the evolution iterates, the exploitation rate increases gradually, so models near the promising ones already found are generated by crossover and mutation. In the later period of the evolution, due to the large exploitation ratio, more effort is invested in crossover and mutation, and more promising models are found. Figure 5(b) clearly shows that the search area of Random Search almost coincides with only the early search area of NSGA-II, suggesting that Random Search is relatively weak on problems with multi-objective constraints. Figure 5(c) shows the Pareto-optimal models found by NSGA-II and Random Search; NSGA-II is clearly the more powerful of the two, as it tends to find models with a higher evaluation metric and less computation.

3.4.2 Performances on DCASE 2018 Task5

Class [Dekkers2018_DCASE]¹ [inoue2018domestic]¹ [Weimin2019attention] B-s N-s N-e
¹ The entire evaluation set of DCASE 2018 task 5 was divided into an unknown-microphone set (3/7 ≈ 0.4286) and a known-microphone set (4/7 ≈ 0.5714) for each category. From the ratio of the two sets, the F1-score on the entire evaluation set can be computed from the officially published leaderboard, e.g. the F1-score of category "Absence" in [Dekkers2018_DCASE] is 88.7% = 87.7%×0.4286 + 89.4%×0.5714.
Absence 88.7 94.0 92.7 93.5 93.1 93.9
Cooking 94.9 94.5 93.8 95.7 96.7 96.1
Dishwashing 78.5 87.6 86.6 88.5 89.2 88.9
Eating 81.7 88.7 88.0 88.0 88.5 88.7
Other 40.2 57.3 58.8 58.6 59.3 60.1
Social activity 96.5 97.2 97.9 97.4 97.0 97.8
Vacuum clean 95.9 97.2 95.3 97.1 97.4 97.7
Watching TV 99.9 100.0 100.0 100.0 100.0 100.0
Working 81.5 89.4 88.4 89.3 88.9 89.7
F1 score 84.2 89.5 89.1 89.8 90.0 90.3
Table 2: Class-wise performance comparison. Note B-s denotes the single baseline model, N-s denotes the single NASC-net model, N-e denotes the ensembled NASC-net model.

Due to space limitations, we select only one searched model from the Pareto-optimal front of the 70th evolutionary iteration, with 1.53G FLOPs and 3.01M parameters (saving 25% of FLOPs and 9% of parameters compared to the baseline), though a single search run yields many promising models. We name this model NASC-net; the searched part of the model is illustrated in Figure 6. NASC-net looks unusual from an architectural perspective, and it would be almost impossible for human experts to design such an architecture, so we select it as a representative searched model and train it from scratch in single- and ensemble-mode, respectively. Its performance on the evaluation set of DCASE 2018 task 5 is shown in Table 2. The smaller single NASC-net outperforms the baseline model in macro-F1 score, suggesting that our NAS method can find lighter and more accurate models than expert design. The ensembled NASC-net further improves the F1 score over the single NASC-net, at four times the computation, and outperforms or matches the best model of DCASE 2018 task 5 on almost all scenes.
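As a quick arithmetic check of the reported savings, using the figures from Table 1 and this paragraph:

```python
# Baseline: 2.03 GFLOPs, 3.31 M params; searched NASC-net: 1.53 GFLOPs,
# 3.01 M params. The relative savings match the quoted 25% and 9%.
flops_saving = 1 - 1.53 / 2.03
param_saving = 1 - 3.01 / 3.31
assert round(flops_saving * 100) == 25  # ~25% fewer FLOPs
assert round(param_saving * 100) == 9   # ~9% fewer parameters
```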

MB K5×1 S2×1 E6 F32
MB K7×1 S1×1 E6 F32
MB K5×1 S1×1 E3 F32
MB K1×7 S1×2 E6 F48
MB K1×5 S1×1 E6 F48
MB K1×3 S1×1 E3 F48
MB K5×1 S2×1 E3 F64
MB K7×1 S1×1 E3 F64
MB K3×1 S1×1 E3 F64
MB K1×3 S1×2 E3 F80
MB K1×3 S1×1 E6 F80
MB K1×5 S1×1 E3 F80
MB K5×1 S2×1 E3 F96
MB K5×1 S1×1 E3 F96
MB K3×1 S1×1 E3 F96
MB K3×1 S1×1 E6 F96
MB K1×5 S1×4 E8 F112
MB K1×3 S1×1 E3 F112
MB K1×7 S1×1 E6 F112
MB K1×5 S1×1 E6 F112

Figure 6: The searched part of the feature extractor module. Note that K, S, E and F denote the kernel size, stride, expansion rate and filters, respectively.

4 Conclusions

In this paper, we present a novel and efficient network for ASC tasks whose feature extractor is inspired by MobileNetV2. We show that the proposed network achieves both high performance and low computation. On this basis, we apply neural architecture search, using the fair supernet training strategy and the NSGA-II algorithm, to discover an even better architecture. The searched network obtains a new state of the art on the DCASE 2018 task 5 dataset with much lower computation. We conclude that NAS is applicable in the field of ASC and potentially in other acoustic domains.