Underspecification and fairness in machine learning (ML) applications have
recently become two prominent issues in the ML community. Acoustic scene
classification (ASC) applications have so far remained largely unaffected by this
discussion, but are increasingly being used in real-world systems where
fairness and reliability are critical aspects. In this work, we argue for the
need for a more holistic evaluation process for ASC models through disaggregated
evaluations. This entails taking into account performance differences across
several factors, such as city, location, and recording device. Although these
factors play a well-understood role in the performance of ASC models, most
works report single evaluation metrics aggregated over all strata
of a particular dataset. We argue that metrics computed on specific
sub-populations of the underlying data contain valuable information about the
expected real-world behaviour of proposed systems, and their reporting could
improve the transparency and trustability of such systems. We demonstrate the
effectiveness of the proposed evaluation process in uncovering
underspecification and fairness problems exhibited by several standard ML
architectures when trained on two widely-used ASC datasets. Our evaluation
shows that all examined architectures exhibit large biases across all factors
taken into consideration, and in particular with respect to the recording
location. Additionally, different architectures exhibit different biases even
though they are trained with the same experimental configurations.
ASC has been established as a central task of artificial auditory intelligence, as exemplified by its prominent place in the DCASE challenge and workshop series DCASE2017challenge, Mesaros2018_TASLP and a generally broad accumulation of literature Liu1998, barchiesi2015acoustic, Rakotomamonjy2017, Kun2017, ren2018deep.
Overall, model performance has substantially improved through the years, and datasets have accordingly evolved to accommodate new challenges by incorporating factors shown to impact model performance.
For example, the exact geographical location of the recordings was identified as an important factor early on, with datasets accordingly adapted by keeping data from the same location in the same partitions DCASE2017challenge, Mesaros2018_TASLP.
The TUT Urban Acoustic Scenes 2018 Mobile dataset additionally introduced the recording device as a separate factor Mesaros2018_DCASE, with the development set consisting of multiple recording devices, and the evaluation set including an extra, unseen device.
Finally, the TAU Urban Acoustic Scenes 2019 dataset highlighted the importance of the city of origin by introducing data from two additional cities in the evaluation set Mesaros2018_DCASE.
In general, the community is aware of the influence that recording devices and location have on model performance Mesaros2019, heittola2020acoustic.
Most works approach these factors from the perspective of domain mismatch ben2007analysis: different cities, locations, and devices result in slightly different input representations, and this difference needs to be accounted for to improve overall performance.
Several approaches have been proposed to mitigate the problem, largely drawing from the wide literature of domain adaptation techniques ben2007analysis adapted for the ASC problem Gharib2018, drossos2019_unsupervised, ren2020caa, or specifically taking steps to mitigate the effects of city and device Chen2019, Komider2019.
In this work, we adopt a different perspective: we argue that those factors deserve a prominent place in the evaluation of ASC systems as they reveal important insights about the behaviour of trained models.
To do that, we adopt the language of recent works in the ML fairness literature.
In particular, we propose disaggregated evaluations, a concept highlighted by mitchell2019model as a means to expose the effects that these underlying factors have on system performance.
Disaggregation, which corresponds to breaking down an evaluation to more fine-grained levels of analysis, can be done both in a unitary way (how performance is affected by each factor independently) and in an intersectional way (how performance is affected by combinations of factors).
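As a concrete illustration, unitary and intersectional breakdowns amount to grouping per-sample results by the relevant metadata before averaging. The sketch below uses pandas with hypothetical column names, not the datasets' actual metadata schema:

```python
import pandas as pd

# Hypothetical per-sample results; column names are illustrative,
# not the datasets' official metadata schema.
df = pd.DataFrame({
    "city":    ["barcelona", "barcelona", "helsinki", "helsinki"],
    "device":  ["a", "b", "a", "b"],
    "correct": [1, 0, 1, 1],  # 1 if the model's prediction was correct
})

aggregated = df["correct"].mean()               # single overall accuracy
unitary = df.groupby("city")["correct"].mean()  # one score per city
intersectional = df.groupby(["city", "device"])["correct"].mean()  # city x device
```

An aggregated number hides exactly the per-group variation that the two `groupby` views expose.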
For the task of ASC, we consider the three aforementioned factors, namely location, city, and device, as warranting a closer investigation.
This choice is primarily motivated by availability (the existing metadata is already there) and community awareness (past works take them into account).
The rest of this document is organised as follows.
In Section 2, we formulate our research question by discussing fairness and underspecification for ASC.
Our methodological approach, including a description of the data and DL architectures used in our experiments, is outlined in Section 3.
The results and a discussion of our disaggregated evaluations are presented in Section 4.
Finally, we summarise our findings in Section 5.
2 Fairness and underspecification in ASC
The success and increased usage of ML, and in particular DL, systems in commercial applications has led to rising concerns towards discriminating biases exhibited by ML applications, for instance based on race Wang_2019_ICCV.
Especially in the case of DL, a lack of interpretability can often be observed Burkart2021, thus posing additional challenges to discover and mitigate said biases.
Even though ASC models are not widely considered high-risk applications, their increasing usage in smart city bello2018sound, security radhakrishnan2005audio, elderly monitoring megret2010immed, and autonomous driving nandwana2016towards applications means they may soon be (or already are) part of critical decision-making systems, thus making fairness a critical consideration for these algorithms.
Of the three factors, the recording device is perhaps the most benign; it is hard to justify why an ASC system that only works for specific devices should raise ethics concerns, although low-income groups could be excluded if data are only collected with high-end equipment.
On the other hand, city and location (which corresponds, e.g., to specific neighbourhoods) pose potentially bigger problems; a security application should work equally well for all citizens irrespective of where they reside, and autonomous driving systems should maintain a standard of performance irrespective of where the vehicle currently is.
There is already a rich body of work in the social sciences discussing inequality across neighbourhoods in income, health, and other socioeconomic factors wen2003poverty, which an unreliable system may inadvertently exacerbate.
This could have adverse effects against people living in those neighbourhoods, and may disproportionately affect minorities in demographically segregated communities.
Therefore, we anticipate that explicitly communicating disaggregated performance with respect to all three factors would enhance trustability in ASC systems used in real-life environmental sensing applications.
Disaggregated evaluations can also be viewed under the perspective of recent research on the underspecification of ML architectures d2020underspecification, which corresponds to the fact that several architectures yielding similar in-domain performance nevertheless exhibit different behaviour during system deployment.
This undesired property may have negative consequences on the reliability and trustability of ASC systems.
For example, if a person using an ASC system observed substantially different performance when visiting different neighbourhoods of the same city, they might eventually lose their trust in system performance and stop using it.
As ASC architectures increasingly find their way into more real-life applications, the need to address this issue becomes more pressing.
Our evaluation reveals that different architectures yielding almost equivalent performance in standard aggregated evaluations exhibit different behaviour across different sub-populations of the datasets examined herein, thus illustrating that underspecification is also a problem for ASC applications.
This shows that disaggregated evaluations can be a useful tool for practitioners that need to select among a pool of candidate models.
3 Methodological approach
Our approach consists of the following steps.
First, we train several DNN models on the training set of each of the datasets examined here.
Each model is trained for 60 epochs using SGD with a Nesterov momentum of 0.9, a learning rate of 0.001, and a batch size of 64.
For all experiments, we use log Mel spectrograms with 64 Mel bands, a window size of 32 ms, and a hop size of 10 ms.
These hyper-parameters were fixed a priori for all models and not optimised during our experiments.
Each model is trained with 5 random seeds to mitigate the effect of randomness.
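As a minimal sketch, the fixed training configuration above might be expressed in PyTorch as follows; the linear model is only a stand-in, not one of the architectures evaluated here:

```python
import torch

# Stand-in model: a single linear layer mapping 64 Mel bands to 10 scenes.
# This is NOT one of the paper's architectures, just a placeholder so the
# optimizer configuration can be shown end to end.
model = torch.nn.Linear(64, 10)

# Fixed a-priori hyper-parameters from the text: SGD with Nesterov
# momentum 0.9, learning rate 0.001.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.001, momentum=0.9, nesterov=True
)

EPOCHS, BATCH_SIZE = 60, 64
```

The same optimizer and schedule would be reused unchanged across all architectures and random seeds.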
Our experiments are conducted on the TUT Urban Acoustic Scenes 2018 and TUT Urban Acoustic Scenes 2018 Mobile datasets Mesaros2018_DCASE, which we henceforth refer to as TUT-Urban and TUT-Mobile for brevity.
Both datasets contain data from 10 acoustic scenes recorded across several locations of 6 different European cities.
TUT-Urban contains 8640 stereo samples recorded at 48 kHz with a single high-quality recording device (Soundman OKM II Klassik/studio A3), whereas TUT-Mobile additionally contains 720 samples from each of two additional low-quality recording devices (Samsung Galaxy S7 and iPhone SE).
In the case of TUT-Mobile, all data are stored as mono recordings at 16 kHz.
Table 1: Aggregated and unitary disaggregated evaluations considering different cities in isolation. For the aggregated evaluation, we show accuracy [%] for all test data for TUT-Urban and TUT-Mobile. For the unitary disaggregated evaluations, we show accuracy [%] on different cities for each architecture, as well as its standard deviation (σ) over the different cities. Results are averaged across 5 different runs.
Table 2: Intersectional evaluations considering recording device and city in combination for the TUT-Mobile dataset. We show accuracy [%] for each combination of city and device. Cities are Barcelona (B), Helsinki (H), London (L), Paris (P), Stockholm (S), and Vienna (V). The best performing architecture value per city and device is marked in boldface. Results are averaged across 5 different runs.
All models are first evaluated in the standard, aggregated way by computing a single accuracy value, and subsequently assessed using unitary and intersectional evaluations as described below.
We begin with unitary evaluations, where each factor is considered in isolation.
For city and device, where we have only 6 and 3 different groups, respectively, we simply report the accuracy for each group.
The location factor is more complicated, as we have 83 different locations in the test set, thus making it hard to visualise results.
Moreover, whereas for each city and device we have all classes available, each location corresponds to exactly one class, thus making accuracy an inappropriate metric for that evaluation.
To overcome these problems, we compute the F1 score for each location, F1^l, and normalise it by the overall F1 score for that architecture.
Intersectional evaluations are in turn conducted by taking into account two or more factors.
Due to space limitations, we only consider results for two pairs of factors: the variation of cities across different devices and the variation across locations in different cities.
For the first case, we report the accuracy for each combination of factors.
For the latter case, we compute the F1^l score for each location as in the unitary case, but now normalise it by the F1 score of the corresponding city, F1^c.
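A possible implementation of the normalised per-location F1 score, using scikit-learn on toy data; the labels, predictions, and location names below are invented for the example:

```python
from sklearn.metrics import f1_score

# Toy (y_true, y_pred) pairs per location; purely illustrative.
locations = {
    "loc1": ([0, 0, 1], [0, 0, 1]),
    "loc2": ([1, 1, 0], [1, 0, 0]),
}

# Pool all samples to get the architecture's overall F1 score.
y_true = sum((t for t, _ in locations.values()), [])
y_pred = sum((p for _, p in locations.values()), [])
overall_f1 = f1_score(y_true, y_pred, average="macro")

# Normalised score: per-location F1 divided by the overall F1, so a
# value below 1 flags a location that underperforms the aggregate.
normalised = {
    loc: f1_score(t, p, average="macro") / overall_f1
    for loc, (t, p) in locations.items()
}
```

In this toy setting `loc1` is classified perfectly and lands above 1, while the error on `loc2` pushes its ratio below 1.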
As DL architectures, we use 5 standard DNN models that belong to different architecture families.
TDNN: we further employ a TDNN architecture.
First introduced by waibel1989phoneme with the aim of learning temporal relationships, TDNNs have recently seen great success in the field of speaker identification snyder2018x.
Our TDNN architecture is identical to the DNN architecture described in snyder2018x.
CNN6, CNN10, and CNN14: the final three architectures considered in our experiments are CNN-based and were recently introduced by kong2020panns in the context of audio pattern recognition.
The three architectures have a total of 6, 10, and 14 layers, respectively, excluding the pooling layers after the convolutional layers, and take Mel-spectrograms as inputs.
The final two layers of each network are fully connected.
4 Results and discussion
Our unitary evaluation results for different cities are presented in Table 1, along with the standard aggregated metrics.
We show model accuracy for each factor in isolation, and also report the standard deviation over all factors.
F1 results for different locations in TUT-Urban are shown in Figure 1, where we show box-and-whisker plots of the normalised F1 scores.
We omit unitary results for different devices as they can be inferred from the intersectional results in Table 2; as expected, all architectures perform best on the high-quality device A, for which we also have the most data, while doing worse on the lower-quality and less populous devices B and C.
Location results on TUT-Mobile are also omitted due to space limitations but exhibit the same trend as those on TUT-Urban.
Table 1 can be read both horizontally, thus emphasising which model works best for a specific factor, and vertically, where we are interested in how a specific model performs across different factors.
Overall, CNN6 shows the strongest performance, followed by CNN10 and CNN14, with TDNN and FFNN performing substantially worse.
Furthermore, CNN6 exhibits relative stability across both cities and devices.
However, it is not the best choice for all cities; in both datasets, CNN10 outperforms it for London and Vienna, and CNN14 for Paris, though the latter only marginally.
Of more interest is the vertical interpretation of Table 1.
We observe that different architectures exhibit a different ordering when it comes to performance per city.
In TUT-Urban for example, different architectures yield their best performance on different cities: FFNN on Vienna, TDNN on Barcelona and London, CNN6 on Stockholm, CNN10 on London and Vienna, and CNN14 on Vienna.
Another interesting case is Stockholm, where CNN6 shows its best performance and TDNN its worst.
Conversely, Vienna, where FFNN, CNN10, and CNN14 show (near-)best performance for TUT-Urban, yields mediocre results for CNN6 and TDNN.
For TUT-Mobile, these results are better visualised in Figure 3, which shows the range of F1 scores per location for the different cities.
Notable differences exist; TDNN shows worse performance on Stockholm than Paris, whereas all other architectures show the opposite trend.
CNN6 and CNN10, which are almost equivalent in terms of aggregated performance, also exhibit differences, in particular for Stockholm and Vienna.
Interestingly, TDNN and FFNN deviate substantially from the other three architectures, which are more closely clustered together, indicating that models from the same family exhibit more similar behaviour.
These observations illustrate that the inductive biases introduced by each architecture manifest themselves as different behaviours on different strata of each dataset, which is in line with recent research on inductive biases ortiz2020neural, ortiz2021neural.
Figure 1 additionally shows that location is a very important factor for system performance, with some locations reaching barely half the aggregated system performance.
Such behaviour is highly undesirable because an ASC system deployed across different locations will consistently exhibit subpar performance for some of them, thereby putting equal and fair access to the service at risk.
We note that most locations seem to exhibit better-than-average performance (the normalised F1 ratio is greater than 1).
This is caused by the fact that the worst performing locations happen to have more samples, thus having a bigger influence on aggregate performance.
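A small numeric example illustrates this effect: when the worst-performing locations carry most of the samples, the sample-weighted aggregate falls below the scores of most individual locations, so most normalised ratios exceed 1. All numbers below are invented for illustration:

```python
# Two small locations perform well; one large location performs poorly.
# The sample-weighted aggregate is pulled towards the large, weak location.
per_location_f1 = {"loc_a": 0.9, "loc_b": 0.85, "loc_c": 0.5}
n_samples       = {"loc_a": 10,  "loc_b": 10,   "loc_c": 80}

total = sum(n_samples.values())
aggregate = sum(per_location_f1[l] * n_samples[l] for l in n_samples) / total

# Normalised ratio per location: most exceed 1 even though the
# aggregate looks like a reasonable "average" score.
ratios = {l: f1 / aggregate for l, f1 in per_location_f1.items()}
```

Here the aggregate works out to 0.575, so two of the three locations sit above 1 while the heavily sampled weak location sits below it.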
Intersectional results are shown in Table 2 for the combination of city and device, and in Figure 2 for the combination of city and location.
The differences amongst cities and amongst devices were found to be significant for all architectures, using a Kruskal-Wallis omnibus H-test for each factor and architecture, respectively.
This shows that, in general, both factors have a large effect on model performance.
In addition, Table 2 and Figure 2 both show that different architectures exhibit different behaviour on different strata of the two datasets, even though they were trained with identical settings.
Overall, CNN6 again shows the strongest performance for most, though not all, combinations, followed by CNN10.
In terms of individual factors, Paris shows the biggest drop in performance when switching from device A to device B for all architectures, indicating that the domain shift introduced by different devices impacts this city more adversely.
The most interesting case is TDNN, which shows its best and worst device-A performance on London and Stockholm, respectively, but the exact opposite for device B, where the best performance is obtained for Stockholm and the worst for London.
In fact, the performance of TDNN on Stockholm is far better for device B than for device A, even though the latter has far more samples and should thus lead to better performance.
5 Conclusion
In this work, we argue for the need for disaggregated unitary and intersectional evaluations for the task of ASC.
Our proposed evaluation methodology reveals that several baseline architectures exhibit different behaviour even though they are trained with similar settings.
This illustrates that ASC models trained on the examined datasets suffer from the underspecification problem, which heavily impacts the development of reliable and trustworthy systems.
In the future, we intend to further investigate this problem under the perspective of inductive biases introduced by each architecture ortiz2021neural.
Moreover, our work raises interesting questions on the fairness of ASC applications.
The architectures examined here exhibit a bias with respect to different cities, locations, and devices.
If these architectures were deployed in a real-world setting, this would translate to non-uniform behaviour over these different factors.
This poses a risk to fair and equitable use of ML resources.
We believe this important point needs to be addressed as ASC models are being increasingly integrated in intelligent decision making systems.
Part of the work leading to this publication has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 957337, project MARVEL.