Tiny, always-on and fragile: Bias propagation through design choices in on-device machine learning workflows

by Wiebke Toussaint, et al.
Delft University of Technology

Billions of distributed, heterogeneous and resource constrained smart consumer devices deploy on-device machine learning (ML) to deliver private, fast and offline inference on personal data. On-device ML systems are highly context dependent, and sensitive to user, usage, hardware and environmental attributes. Despite this sensitivity and the propensity towards bias in ML, bias in on-device ML has not been studied. This paper studies the propagation of bias through design choices in on-device ML development workflows. We position reliability bias, which arises from disparate device failures across demographic groups, as a source of unfairness in on-device ML settings and quantify metrics to evaluate it. We then identify complex and interacting technical design choices in the on-device ML workflow that can lead to disparate performance across user groups, and thus reliability bias. Finally, we show with an empirical case study that seemingly innocuous design choices such as the data sample rate, pre-processing parameters used to construct input features and pruning hyperparameters propagate reliability bias through an audio keyword spotting development workflow. We leverage our insights to suggest strategies for developers to develop fairer on-device ML.




1. Introduction

From earphones to embedded cameras, billions of tiny devices across the globe deploy on-device machine learning (ML) for inference on personal data. For example, over the past years major technology companies have incorporated on-device inference in smart speakers and smart phones to constantly process new voice signals collected by microphones (apple2021). These applications can carry considerable consequences when they fail. Consider for example a voice assistant on a smart speaker used in an elderly care application to call emergency response in case of a health crisis (askmybuddy2022). The application is activated through an on-device keyword spotting system that identifies a phrase (e.g. "call help") in the user’s voice signal. If the keyword spotting system is biased (e.g. discriminates against users based on their age or sex), this directly impacts the reliability of the application, and consequently a user’s access to critical medical help.

In this paper we study bias in on-device ML. Rising concerns about digital privacy and personal data protection (Naeini2017Privacy) are motivating a shift in data processing and ML from cloud servers to end devices (Chen2019Deep). On-device ML is an emerging computing paradigm that makes this shift possible (banbury2020benchmarking). In contrast to ML on centralized cloud-servers, on-device ML processes data directly on the device that collected them. This has important gains for privacy: if the data never leaves the device, the potential for unsolicited use or abuse by third parties is greatly reduced. Additionally, by eliminating data transfer during inference, on-device ML enables instantaneous, continuous and offline data processing, making it possible to operate devices in an always-on mode. However, while the cloud offers limitless computing resources, on-device ML needs to account for the inherent hardware constraints of end devices: limited memory, compute and energy resources. Interventions in the ML workflow aim to overcome these constraints while retaining predictive accuracy (Dhar2021Ondevice). On-device ML is further characterised by heterogeneous devices, diverse users and unknown usage environments, which make the performance of on-device ML highly context dependent.

Given their context dependence and ubiquitous nature, developing systems with fairness in mind ought to be an important priority for on-device ML developers. A growing body of research highlights that bias is a source of unfairness in ML systems deployed for natural language processing (Bolukbasi2016Man), gender classification (Buolamwini2018Gender), face recognition (Raji2020savingface) and automated speech recognition (koenecke2020racial; Tatman2017Youtube). On-device ML is used for similar tasks, and leverages algorithmic approaches and data processing techniques from ML. This gives reason to suspect bias as a cause of concern in on-device ML. The resource constrained nature of on-device ML presents additional reasons to anticipate bias: the inherent constraints of end devices make the development of on-device ML a complex technical undertaking that requires mastery of hardware, software and data processing technologies. Developers are faced with a large number of design decisions to choose interventions that overcome these hardware constraints. These choices and unpredictable operating contexts can result in unexpected performance disparities between user groups (Toussaint2021Characterising).

The goal of this work is to study the propagation of bias through design choices in the on-device ML development workflow. Our paper is the first study of bias in on-device ML settings, and makes the following contributions:

  1. We motivate reliability bias as a source of unfairness in on-device ML arising from disparate device failures across demographic groups, and quantify metrics to evaluate it.

  2. We identify complex and interacting technical design choices in the on-device ML workflow that can lead to disparate performance across user groups, and thus reliability bias.

  3. We conduct empirical experiments for an audio keyword spotting task to show that the design choices that we identified (e.g. light-weight architectures, the data sample rate, pre-processing parameters of input features, and pruning hyper-parameters for model compression) propagate reliability bias through the on-device ML workflow.

Taken together, our findings caution that seemingly innocuous design choices in the on-device ML workflow can have major consequences for the propagation of reliability bias. Our work highlights that developers and the decisions they make have an important role to play to ensure that the social requirement of unbiased on-device ML is realized within the constrained on-device setting. Based on our findings, we provide design recommendations and practical strategies to help developers navigate the gap between technical choices, deployment constraints, accuracy and bias.

The paper starts with a review of related work in Section 2. We then present an overview of on-device ML and the design choices arising during development in Section 3. In Section 4 we define and quantify reliability bias, and introduce an empirical case study on audio keyword spotting in Section 5. We present our empirical results in Section 6, and make recommendations for fairer on-device ML in Section 7. Finally, we discuss the implications of our work for the development of fair on-device ML in Section 8 and conclude in Section 9.

2. Related Work

Fairness and Bias in Machine Learning. The algorithmic fairness literature has focused predominantly on studying bias in ML systems for classification tasks, with a particular view towards the proliferation of decision-making systems that increasingly dominate public life (Mehrabi2019Survey). The drive to quantify bias in decision-making systems has been accompanied by rigorous debate on the ability of bias-measuring fairness metrics to produce fairer outcomes. Verma and Rubin (verma2018fairness) broadly categorise fairness metrics into statistical (parity) measures, similarity-based measures and causal reasoning. Additionally, fairness definitions are categorised as measuring individual or group fairness (Mehrabi2019Survey). Individual fairness metrics require that similar people are treated similarly, while group fairness metrics require that different groups are treated similarly. Jacobs and Wallach (Jacobs2021Measurement) frame fairness as a contested construct, and critique the ability of quantitative, parity-based metrics to capture the substantive nature of fairness in relation to notions of justice. Wachter et al. (wachter2021bias) also promote the need for fairness metrics to support substantive equality in order to meet the objectives of European non-discrimination law. They categorise fairness metrics as bias preserving or bias transforming, based on the metric’s treatment of historic biases propagated, for example, through data labelling decisions which can result in label bias.

Sources of Unfairness in the Machine Learning Workflow. From the perspective of statistical learning problems, fairness is influenced by bias in the training data, the predictive model and the evaluation mechanism (Mitchell2021Algorithmic). The engineering and design nature of on-device ML elevates additional considerations for fairness: a) system performance, and consequently also fairness, are influenced by design decisions (holstein2019improving); b) the fairness of a component cannot be considered in isolation but must be considered within the evolving and dynamic system in which it is incorporated (Chouldechova2018Frontiers); and c) on-device ML systems inherently contain feedback loops (Chouldechova2018Frontiers), as people buy devices that generate data which is used to train algorithms that are deployed on devices that are used by people. Bias and other sources of unfairness are intertwined within the feedback loops (Mehrabi2019Survey). In this paper we investigate unfairness due to bias arising from design choices and system composition in on-device ML.

Societal Impact of Design Choices. Holstein et al. (holstein2019improving) have observed that developers can feel a sense of unease at the societal impacts that their technical choices have, while Toussaint et al. (Toussaint2020design) have shown that early collaboration between clinical stakeholders and AI developers is important to guide design decisions to support social objectives within the public health sector. Dobbe et al. (Dobbe2021Hard) examine the impact of design choices on safety in AI systems for socio-technical decision-making in high-stakes social domains. They argue that socio-technical gaps arise in AI systems when technical functions do not satisfy the social requirements of an AI system. Drawing on this perspective, we consider the design decisions that arise in the inherently constrained on-device ML context, and examine the extent to which a relatively comprehensive set of design choices can support the social requirement for unbiased on-device ML.

Emergence of Bias in System Composition. The compound effect of multiple decisions on bias has been studied in ML pipelines (Bower2017Fair) and for classification systems built from multiple classifiers (Dwork2019Fairness). In on-device ML settings, trained ML models undergo multiple post-processing steps to overcome resource constraints for on-device deployment and distribution shifts due to context heterogeneity. Some of these post-processing steps, like domain adaptation (Singh2021Fairness) and model compression (Hooker2020Characterising), can be biased. Rather than looking at the compound effect of multiple algorithmic decisions, we consider the propagation of bias through the different processing stages in the on-device ML development pipeline. We present an overview of these processing stages and design choices in the next section.

3. On-Device Machine Learning Systems

In this section we provide an overview of the development workflow for on-device ML systems, and highlight the various constraints, intervention strategies and design choices that a developer encounters while deploying such systems in practice.

3.1. Technical Development Workflow for On-device ML

The key processing steps in the on-device ML development workflow are model training, interventions, and inference as shown in Figure 1 and described below.

Figure 1. On-device machine learning development pipeline

Training. The dominant approach for developing on-device ML is to delegate resource-intensive model training to the cloud, and to deploy trained and optimized models to devices (Dhar2021Ondevice). The approach for training models is similar to typical ML pipelines: input data is gathered and undergoes a number of pre-processing operations to extract features from it. Thereafter, ML models are trained, evaluated and selected after optimizing a loss function on the data. Pre-trained models can also be downloaded and used if training data or training compute resources are not available.

Interventions. The key differences between on-device ML and cloud-based ML development arise due to the low compute, memory and power resources of end devices (Dhar2021Ondevice). To enable on-device deployment of the trained model, various interventions are needed to optimize the model and its data processing pipeline. Common interventions include techniques such as model pruning, model quantization, or input scaling; all of which are aimed at optimizing device-specific performance metrics such as response time or latency (banbury2021mlperf), memory consumption (Han2016deep), or energy expenditure (Yang2018netadapt) with minimal impact on the model’s accuracy. We elaborate on these intervention approaches in §3.2.

Inference. Once deployed, the trained and optimized model is used to make real-time, on-device predictions. On-device inference performance is determined by the model training process, from data collection to model selection, and the real-time sensor data input, but also by deployment constraints and interventions applied to the model.

3.2. Design Choices in the On-device ML Development Workflow

Having provided an overview of the on-device development workflow, we now discuss the key design choices that a developer has to make in this workflow. We first explain the constraints of on-device ML that necessitate these design choices, and thereafter discuss the various interventions that developers can take to satisfy these constraints. We also highlight how these interventions could impact the accuracy and bias of on-device ML models.

Deployment Constraints. On-device ML development needs to take into account the limited memory, compute and energy resources on the end devices (Dhar2021Ondevice). The available storage and runtime memory on a device limits the size of the ML models that can be deployed on it. The execution speed of inferences on the device is directly tied to the available compute resources. Moreover, the amount of computations required by a model has a direct relation to its energy consumption; given that many end devices are battery powered with limited energy resources, it becomes imperative that ML models operate within a reasonable energy budget. In addition to these resource constraints, on-device ML also has to deal with variations in the hardware and software stacks of heterogeneous user devices (banbury2020benchmarking). For instance, prior research (mathur2018using) has shown that different sensor-enabled devices can produce data at different sampling rates owing to their underlying sensor technology and real-time system state. Such variations can impact the quality of sensor data that is fed to the ML model, which in turn can impact its prediction performance.

Interventions. Research in on-device ML is largely concerned with overcoming these constraints and satisfying hardware-based performance metrics while achieving acceptable predictive performance (Dhar2021Ondevice). Prior works have developed interventions to overcome memory and compute limitations, such as weight quantization (Han2016deep) and pruning (Liu2020pruning). Other approaches such as input filtering and early exit (Huang2018multiscale), partial execution and model partitioning (Dey2019embedded) allow for dynamic and conditional computation of the ML model depending on the available system resources. Another commonly used alternative to satisfy resource constraints is to design lightweight architectures that reduce the model footprint (Yang2018netadapt; Cai2020tinytl). Finally, solutions have been proposed to make ML models robust to different resolutions of the input data (montanari2020eperceptive), which is key to dealing with sampling rate variations in end devices. Common to all these interventions is that they trade off a model’s resource efficiency against its prediction performance. For example, model pruning or the use of lightweight neural architectures can result in a model with a smaller memory footprint and faster inference speed, but this comes at the expense of a slight accuracy degradation (Yang2018netadapt; Liu2020pruning; Cai2020tinytl).

Design choices. To build on-device ML, developers need to navigate deployment constraints and interventions alongside ML training and deployment. This is technically challenging, and charges developers with the responsibility to take design actions and make design choices at each development step.

Figure 2. Decision map of design choices in the on-device ML workflow.

We visualize some of the key design choices as a decision map in Figure 2. The availability of training data is a logical starting point during development, as it determines whether a new model can be trained, or if a pre-trained model must be downloaded. Once a developer commits to the design action of training a new model, they are confronted with design choices to select an algorithm, hyper-parameters, input features, pre-processing parameters and a data sample rate. After training or downloading the model, the developer needs to determine if it fits within the memory, compute and power budget. If it does, they can deploy the model to make predictions. If the model does not fit within the hardware budget, the developer must take design actions to optimize the model and reduce its resource requirements. This can be done through interventions like training a more light-weight architecture or compressing the model. These choices present further sub-choices, for example model compression can be done with pruning, quantization or both. Each design choice modifies the model, and has the potential of introducing bias in its predictions.
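The branching logic of the decision map described above can be sketched in a few lines of Python. This is only an illustration of the flow in the text: the `ResourceProfile` class, its fields and units, and the `plan_design_actions` function are hypothetical names introduced here, not part of the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceProfile:
    """Hypothetical resource description; fields and units are assumptions."""
    memory_kb: float
    compute_mflops: float
    power_mw: float


def fits_budget(model: ResourceProfile, device: ResourceProfile) -> bool:
    """A model fits if it stays within every one of the device's budgets."""
    return (model.memory_kb <= device.memory_kb
            and model.compute_mflops <= device.compute_mflops
            and model.power_mw <= device.power_mw)


def plan_design_actions(has_training_data: bool,
                        model: ResourceProfile,
                        device: ResourceProfile) -> List[str]:
    """Return the design actions implied by the decision map, in order."""
    actions = ["train new model" if has_training_data
               else "download pre-trained model"]
    if fits_budget(model, device):
        actions.append("deploy model")
    else:
        # Sub-choices at this point: light-weight architecture,
        # or model compression (pruning, quantization, or both).
        actions.append("reduce resource requirements")
    return actions
```

Each returned action corresponds to a node at which the developer makes further design choices, and each of those choices can modify the model's behaviour across user groups.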

4. Bias and Fairness in On-device ML

In contrast to large-scale decision-making systems running on centralized cloud servers, on-device ML systems are distributed on billions of personal, decentralised, low resource devices that continuously capture and monitor individuals or groups of people and the environment. Considerations of fairness in on-device ML should reflect the personal and heterogeneous nature of devices, and account for the distribution and hardware constraints of on-device ML.

There are key differences in how the function of ML is conceptualized in decision-making systems and systems of distributed devices. Predictive decision-making systems position the social (or business) objective of the system, human actors subjected to the system, and the decision space available to institutional decision-makers interacting with the system as central to the overarching function of the system (Mitchell2021Algorithmic). By contrast, in systems of distributed devices, ML functions mechanistically, as a technical component constructed from and activated by personal human data (e.g. biometrics) collected with sensors. Where fairness considerations in decision-making systems are influenced by the question "what is the consequence of the predictive outcome?", fairness in on-device ML systems should be guided by the question "what is the consequence of component failure?". This component view of machine learning systems diminishes neither their significance nor their potential to inflict harm, as prediction outputs of on-device ML systems trigger other applications and even physical systems where failure can be consequential (Nushi2018Towards).

4.1. The Consequence of Failure: Framing Reliable Performance with Fairness in Mind

Device components, such as the sensor, the battery or the operating system, work similarly for all users, irrespective of their individual attributes. This is important, as component performance affects the reliability of the device. If ML components have disparate performance across demographic user groups, device failures will be systematic and result in disparate reliability across demographic groups. We define reliability bias as systematic device failures due to on-device ML performance disparities across user groups. Given the potential harms associated with reliability bias in on-device settings, it is an important aspect of fairness.

When an ML model functions as a technical component that processes personal data contained on a single hardware device, parity-based metrics are, despite their aforementioned criticisms, useful measures of reliability bias, provided that data labels are exactly known and undisputed. In many applications of on-device ML, such as wake-word detection, keyword spotting, object detection and speaker verification, this is the case. In applications where labels are ambiguous and subjective, such as emotion and intent recognition, the labelling process itself must be scrutinized as a source of bias and parity-based metrics should not be applied blindly.

In this paper we investigate on-device ML applications without label bias. While reliability bias is only one objective measure of fairness, unbiased applications are a step in the right direction towards fair on-device ML. Our aspiration for unbiased on-device ML is then that applications perform reliably (i.e. within a tolerable performance range under all operating conditions) for all users. As it is difficult to measure real-time operating performance across billions of personal devices, we constrain our investigation of reliability bias in this paper to evaluating the effect of concrete design choices on group fairness during on-device ML development.

4.2. Quantifying Reliability Bias

We consider an on-device ML model a reliable device component for a group if the group’s predictive performance equals the model’s overall predictive performance across all groups. If a model performs better or worse than average for a group, we consider it to be biased, showing favour for or prejudice against that group. Both favouritism and prejudice increase reliability bias. We want to operationalize reliability bias with a metric that captures these definitions and penalizes favouritism and prejudice equally. Additionally, the metric should be able to score models as being more or less biased, and should consider positive and negative prediction outcomes. Given these requirements, we first define the bias of a model with respect to a group $g$ as:

$$ b_g = \log\left(\frac{m_g}{m}\right) \tag{1} $$

where $m_g$ is the performance metric computed for data samples belonging to group $g$, and $m$ is the same metric computed for all samples in the test set. $b_g$ is 0 when a model is unbiased towards group $g$, negative when it performs worse than average and positive when it performs better than average for the group. The magnitude of the metric is equal for a performance ratio and its inverse, as $|\log(x)| = |\log(1/x)|$. This has intuitive appeal that supports the interpretability of the metric: $b_g$ is equal in magnitude but has opposing signs for groups that perform half as well and twice as well as average. Given the group bias scores, reliability bias is the sum of absolute score values across all groups:

$$ B = \sum_{g} \left| b_g \right| \tag{2} $$

In this paper we assume that all groups are equally important. The metric is thus unweighted and does not take group size into consideration. $B$ has a lower bound of 0, and an infinite upper limit. Lower scores are preferred and signify that the performance across all groups is similar to the overall performance. We now turn towards an empirical audio keyword spotting (KWS) case study to show how design choices in the on-device ML workflow propagate reliability bias.
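The two definitions in this section can be sketched in Python as follows. This is a minimal sketch that assumes group bias is the logarithm of the ratio between a group's metric and the overall metric, which is the form implied by the properties described above (zero when equal, opposite signs for a ratio and its inverse). The natural logarithm, the function names and the requirement that metric values are strictly positive are assumptions made for this illustration.

```python
import math
from typing import Dict


def group_bias(group_metric: float, overall_metric: float) -> float:
    """Log ratio of a group's performance to the overall performance.

    Assumes both metric values are strictly positive; the log base
    (natural log here) is an assumption of this sketch.
    """
    return math.log(group_metric / overall_metric)


def reliability_bias(group_metrics: Dict[str, float],
                     overall_metric: float) -> float:
    """Unweighted sum of absolute group bias scores across all groups."""
    return sum(abs(group_bias(m, overall_metric))
               for m in group_metrics.values())


# Hypothetical scores, chosen only to illustrate the symmetry property:
# one group performs half as well as average, the other twice as well.
half = group_bias(0.2, 0.4)
double = group_bias(0.8, 0.4)
```

In this example `half` and `double` have the same magnitude and opposite signs, so favouritism and prejudice contribute equally to the summed score.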

5. A Case Study on Bias in On-device Audio Keyword Spotting

Audio keyword spotting (KWS) is one of the most dominant use cases of on-device ML (banbury2021mlperf). We use the decision map in Figure 2 to identify design choices during audio KWS development, and the metrics introduced in the previous section to show how bias can be propagated through these choices. In this section we establish speaker groups, introduce the audio KWS task, and detail our experiment design and setup.

5.1. Establishing Speaker Groups

Human speech signals exhibit variability based on physiological attributes of the speaker (hansen2015speaker). A starting point for investigating bias in on-device audio keyword spotting (KWS) is thus to investigate inference performance for speaker groups with different physiological attributes. Speaker sex is the distinction between biological and physical characteristics of male and female speakers. We consider groups based on speaker sex to characterise the impact of design choices on reliability bias during the development of an audio KWS system. In audio KWS ground truth labels are exactly known and unambiguous, eliminating label bias. We thus use the reliability bias metric defined in Equation 2.

5.2. Overview of Audio Keyword Spotting Task

An audio keyword spotting system takes a raw speech signal as input and outputs the keyword(s) present in the signal from a set of predefined keywords. First, the input signal is split into overlapping, short time duration frames using a sliding window approach. Frame length and frame step define the duration of each frame and the step size by which the sliding window is moved. A window function is then applied to each frame to reduce spectral leakage. For each frame the speech signal is transformed into log-scaled filter bank features, producing log Mel spectrograms. Optionally, the filter bank representations can be de-correlated and compressed to generate Mel Frequency Cepstral Coefficients (MFCCs). The frame length and frame step, the window function, the feature type (i.e. log Mel spectrograms or MFCCs), the filter bank dimensions and the number of cepstral coefficients are design choices during pre-processing that determine input features. We thus call them pre-processing parameters. Finally, the frame-level features are concatenated across frames and mean-normalized to form a two-dimensional representation of the speech signal which is used to train a deep neural network classifier, as described in (chen2014smallfootprint). Thereafter, the trained network can either directly be deployed on devices if they have sufficient resources to execute it, or it can be optimized to satisfy the hardware constraints by applying various interventions discussed in §3.2.
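The first two pre-processing steps, framing and windowing, can be sketched in plain Python to show how the sample rate, frame length and frame step interact. This is an illustrative sketch and not the pipeline used in the paper: the function names are hypothetical, the frame step is expressed as a percentage of the frame length, and trailing samples that do not fill a complete frame are dropped (an assumption; padding is an equally common choice).

```python
import math
from typing import List, Sequence


def frame_signal(signal: Sequence[float], sample_rate_hz: int,
                 frame_length_ms: float, frame_step_pct: float) -> List[List[float]]:
    """Split a signal into overlapping frames with a sliding window."""
    frame_len = int(round(sample_rate_hz * frame_length_ms / 1000))
    frame_step = int(round(frame_len * frame_step_pct / 100))
    frames = []
    # Stop at the last position where a full frame still fits.
    for start in range(0, len(signal) - frame_len + 1, frame_step):
        frames.append(list(signal[start:start + frame_len]))
    return frames


def hamming_window(n: int) -> List[float]:
    """Symmetric Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def apply_window(frame: Sequence[float], window: Sequence[float]) -> List[float]:
    """Multiply a frame element-wise by the window to reduce spectral leakage."""
    return [s * w for s, w in zip(frame, window)]


# One second of a synthetic 440 Hz tone at 16 kHz, 25 ms frames, 40% step.
sr = 16000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
frames = frame_signal(tone, sr, frame_length_ms=25, frame_step_pct=40)
windowed = [apply_window(f, hamming_window(len(f))) for f in frames]
```

With these settings each frame holds 400 samples. At 8 kHz the same 25 ms frame holds only 200 samples, so every downstream feature is computed from half as many samples per frame.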

5.3. Experiment Design and Setup

Objective and Research Questions. The objective of our case study is to evaluate the impact of design choices and choice variables on model accuracy and reliability bias for male and female speakers in an on-device ML pipeline of an audio KWS system. We aim to answer the following research questions within the context of the case study:

  1. How does a light-weight architecture affect reliability bias?

  2. How does the audio sample rate affect reliability bias?

  3. How do pre-processing parameters affect reliability bias?

  4. How do pruning hyperparameters affect reliability bias?

These research questions stem directly from the on-device ML development workflow presented in §3.1. More specifically, our experiments investigate design choices related to two important design actions for on-device ML: model training and model optimization. During model training, we consider the model architecture as an important design choice in on-device ML to satisfy resource constraints. Next, we study choices that affect the input features of the model, namely sample rate and pre-processing parameters. The sample rate can be seen as a deployment constraint due to hardware limitations such as microphone capabilities, or power consumption during data collection. Pre-processing parameters have been discussed in §5.2 and can be used as an intervention to reduce on-device power and compute requirements through fine-tuning. With regard to model optimization, we focus on model compression, in particular hyperparameter choices during post-training pruning. Post-training pruning reduces the number of model parameters, which reduces the storage and memory requirements of the model. Hyperparameters are well known to affect model accuracy during training, but their effect on bias during pruning is not established. We selected variable values based on values that are frequently used in the audio KWS literature (tucker2016model; he2017streaming; chen2014smallfootprint; higuchi2020stacked), and list all variables and values considered for each design choice and design action in Table 1.

| Design action | Design choice | Choice variable (unit) | Variable values |
|---|---|---|---|
| Train new model | input features: sample rate | sample rate (kHz) | 8, 16 |
| Train new model | input features: pre-processing | feature type | log Mel spectrogram, MFCC |
| Train new model | input features: pre-processing | # Mel filter banks | 20, 26, 32, 40, 60, 80 |
| Train new model | input features: pre-processing | # MFCCs | None, 10, 11, 12, 13, 14 |
| Train new model | input features: pre-processing | frame length (ms) | 20, 25, 30, 40 |
| Train new model | input features: pre-processing | frame step (% frame length) | 40, 50, 60 |
| Train new model | input features: pre-processing | window function | Hamming, Hann |
| Reduce resource requirements | light-weight architecture | - | CNN, low latency CNN (sainath2015convolutional) |
| Reduce resource requirements | model compression: pruning | final sparsity (%) | 20, 50, 75, 80, 85, 90 |
| Reduce resource requirements | model compression: pruning | pruning frequency | 10, 100 |
| Reduce resource requirements | model compression: pruning | pruning schedule | constant sparsity, polynomial decay |
| Reduce resource requirements | model compression: pruning | pruning learning rate | 1e-3, 1e-4, 1e-5 |

Table 1. Overview of design choice variables and values for the audio keyword spotting case study
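The two pruning schedules named among the pruning hyperparameters can be illustrated with a short Python sketch. The polynomial decay form below follows the widely used gradual pruning formulation, in which sparsity ramps from an initial to a final value with a polynomial easing. The exponent of 3, the step boundaries, and the reading of "pruning frequency" as the number of training steps between mask updates are assumptions of this sketch, not settings confirmed by the paper.

```python
def constant_sparsity(step: int, target_sparsity: float, begin_step: int = 0) -> float:
    """Hold the target sparsity from begin_step onwards."""
    return target_sparsity if step >= begin_step else 0.0


def polynomial_decay_sparsity(step: int, initial_sparsity: float,
                              final_sparsity: float, begin_step: int,
                              end_step: int, power: float = 3.0) -> float:
    """Ramp sparsity from initial to final between begin_step and end_step."""
    if step <= begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1 - progress) ** power


def is_pruning_step(step: int, frequency: int) -> bool:
    """Assumed meaning of pruning frequency: update masks every `frequency` steps."""
    return step % frequency == 0


# Target sparsity at each mask update for a hypothetical 1000-step pruning run.
schedule = [(s, polynomial_decay_sparsity(s, 0.0, 0.8, 0, 1000))
            for s in range(0, 1001) if is_pruning_step(s, 100)]
```

Under polynomial decay most weights are removed early in the run and the schedule flattens as it approaches the final sparsity, whereas constant sparsity applies the full target from the first pruning step.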

Dataset. We trained and evaluated our models on the Google Speech Commands (warden2018speech) dataset, which consists of 104,541 spoken keywords from 35 keyword classes such as Yes, No, One, Two, Three, recorded at a 16kHz sample rate. Every utterance was labelled with the speaker’s sex using a crowd-sourced data labelling campaign conducted on Amazon Mechanical Turk. We preserved the original train, validation and test sets of the dataset, but split them by speaker sex. 30% of the training, 32% of the validation and 29% of the test data are female speakers. During training we ensured that mini-batches have an equal balance of male and female speakers.
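One simple way to obtain mini-batches with an equal balance of two groups is to draw half of each batch from each group. The sketch below shows that idea only; it is not necessarily how the authors implemented balancing, and the function name and the toy data are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def balanced_batch(group_a: Sequence[T], group_b: Sequence[T],
                   batch_size: int, rng: random.Random) -> List[T]:
    """Draw a shuffled mini-batch with (near) equal counts from two groups."""
    half = batch_size // 2
    # Sample without replacement within a batch; across batches the smaller
    # group is effectively over-sampled relative to its share of the data.
    batch = rng.sample(list(group_a), half) + rng.sample(list(group_b), batch_size - half)
    rng.shuffle(batch)
    return batch


# Toy data with a 30/70 split, mirroring the imbalance in the training set.
female = [("female", i) for i in range(30)]
male = [("male", i) for i in range(70)]
batch = balanced_batch(female, male, batch_size=16, rng=random.Random(0))
```

Balancing at the batch level changes how often each group is seen during training, but it does not change the amount of distinct data available per group.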

Model Architectures. We trained two convolutional neural network (CNN) architectures originally proposed in (sainath2015convolutional) and later implemented in the TensorFlow framework. The architecture that we refer to as CNN consists of two convolutional layers followed by one dense hidden layer, while the low-latency CNN (llCNN) consists of one convolution layer followed by two dense hidden layers. The authors in (sainath2015convolutional) showed that the llCNN architecture, by virtue of having fewer convolution operations, is more optimized for on-device KWS.

6. Analysing the Impact of Design Choices on Bias in Audio Keyword Spotting

In this section we present the results of our experiments and analyse the impact of design choices on reliability bias during different stages of the on-device audio KWS workflow. The section is structured around the four research questions we introduced in §5.3. We start by analysing the impact of the architecture and sample rate, then analyse the impact of pre-processing parameters and finally the impact of pruning hyperparameters.

6.1. Impact of Architecture and Sample Rate

Audio keyword spotting developer benchmarks often use a 16kHz audio input (warden2018speech; mazumder2021multilingual). In practice many devices collect data at a lower sample rate of 8kHz (montanari2020eperceptive) due to hardware constraints. We thus trained models at two sample rates, 16kHz and 8kHz, for both architectures and all combinations of pre-processing parameters listed in Table 1.

Figure 3. Distributions of accuracy and reliability bias scores for two architectures (CNN, llCNN) and two sample rates (16kHz, 8kHz).

Figure 3 shows a summary of model performance for CNN and the light-weight low latency CNN (llCNN) architectures trained on 16kHz and 8kHz audio data. We evaluated model performance with five accuracy metrics: Cohen’s kappa coefficient, precision, recall, weighted F1 score and the Matthews Correlation Coefficient (MCC). The trends we observed are consistent across metrics and hence we present results only for the MCC metric. A higher MCC metric implies better prediction performance. In Figure 3(a) we show the distribution of results for the MCC accuracy metric, and in Figure 3(b) the distribution of results for reliability bias as defined in Equation 2. Our results show that the audio input sample rate affects both accuracy and reliability bias: CNN and llCNN architectures trained at 8kHz have a lower mean accuracy, and their mean reliability bias scores are 2.3 and 2.9 times higher (i.e. worse) than those of models trained at 16kHz. We also observe that model architecture has a significant effect on both accuracy and reliability bias, with the lightweight llCNN architecture being less accurate and having a higher reliability bias for the same sample rate, when compared to the CNN architecture.
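As a rough illustration of how the two reported quantities relate, the sketch below computes a multiclass MCC from label lists and then a group-level bias score. Equation 2 is not restated in this section, so the form used here, the sum over groups of the absolute log ratio between subgroup MCC and overall MCC, is an assumption, as are the function names and the toy labels.

```python
import math
from collections import Counter


def mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient computed from label lists."""
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))  # correctly classified samples
    t_k = Counter(y_true)  # how often each class truly occurs
    p_k = Counter(y_pred)  # how often each class is predicted
    labels = set(t_k) | set(p_k)
    numerator = c * s - sum(t_k[k] * p_k[k] for k in labels)
    denom = math.sqrt((s * s - sum(p_k[k] ** 2 for k in labels))
                      * (s * s - sum(t_k[k] ** 2 for k in labels)))
    return numerator / denom if denom else 0.0


def reliability_bias(y_true, y_pred, groups):
    """Sum over groups of |ln(group MCC / overall MCC)|. Assumed form of the metric."""
    overall = mcc(y_true, y_pred)
    if overall <= 0:
        raise ValueError("log ratio is undefined for a non-positive overall score")
    total = 0.0
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        score = mcc([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if score <= 0:
            raise ValueError(f"log ratio is undefined for group {g!r}")
        total += abs(math.log(score / overall))
    return total


# Hypothetical toy labels: predictions are perfect for one group, imperfect for the other.
y_true = ["yes", "no", "yes", "no", "yes", "yes", "no", "no"]
y_pred = ["yes", "no", "yes", "no", "yes", "no", "no", "no"]
groups = ["male"] * 4 + ["female"] * 4
print(mcc(y_true, y_pred), reliability_bias(y_true, y_pred, groups))
```

With this form, a score of zero means every subgroup performs exactly as well as the overall population, and larger values indicate larger disparities in either direction.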

Delving deeper into these findings, we analyze the relationship between subgroup (male/female) accuracy and overall accuracy for both CNN and llCNN architectures. In Figure 4, each data point represents the accuracy or reliability bias score for a single model trained with a unique combination of pre-processing parameters. Points that lie on the dotted black diagonal correspond to the models for which subgroup accuracy equals the overall accuracy. Points above the diagonal have a better, and points below have a worse subgroup accuracy than overall.

Figure 4. Accuracy scores for males (pink) and females (green). Each data point represents the accuracy score for a single experiment with a unique combination of pre-processing parameters. On the black diagonal the subgroup accuracy equals the overall accuracy.

It is easy to see that for both architectures, the scores for male speakers (green points) lie closer to the diagonal than those for female speakers (pink points), which suggests that the models have lower magnitude of bias (computed from Eq. 1) for male speakers as compared to female speakers. Models trained with CNN architectures appear to be more prejudiced against female speakers (more pink points below the diagonal) whereas models trained with llCNN favor female speakers. Interestingly, we can now see the important role that pre-processing parameters play in the model’s performance; depending on the choice of the pre-processing parameter, the model’s accuracy and bias can vary significantly (as evident by the high variance in the dots along the diagonal). This effect is more pronounced for the lightweight llCNN architectures and at the lower sample rate (8kHz). This leads us to the next section, where we analyze the role of individual pre-processing parameters on the model performance.

6.2. Impact of Pre-processing Parameters

Having studied the effect of the sample rate, we now investigate pre-processing parameters, the next design choice listed in Table 1. Pre-processing parameters determine the feature input in audio processing and are thus important design choices during feature extraction. We consider two feature types and their dimensionality, as well as three temporal parameters: frame length, frame step and the window type. To evaluate which pre-processing parameters have a significant impact on model accuracy and bias, we used a univariate linear regression test. Accounting for the 1726 degrees of freedom of all possible combinations of pre-processing parameters, we reject the null hypothesis at a 1% significance level.
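The univariate linear regression test described above can be sketched as follows: for one parameter at a time, the squared correlation between the parameter and the score is converted into an F-statistic with 1 and n - 2 degrees of freedom. The function name is hypothetical, and categorical parameters such as feature type or window type would first need a numeric encoding, which is not shown. The usage example borrows the mean bias values for the 16kHz CNN from Table 5 purely as input data; because it uses six means rather than per-model scores, it does not reproduce the F-scores in Table 2.

```python
def f_statistic(x, y):
    """F-statistic of a univariate linear regression of y on x.

    Uses F = r^2 / (1 - r^2) * (n - 2), with 1 and n - 2 degrees of freedom.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r2 = (sxy * sxy) / (sxx * syy)
    return r2 / (1.0 - r2) * (n - 2)


# Number of Mel filter banks and mean model bias for the 16kHz CNN (Table 5).
mel_fbanks = [20, 26, 32, 40, 60, 80]
mean_bias = [6.7e-03, 9.5e-03, 1.2e-02, 1.5e-02, 2.2e-02, 1.8e-02]
print(f_statistic(mel_fbanks, mean_bias))
```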

Architecture | 16kHz CNN F-score | p-value | 16kHz llCNN F-score | p-value | 16kHz CNN F-score | p-value | 16kHz llCNN F-score | p-value
# Mel fbanks | 384.032* | 5.2e-71 | 44.344* | 4.9e-11 | 46.449* | 1.8e-11 | 13.090* | 3.1e-4
# MFCCs | 0.242 | 6.2e-1 | 101.267* | 1.3e-22 | 25.534* | 5.3e-7 | 0.252 | 6.2e-1
feature type | 2.041 | 1.5e-1 | 392.356* | 2.9e-72 | 43.179* | 8.6e-11 | 12.018* | 5.5e-4
frame length | 4.668 | 3.1e-2 | 8.705* | 3.3e-3 | 20.003* | 8.8e-6 | 3.386 | 6.6e-2
frame step | 16.065* | 6.6e-5 | 0.094 | 7.6e-1 | 2.648 | 1.0e-1 | 2.773 | 9.6e-2
window type | 3.927 | 4.8e-2 | 0.726 | 3.9e-1 | 9.199* | 2.5e-3 | 0.180 | 6.7e-1
Table 2. Pre-processing parameter importance for model accuracy and fairness for audio data sampled at 16kHz.

The F-scores and p-values for architectures trained on 16kHz audio input are presented in Table 2. Starred values reject the null hypothesis, implying that the pre-processing parameters of these experiments impact accuracy or bias at a 1% significance level. We can see that feature type and dimensionality have a disproportionate influence on bias and accuracy. Tables 5 to 8 in the Appendix show the mean and standard deviation of bias and accuracy scores for MFCC and log Mel spectrogram features across feature dimensions. Log Mel spectrogram input features with 20 Mel filterbanks produce models with the lowest reliability bias across architectures and sample rates. If we use MFCC as input features instead, the reliability bias scores increase at least 1.4 and 2.5 times over the least biased log Mel spectrogram models, for 16kHz llCNN and CNN architectures respectively. For models trained at 8kHz we observe similar trends. Figure 8 in the Appendix visualizes these results.

Figure 5. Accuracy scores for males (x-axis) and females (y-axis) for MFCC (purple) and log Mel spectrogram (aqua) feature types.

To gain a practical understanding of how pre-processing parameters affect reliability bias, we show the impact of feature type on male and female subgroup accuracy in Figure 5. On the dotted black diagonal model accuracy is equal for males and females. Points above the diagonal represent models that perform better for females, and points below the diagonal are models that perform better for males. It is immediately striking that subgroup accuracy depends strongly on the feature type, with models trained with MFCC (cyan) features performing better for males, and log Mel spectrograms (purple) better for females. This affirms that the choice of feature type strongly impacts reliability bias. Our results are supported by literature in speech science that demonstrates the importance of using different feature types for males and females (Mazairafernandez2015improving), and recent work that illustrates the necessity to consider alternative features to MFCCs (Liu2021Optimized).

Even though pre-processing parameters such as the feature type significantly impact reliability bias and accuracy, models can be accurate and unbiased. For all architectures and sample rates in Figure 5, there are models that lie on or close to the diagonal. This suggests that pre-processing parameters exist that produce accurate and unbiased models. However, these models do not necessarily have the highest accuracy score. We thus considered reliability bias as a selection criterion to explore alternative models for deployment. In Table 3 we show accuracy (MCC score) and reliability bias for the best models selected according to three different selection criteria: selecting only for accuracy, selecting only for fairness (i.e. lowest reliability bias), and selecting the fairest model with an accuracy drop of at most 1% when compared to the highest accuracy. By accepting this drop in accuracy, we can reduce reliability bias across architectures and sample rates. For the CNN architectures, reliability bias is reduced 15.7 and 1.7 fold for models trained with 16kHz and 8kHz sample rates respectively. For the 8kHz llCNN model, reliability bias is reduced 22.3 fold. The model with the highest accuracy for the 16kHz llCNN architecture also has the lowest reliability bias and thus experiences no reduction. Models selected using only fairness as the selection criterion, on the other hand, incur a performance drop between 3.2% and 6.1%, which is considerably greater than our 1% tolerance.

model selection | metric | 16kHz CNN | 8kHz CNN | 16kHz llCNN | 8kHz llCNN
accuracy | MCC score | 0.877 | 0.868 | 0.804 | 0.778
accuracy | reliability bias | 1.2e-2 | 9.8e-3 | 6.6e-4 | 4.1e-2
fairness | MCC score | 0.849 | 0.815 | 0.762 | 0.740
fairness | reliability bias | 1.8e-4 | 1.9e-4 | 1.2e-4 | 1.6e-4
accuracy + fairness | MCC score | 0.872 | 0.861 | 0.804 | 0.775
accuracy + fairness | reliability bias | 7.7e-4 | 5.9e-3 | 6.6e-4 | 1.8e-3
Table 3. Comparison of MCC scores and reliability bias for 16kHz and 8kHz models selected for accuracy, fairness, and fairness within a tolerable accuracy range. Accepting a marginal drop in accuracy (up to 1%) can reduce bias considerably.
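The three selection criteria compared in Table 3 can be expressed as a small selection routine. The sketch below is illustrative only: the function and model names are hypothetical, and the 1% tolerance is interpreted as a relative drop from the highest MCC score, which is an assumption. The candidate values are the 16kHz CNN entries from Table 3.

```python
def select_models(candidates, tolerance=0.01):
    """Pick one model per selection criterion.

    `candidates` is a list of dicts with keys 'name', 'mcc' and 'bias'.
    `tolerance` is treated as a relative accuracy drop (an assumption).
    """
    most_accurate = max(candidates, key=lambda m: m["mcc"])
    fairest = min(candidates, key=lambda m: m["bias"])
    # Keep only models within the tolerated drop, then take the least biased one.
    floor = most_accurate["mcc"] * (1.0 - tolerance)
    eligible = [m for m in candidates if m["mcc"] >= floor]
    balanced = min(eligible, key=lambda m: m["bias"])
    return {
        "accuracy": most_accurate,
        "fairness": fairest,
        "accuracy + fairness": balanced,
    }


# 16kHz CNN values from Table 3; the model names are hypothetical.
candidates = [
    {"name": "model_a", "mcc": 0.877, "bias": 1.2e-2},
    {"name": "model_b", "mcc": 0.849, "bias": 1.8e-4},
    {"name": "model_c", "mcc": 0.872, "bias": 7.7e-4},
]
for criterion, model in select_models(candidates).items():
    print(criterion, model["name"])
```

Treating bias as a constraint inside a fixed accuracy budget, rather than optimizing it alone, is what keeps the accuracy cost bounded.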

6.3. Impact of Pruning Hyper-Parameters

Pruning hyperparameters determine the pruning process, which increases model sparsity and reduces storage, memory and bandwidth requirements when downloading models to devices. We investigate the impact of pruning hyperparameters in Table 1 on accuracy and reliability bias. The top 3 models per architecture and sample rate for each model selection strategy described in Table 3 were selected. We then applied pruning as a post-training step, pruning model weights with low magnitude while also fine-tuning the model. We evaluate hyperparameter importance with a univariate linear regression. Accounting for 9826 degrees of freedom of all possible pruning hyperparameter combinations for all models and the effect of the unpruned model, we reject the null hypothesis at a 1% significance level.
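To illustrate the pruning hyperparameters discussed above, the following sketch shows a polynomial decay sparsity schedule and a magnitude-based pruning step on a flat list of weights. It is a simplified, framework-free illustration: the function names, the cubic exponent and the step counts are assumptions, and the fine-tuning between pruning steps is omitted.

```python
def polynomial_decay_sparsity(step, begin_step, end_step,
                              initial_sparsity, final_sparsity, power=3):
    """Target sparsity at a training step under a polynomial decay schedule.

    power=3 is an assumed exponent, not a value reported in the paper.
    """
    if step <= begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power


def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the lowest-magnitude fraction set to zero."""
    n_prune = int(round(sparsity * len(weights)))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Hypothetical run: prune every 10 steps towards a final sparsity of 80%.
weights = [0.5, -0.1, 0.05, -0.9, 0.3, -0.02, 0.7, 0.01, -0.4, 0.2]
for step in range(0, 101, 10):
    target = polynomial_decay_sparsity(step, 0, 100, 0.0, 0.8)
    weights = magnitude_prune(weights, target)
print(weights)
```

A constant sparsity schedule corresponds to returning `final_sparsity` at every step, and the pruning frequency in Table 1 corresponds to the interval between pruning steps in the loop.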

Figure 6. Post-training pruning hyperparameter importance for accuracy (MCC) and reliability bias. The dashed horizontal line indicates the 1% significance threshold. Colourful bars highlight hyperparameters that impact accuracy or reliability bias at a 1% significance level. Grey bars have no significant impact.

The results of the pruning hyperparameter importance analysis are visualized in Figure 6. Our results reveal two surprising insights. Firstly, contrary to our expectations, the final sparsity of the pruned model has no significant impact on its accuracy or reliability bias. While this finding may be specific to our experiment setup, it opens an opportunity in the design space to choose a sparsity that meets hardware and bandwidth constraints. Secondly, we found that the learning rate significantly impacts accuracy and reliability bias. It is thus an important design choice not to be overlooked in the development workflow.

(a) Impact of pruning learning rate on accuracy scores for males (x-axis) and females (y-axis)
(b) Post pruning change in reliability bias for models selected under different strategies
Figure 7. Impact of post-processing design choices

We examine the effect of the pruning learning rate on accuracy for males and females in Figure 7(a) to gain a tangible understanding of its impact on bias. The dotted diagonal represents equal accuracy for male and female subgroups. It is easy to see that the smaller learning rates of 1e-05 and 1e-04 generate pruned models that favour females, while the larger learning rate of 1e-03 favours males. We do not suggest that the value of the learning rate inherently favours one subgroup over the other. Rather, our results indicate that the learning rate optimises the discovery of structure in the training data to favour one subgroup over the other. The learning rate thus needs to be empirically validated and optimised during pruning to avoid unintended bias. At present, developer guidelines suggest the opposite of what our findings reveal: to evaluate models at various sparsities and a single learning rate.

To conclude our experiments, we reflect on the model selection strategies, and consider whether post-training pruning can generate pruned models that are accurate and unbiased. In Figure 7(b) we show the density distribution of Δ (i.e. change in) reliability bias for the three selection strategies. Reliability bias decreases in the direction of negative change, so distributions to the left of zero are desirable. We can see that models selected for accuracy mostly become less biased. On the other hand, models selected with the fairness strategy started with a very low reliability bias, but their bias score increases after pruning. Models selected for accuracy + fairness did not necessarily result in more accurate and less biased models after pruning than those selected for accuracy only.

selection strategy | accuracy | accuracy | accuracy | accuracy | accuracy + fairness | accuracy + fairness | accuracy + fairness | accuracy + fairness
metric | MCC score | MCC score | reliability bias | reliability bias | MCC score | MCC score | reliability bias | reliability bias
 | mean | var | mean | var | mean | var | mean | var
16kHz CNN | 0.892 | 1.4e-03 | 1.1e-02 | 3.8e-03 | 0.888 | 2.5e-03 | 1.5e-03 | 1.1e-03
8kHz CNN | 0.882 | 1.2e-03 | 9.7e-03 | 3.8e-03 | 0.876 | 1.1e-03 | 1.3e-03 | 1.3e-03
16kHz llCNN | 0.822 | 1.1e-03 | 1.3e-02 | 4.6e-03 | 0.817 | 2.4e-03 | 2.8e-03 | 1.8e-03
8kHz llCNN | 0.809 | 7.8e-04 | 4.8e-03 | 3.2e-03 | 0.804 | 3.3e-03 | 7.6e-04 | 4.8e-04
Table 4. Mean and variance of MCC scores and reliability bias across six pruning sparsities (0.2, 0.5, 0.75, 0.8, 0.85, 0.9) for two model selection strategies. Fairer models can be selected for all sparsities at an accuracy cost of less than 1%.

Based on this analysis, we suggest that if initial model training is followed by pruning, it suffices to use accuracy as a selection strategy to select several models with high accuracy scores after training. During pruning, hyperparameters should be treated as design choices and evaluated across a range of reasonable values. Selecting a pruned model on accuracy alone will, however, not guarantee an unbiased model. In Table 4 we show the mean and variance of accuracy and reliability bias across all sparsities for each architecture and sample rate, for pruned models selected with the accuracy and accuracy + fairness strategies. For all models the variance of metrics across sparsities is low, supporting our observation that final sparsity has little impact on accuracy and reliability bias in our experiments. Our analysis also shows that mean reliability bias can be improved by an order of magnitude by considering fairness during model selection. As with pre-processing parameters, we thus conclude that less biased models can be selected after pruning for each sparsity at a marginal cost to accuracy.

7. Design Choices for Fairer On-Device Machine Learning

We conducted empirical experiments for an audio keyword spotting task to investigate the impact of a comprehensive set of design choices on reliability bias in on-device ML. Below we summarize how design choices impact reliability bias and make actionable suggestions for developers to navigate the complex on-device ML workflow with fairness in mind.

Data gathering. Our results show that audio KWS models trained on a higher sampling rate (16kHz) are more accurate and less biased than those trained on a lower sampling rate (8kHz). This is true for male and female speaker subgroups. However, accuracy scores for females have greater variance and lower mean values than those for males. If ML developers have control over the data gathering stage, they can focus their data collection efforts on end devices that support higher sampling rates. Alternatively, they can inform the hardware design of end devices to include microphone components with the desired sampling rate.

Model training. This stage involves many design choices that influence reliability bias and accuracy (e.g. input features and model architecture). Our results indicate that the mean and variance in reliability bias are greater for models with a light-weight architecture. This is true for male and female speaker subgroups. Feature type and dimensions have a greater impact on reliability bias and accuracy than temporal pre-processing parameters. We found on average that log Mel spectrogram features produce less biased audio KWS models than MFCC features. However, MFCC features perform better for males while log Mel spectrograms work better for females. Developers should thus consider the application context and user demographics when designing input features. We also recommend that developers iterate through design choices to determine which parameter values provide an acceptable trade-off between prediction performance and reliability bias. By considering fairness during model selection, pre-processing parameters can be chosen to train fairer models at only a small cost to accuracy.

Interventions. We found the pruning learning rate to be the post-training pruning hyperparameter with the most significant impact on accuracy and reliability bias. In our case study, choosing a smaller learning rate value results in pruned models that favour females, while a large value favours males. Selecting several models for optimization, iterating over optimization parameters (e.g. pruning hyperparameters) and considering a measure such as reliability bias as a satisficing metric allows developers to achieve a trade-off between accuracy and bias when applying interventions for model optimization. We recommend that developers apply model selection strategies that consider accuracy and fairness after pruning to deploy fairer models with only a small cost to accuracy.

8. Discussion and Limitations

Having summarised the quantitative results and recommendations in the previous section, we now take a higher level perspective to reflect on the overarching implications and limitations of our work on bias and fairness in on-device ML.

Reliability Bias as a Source of Unfairness in On-device ML. We have investigated the propagation of bias through design choices in the on-device ML workflow, and identified reliability bias as a source of unfairness. Reliability bias arises from disparate on-device ML performance due to demographic attributes of users, and results in systematic device failure across user groups. We quantified reliability bias drawing on definitions of group fairness, and used the metric in empirical experiments to evaluate the impact of design choices on bias in an audio keyword spotting task, a dominant application of on-device ML. Our results validate that seemingly innocuous design choices – a light-weight architecture, the data sample rate, pre-processing parameters of input features, and pruning hyper-parameters for model compression – can result in disparate predictive performance across male and female groups.

Propagation of Reliability Bias through Design Choices. Focusing on a specific case study allowed us to rigorously investigate bias within a constrained scope to gain insights on how careful consideration of design choices can help build fairer systems. Based on our findings we do not promote bias as an immutable property of a particular model. Instead, we position that bias arises from design choices that amplify or reduce disparate predictive performance across demographic groups. We also suggest that using a bias metric (such as the one we propose) as a satisficing metric in the on-device ML development workflow enables developers to consider the trade-offs between accuracy and bias, and can help reduce performance disparities across users.

Extending Reliability Bias to Hardware Performance and New Modalities. We have focused our evaluation of bias on predictive performance. In on-device applications, system efficiency is another important performance metric. For example, a keyword spotting system with poor predictive performance can require several user attempts to activate the system. This can increase computations, which leads to increased power consumption and faster drainage of a device’s battery. Reliability bias should thus also be considered for hardware performance. The bias measure that we have proposed can easily be extended to characterise reliability bias due to system (in)efficiency. We will investigate this in future work. Additionally, we note that our empirical study is focused on audio-based ML. Although audio is a prominent data modality in on-device ML, we are cognizant that other data types (e.g. images) are also used. Future work can extend our methodology to different modalities and new learning tasks to investigate reliability bias in them.

Limitations. Considered from the overarching objective of fair AI, we acknowledge limitations in our study. As argued in (Balayn2021Beyond), we recognize that bias is not the only type of harm that can arise in on-device ML. Furthermore, measures of group fairness have shortcomings: they do not account for performance differences between individuals within groups, and assume that groups can be determined in advance. Even if this is possible, constructing groups remains a normative design decision that requires careful consideration (Mitchell2020Diversity). In our audio keyword spotting case study we construct groups based on a speaker’s sex. Sex is just one of many demographic attributes that influence the human voice (Singh2019Profiling) and can result in bias. Our investigation of bias across groups is thus by no means comprehensive, but it provides sufficient evidence to highlight the urgency of addressing bias in on-device settings. Thus, despite these limitations, studying bias in the emerging field of on-device ML is an important research direction for the fairness community.

9. Conclusion

Billions of devices deploy on-device ML today. Despite bias and fairness being a major area of concern in traditional ML, they have not been considered in on-device ML settings. Biased performance impacts device reliability, and can result in systematic device failures due to performance disparities across user groups. This can have significant consequences for users. Our study of bias propagation through design choices in the on-device ML workflow is the first study of bias in this emerging domain, and lays an important foundation for building fairer on-device ML systems.

We thank Roel Dobbe, Sem Nouws and Dewant Katare for their feedback and useful suggestions on the work.


Appendix A

A.1. Impact of Pre-processing Parameters

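architecture | | 16k CNN | 16k CNN | 16k CNN | 16k CNN | 16k CNN | 16k CNN | 16k low latency CNN | 16k low latency CNN | 16k low latency CNN | 16k low latency CNN | 16k low latency CNN | 16k low latency CNN
Mel fbanks | | 20 | 26 | 32 | 40 | 60 | 80 | 20 | 26 | 32 | 40 | 60 | 80
accuracy | mean | 8.6e-01 | 8.6e-01 | 8.5e-01 | 8.4e-01 | 8.2e-01 | 7.9e-01 | 7.8e-01 | 7.6e-01 | 7.4e-01 | 7.1e-01 | 6.5e-01 | 6.0e-01
accuracy | std | 9.6e-03 | 9.3e-03 | 8.5e-03 | 8.3e-03 | 8.9e-03 | 1.4e-02 | 9.9e-03 | 8.7e-03 | 1.0e-02 | 8.8e-03 | 1.3e-02 | 1.5e-02
model bias | mean | 6.7e-03 | 9.5e-03 | 1.2e-02 | 1.5e-02 | 2.2e-02 | 1.8e-02 | 1.4e-02 | 1.4e-02 | 1.5e-02 | 4.2e-02 | 5.0e-02 | 2.8e-02
model bias | std | 4.3e-03 | 6.7e-03 | 7.7e-03 | 8.5e-03 | 7.8e-03 | 1.1e-02 | 8.4e-03 | 1.1e-02 | 1.0e-02 | 1.4e-02 | 1.6e-02 | 1.7e-02
Table 5. Mean and standard deviation of accuracy and model bias across log Mel filterbanks at 16k audio input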
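architecture | | 8k CNN | 8k CNN | 8k CNN | 8k CNN | 8k CNN | 8k CNN | 8k low latency CNN | 8k low latency CNN | 8k low latency CNN | 8k low latency CNN | 8k low latency CNN | 8k low latency CNN
Mel fbanks | | 20 | 26 | 32 | 40 | 60 | 80 | 20 | 26 | 32 | 40 | 60 | 80
accuracy | mean | 8.5e-01 | 8.4e-01 | 8.3e-01 | 8.1e-01 | 7.7e-01 | 7.4e-01 | 7.5e-01 | 7.2e-01 | 6.8e-01 | 6.5e-01 | 5.8e-01 | 5.3e-01
accuracy | std | 7.1e-03 | 7.7e-03 | 9.6e-03 | 1.1e-02 | 1.4e-02 | 1.6e-02 | 1.1e-02 | 9.7e-03 | 1.5e-02 | 1.2e-02 | 1.1e-02 | 1.7e-02
model bias | mean | 1.1e-02 | 2.4e-02 | 2.7e-02 | 3.1e-02 | 3.1e-02 | 1.9e-02 | 1.8e-02 | 4.0e-02 | 5.0e-02 | 4.9e-02 | 2.5e-02 | 2.3e-02
model bias | std | 7.5e-03 | 7.8e-03 | 9.5e-03 | 7.7e-03 | 1.2e-02 | 1.2e-02 | 1.0e-02 | 1.1e-02 | 1.4e-02 | 1.8e-02 | 1.7e-02 | 1.5e-02
Table 6. Mean and standard deviation of accuracy and model bias across log Mel filterbanks at 8k audio input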
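architecture | | 16k CNN | 16k CNN | 16k CNN | 16k CNN | 16k CNN | 16k low latency CNN | 16k low latency CNN | 16k low latency CNN | 16k low latency CNN | 16k low latency CNN
MFCCs | | 10 | 11 | 12 | 13 | 14 | 10 | 11 | 12 | 13 | 14
accuracy | mean | 8.3e-01 | 8.3e-01 | 8.4e-01 | 8.3e-01 | 8.4e-01 | 7.6e-01 | 7.6e-01 | 7.6e-01 | 7.6e-01 | 7.6e-01
accuracy | std | 2.6e-02 | 3.8e-02 | 2.4e-02 | 3.3e-02 | 2.2e-02 | 1.1e-02 | 1.1e-02 | 1.1e-02 | 9.8e-03 | 9.6e-03
model bias | mean | 2.3e-02 | 1.7e-02 | 2.2e-02 | 2.3e-02 | 2.1e-02 | 2.0e-02 | 1.9e-02 | 2.4e-02 | 2.6e-02 | 2.2e-02
model bias | std | 1.3e-02 | 1.2e-02 | 1.2e-02 | 1.2e-02 | 1.3e-02 | 1.4e-02 | 1.4e-02 | 1.3e-02 | 1.4e-02 | 1.3e-02
Table 7. Mean and standard deviation of accuracy and model bias across MFCCs at 16k audio input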
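architecture | | 8k CNN | 8k CNN | 8k CNN | 8k CNN | 8k CNN | 8k low latency CNN | 8k low latency CNN | 8k low latency CNN | 8k low latency CNN | 8k low latency CNN
MFCCs | | 10 | 11 | 12 | 13 | 14 | 10 | 11 | 12 | 13 | 14
accuracy | mean | 8.2e-01 | 8.2e-01 | 8.2e-01 | 8.2e-01 | 8.2e-01 | 7.6e-01 | 7.6e-01 | 7.6e-01 | 7.5e-01 | 7.5e-01
accuracy | std | 3.0e-02 | 4.3e-02 | 3.0e-02 | 3.0e-02 | 2.6e-02 | 1.1e-02 | 1.0e-02 | 1.0e-02 | 9.2e-03 | 9.2e-03
model bias | mean | 3.0e-02 | 2.8e-02 | 3.0e-02 | 3.0e-02 | 2.9e-02 | 4.4e-02 | 4.4e-02 | 4.7e-02 | 4.4e-02 | 4.4e-02
model bias | std | 1.4e-02 | 1.6e-02 | 1.4e-02 | 1.4e-02 | 1.5e-02 | 1.4e-02 | 1.6e-02 | 1.5e-02 | 1.5e-02 | 1.5e-02
Table 8. Mean and standard deviation of accuracy and model bias across MFCCs at 8k audio input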
Figure 8. Pre-processing parameter importance for model accuracy (MCC) and fairness scores (dashed line shows the 1% significance threshold; grey bars have no significant impact)