Trustworthy AI: A Computational Perspective

by   Haochen Liu, et al.
Michigan State University

In the past few decades, artificial intelligence (AI) technology has experienced swift developments, changing everyone's daily life and profoundly altering the course of human society. The intention of developing AI is to benefit humans by reducing human labor, bringing everyday convenience to human lives, and promoting social good. However, recent research and AI applications show that AI can cause unintentional harm to humans, such as making unreliable decisions in safety-critical scenarios or undermining fairness by inadvertently discriminating against one group. Thus, trustworthy AI has attracted immense attention recently. It requires careful consideration to avoid the adverse effects that AI may bring to humans, so that humans can fully trust and live in harmony with AI technologies. Recent years have witnessed a tremendous amount of research on trustworthy AI. In this survey, we present a comprehensive review of trustworthy AI from a computational perspective, to help readers understand the latest technologies for achieving trustworthy AI. Trustworthy AI is a large and complex area, involving various dimensions. In this work, we focus on six of the most crucial dimensions in achieving trustworthy AI: (i) Safety & Robustness, (ii) Non-discrimination & Fairness, (iii) Explainability, (iv) Privacy, (v) Accountability & Auditability, and (vi) Environmental Well-Being. For each dimension, we review the recent related technologies according to a taxonomy and summarize their applications in real-world systems. We also discuss the accordant and conflicting interactions among different dimensions and discuss potential aspects for trustworthy AI to investigate in the future.


1. Introduction

Artificial intelligence (AI), a science that studies and develops the theory, methodology, technology, and application systems for simulating, extending, and expanding human intelligence, has brought revolutionary impact to modern human society. From a micro view, AI plays an irreplaceable role in many aspects of our lives. Modern-day life is filled with interactions with AI applications, from unlocking a cell phone with face ID and talking to a voice assistant to buying products recommended by e-commerce platforms. From a macro view, AI creates great economic outcomes. The Future of Jobs Report 2020 from the World Economic Forum (Forum, 2020) predicts that AI will create 58 million new jobs in five years. By 2030, AI is expected to produce extra economic profits of 13 trillion US dollars, which contribute 1.2% annual growth to the GDP of the whole world (Bughin et al., 2018).

However, along with their rapid and impressive development, AI systems have also exposed their untrustworthy sides. For example, safety-critical AI systems are shown to be vulnerable to adversarial attacks. Deep image recognition systems in autonomous vehicles could fail to distinguish road signs that have been modified by malicious attackers (Xu et al., 2020b), which poses a great threat to the safety of the passengers. In addition, AI algorithms can cause bias and unfairness. Online AI chatbots could produce indecent, racist, and sexist content (Wolf et al., 2017), which offends users and has a negative social impact. Moreover, AI systems carry the risk of disclosing user privacy and business secrets. Hackers can take advantage of the feature vectors produced by an AI model to reconstruct private input data, such as fingerprints (Al-Rubaie and Chang, 2019), thereby leaking a user's sensitive information. These vulnerabilities can make existing AI systems unusable and can cause severe economic and security consequences. Concerns around trustworthiness have become a huge obstacle that AI must overcome to advance as a field, become more widely adopted, and create more economic value. Hence, how to build trustworthy AI systems has become a focal topic in both academia and industry.

In recent years, a large body of literature on trustworthy AI has emerged. With the increasing demand for building trustworthy AI, it is imperative to summarize existing achievements and discuss possible directions. In this survey, we provide a comprehensive overview of trustworthy AI, to help newcomers get a basic understanding of what makes an AI system trustworthy and to help veterans track the latest progress in trustworthy AI. We clarify the definition of trustworthy AI and introduce six key dimensions of trustworthy AI. For each dimension, we present its concepts and taxonomies and review representative algorithms. In addition, we also introduce possible interactions among different dimensions and discuss other potential issues about trustworthy AI that have not drawn sufficient attention yet. In addition to definitions and concepts, our survey focuses on the specific computational solutions for realizing each dimension of trustworthy AI. This perspective makes it distinct from some extant related works, such as a government guideline (Smuha, 2019), which suggests how to build a trustworthy AI system in the form of laws and regulations, or reviews (Brundage et al., 2020; Thiebes et al., 2020), which discuss the realization of trustworthy AI in a high-level non-technical perspective.

Figure 1. Six key dimensions of trustworthy AI.

According to a recent ethics guideline for AI provided by the European Union (EU) (Smuha, 2019), a trustworthy AI system should meet four ethical principles: respect for human autonomy, prevention of harm, fairness, and explicability. Based on these four principles, AI researchers, practitioners, and governments propose various specific dimensions for trustworthy AI (Smuha, 2019; Thiebes et al., 2020; Brundage et al., 2020). In this survey, we focus on six important and concerning dimensions that have been extensively studied. As shown in Figure 1, they are Safety & Robustness, Non-discrimination & Fairness, Explainability, Privacy, Auditability & Accountability, and Environmental Well-Being.

The remainder of this survey is organized as follows. In Section 2, we first articulate the definition of trustworthy AI. We provide various definitions of trustworthy AI, to help readers understand how a trustworthy AI system is defined by researchers from different disciplines, such as computer science, sociology, law, and business. Then we distinguish the concept of trustworthy AI from several related concepts such as ethical AI and responsible AI.

In Section 3, we will detail the dimension of Safety & Robustness. Safety & Robustness requires an AI system to be robust to noisy perturbations of the inputs and to be able to make secure decisions. In recent years, many studies have shown that AI systems, especially those that adopt deep learning models, can be very sensitive to intentional or unintentional perturbations of the inputs, posing huge risks to safety-critical applications. For example, as described before, autonomous vehicles can be fooled by altered road signs. Additionally, spam detection models can be fooled by emails with well-designed text (Barreno et al., 2010). Thus, spam senders can take advantage of this weakness to make their emails immune to the detection system, which would cause a bad user experience.

It has been demonstrated that AI algorithms can learn the discrimination of humans via the provided training examples and make unfair decisions. For example, some face recognition algorithms have difficulties in detecting faces of African Americans (Rose, 2010) or misclassify them as “gorillas” (Howard and Borenstein, 2018). Moreover, voice-dictation software typically performs better when recognizing a voice from a male than that from a female (Rodger and Pendharkar, 2004). In Section 4, we will introduce the dimension of Non-discrimination & Fairness where an AI system is expected to avoid unfair bias towards certain groups or individuals.

In Section 5, we will discuss the dimension of Explainability. Explainability suggests that the decision mechanism of an AI system should be able to be explained to stakeholders. For example, AI techniques have been used for disease diagnosis based on the symptoms and physical features of a patient (Sajda, 2006). In such cases, a black-box decision is not acceptable. The inference process should be transparent to doctors and patients to ensure that the diagnosis is exact in every detail.

It has been found that some AI algorithms can store and expose users’ personal information. For example, dialogue models trained on human conversation corpus can remember sensitive information like credit card numbers, which can be elicited by interacting with the model (Henderson et al., 2018). Thus, an AI system should avoid the leakage of any private information. In Section 6, we will present the dimension of Privacy.

In Section 7, we will describe the dimension of Auditability & Accountability. With Auditability & Accountability, an AI system should be open to assessment by a third party, and it should be possible to hold someone responsible for an AI failure, especially in critical applications (Smuha, 2019).

Recently, the environmental impacts of AI have drawn people’s attention, since some large AI systems are found to be very energy-consuming. As a mainstream technology of AI, deep learning is moving towards the direction of pursuing larger models and more parameters. Accordingly, more storage and computation resources are consumed. A study (Strubell et al., 2019) shows that training a BERT model (Devlin et al., 2019) costs a carbon emission of around 1,400 pounds of carbon dioxide, which is comparable to that of a round-trip trans-America flight. Therefore, an AI system should be sustainable and environmentally friendly. In Section 8, we will review the dimension of Environmental Well-Being.

In Section 9, we will discuss the interactions among the different dimensions. Recent studies have demonstrated that there are both accordance and conflicts among different dimensions of trustworthy AI (Smuha, 2019; Whittlestone et al., 2019). For example, the robustness and explainability of deep neural networks are tightly connected: robust models tend to be more interpretable (Tsipras et al., 2018; Etmann et al., 2019) and vice versa (Noack et al., 2021). Moreover, it is shown that in some cases, there exists a trade-off between robustness and privacy. For instance, adversarial defense approaches can make a model more vulnerable to membership inference attacks, which increases the risk of training data leakage (Song et al., 2019).

In addition to the aforementioned six dimensions, there are more dimensions of trustworthy AI, such as human agency and oversight, creditability, etc. Though these additional dimensions are as important as the six dimensions mentioned here, they are in earlier stages of development with limited literature, especially for computational methods. Thus, in Section 10, we will discuss these dimensions of trustworthy AI as future directions needing dedicated efforts.

2. Concepts and Definitions

The word “trustworthy” is interpreted as “worthy of trust or confidence; reliable, dependable” in the Oxford English Dictionary or “able to be trusted” in the Cambridge Dictionary. “Trustworthy” derives from the word “trust”, which is described as the “firm belief in the reliability, truth, or ability of someone or something” in the Oxford English Dictionary or the “belief that you can depend on someone or something” in the Cambridge Dictionary. Broadly speaking, trust is a widespread notion in human society, which lays the important foundation for the sustainable development of human civilization. Strictly speaking, there always exist some potential risks in our external environment because we cannot completely control people and other entities in our environment (Mayer et al., 1995; Thiebes et al., 2020). It is our trust in these parties that allows us to willingly put ourselves at potential risk to continue interacting with them (Lee and See, 2004). Trust is necessary among people. It is the basis of a good relationship, and a necessity for people to live happily together and work efficiently together. In addition, trust is also vital between humans and technology. Without trust, humans will not be willing to utilize technology, which would undoubtedly impede the advancement of technology and prevent humans from enjoying the convenience brought by technology. Therefore, for a win-win situation between humans and technology, it is necessary to guarantee that the technology is trustworthy so that people can build trust in it.

The term “artificial intelligence” got its name from a workshop at a 1956 Dartmouth conference (Buchanan, 2005; McCarthy et al., 2006). There are numerous definitions of AI (Kok et al., 2009). Generally, AI denotes a program or a system that is able to cope with a real-world problem with human-like reasoning capability. For example, we can think of the field of image recognition within AI, which uses deep learning networks to recognize objects or people within images (Rawat and Wang, 2017). The past few decades have witnessed rapid and impressive development of AI, with breakthroughs happening in every corner of the field (Russell and Norvig, 2002; Sze et al., 2017). Furthermore, with the rapid development of big data and computational resources, AI has been broadly applied to many aspects of human lives, including economics, healthcare, education, transportation and so on, where it has revolutionized industries and achieved numerous feats. Considering the important role AI plays in modern society, it is necessary to make AI trustworthy so that humans can rely on AI with minimal concerns regarding its potential harm. Trust is essential in allowing the potential of AI to be fully realized, so that humans can fully enjoy the benefit and convenience of AI (Commission and others, 2019).

Table 1. A summary of principles for Trustworthy AI from different aspects.

Perspective | Principles
Technical | Accuracy, Robustness, Explainability
User | Availability, Usability, Safety, Privacy, Autonomy
Social | Law-abiding, Ethical, Fair, Accountable, Environmentally friendly

Due to its importance and necessity, trustworthy AI has drawn increasing attention, and there are numerous discussions and debates over its definition and extension (Commission and others, 2019). In this survey, we define trustworthy AI as programs and systems built to solve problems like a human, which bring benefits and convenience to people with no threat or risk of harm. We can further define trustworthy AI from the following three perspectives: the technical perspective, the user perspective and the social perspective. An overall description of these perspectives is summarized in Table 1.

  • From a technical perspective, trustworthy AI is expected to show the properties of accuracy, robustness, and Explainability. Specifically, the AI programs or systems should generate accurate output, consistent with the ground truth, as much as possible. This is also the first and most basic motivation for building them. Also, AI programs or systems should be robust to changes that are not supposed to affect their outcome. This is very important, since real environments where AI systems are deployed are usually very complex and volatile. Last, but not least, trustworthy AI must allow for explanation and analysis by humans, so that the potential risks and harm can be minimized. In addition, trustworthy AI should be transparent so people can better understand its mechanism.

  • From a user’s perspective, trustworthy AI should possess the properties of availability, usability, safety, privacy and autonomy. Specifically, AI programs or systems should be available for people whenever they need them, and these programs or systems should be easy to use for people with different backgrounds. More importantly, AI programs or systems must not harm people under any conditions, and should always put the safety of users first. In addition, trustworthy AI is supposed to protect the privacy of all users. It should deal with data storage very carefully and seriously. Last, but not least, the autonomy of trustworthy AI should always be under people’s control. In other words, it is always a human’s right to grant an AI system any decision-making power or withdraw the power at any time.

  • From a social perspective, trustworthy AI should be law-abiding, ethical, fair, accountable and environmentally friendly. Specifically, the AI programs or systems should operate in full compliance with all relevant laws and regulations. They are supposed to comply with the ethical principles of human society. Importantly, trustworthy AI should show non-discrimination towards people from various backgrounds. It should guarantee justice and fairness among all users. Also, trustworthy AI should be accountable, which means it is clear who would be responsible for each part of the AI system. Lastly, for the sustainable development and long-term prosperity of our civilization, AI programs and systems should be environmentally friendly. For example, they should limit energy consumption and cause minimal pollution.

Note that the above properties of the three perspectives are not independent of each other. Instead, they complement and reinforce each other.

There are numerous terminologies related to AI proposed recently, including ethical AI, beneficial AI, responsible AI, explainable AI, fair AI and so on. These terminologies share some overlap and distinction with trustworthy AI in terms of the intention and extension of the concept. Next, we briefly describe some related terminologies to help enhance the understanding of trustworthy AI.

  • Ethical AI (Floridi et al., 2018): An ethical framework of AI in (Floridi et al., 2018) specifies five core principles for ethical AI, including beneficence, non-maleficence, autonomy, justice and explicability. Also, twenty specific action points from four categories have been proposed to ensure continuous and effective efforts. They are assessment, development, incentivization and support.

  • Beneficial AI (Future of Life Institute, 2017): AI has undoubtedly brought people countless benefits, but to gain sustainable benefits from AI, 23 principles have been proposed in conjunction with the 2017 Asilomar conference. These principles are based on three aspects: research issues, ethics and values, and longer-term issues.

  • Responsible AI: A framework for the development of responsible AI has been proposed  (312; 142). It consists of 10 ethical principles: well-being, respect for autonomy, privacy and intimacy, solidarity, democratic participation, equity, diversity inclusion, prudence, responsibility and sustainable development. The Chinese National Governance Committee for the New Generation Artificial Intelligence has proposed a set of governance principles to promote the healthy and sustainable development of responsible AI. Eight principles have been listed as follows: harmony and human-friendly, fairness and justice, inclusion and sharing, respect for privacy, safety and controllability, shared responsibility, open and collaboration and agile governance.

  • Explainable AI (Adadi and Berrada, 2018): The basic aim of explainable AI is to open up the “black box” of AI, to offer a trustworthy explanation of AI to users. Also, it aims at proposing more explainable AI models, which can provide promising model performance and can be explained in non-technical terms at the same time, so that users can fully trust them and take full advantage of them.

  • Fair AI (Zou and Schiebinger, 2018): AI is designed by humans, and data plays a key role in most AI models, so it is easy for AI to inherit bias from its creators or input data. Without proper guidance and regulations, AI could be biased and unfair toward a certain group of people. Fair AI denotes AI that shows no discrimination towards people from any group. Its output should have little correlation with the traits of individuals, such as gender and ethnicity.

Overall, trustworthy AI has a very rich connotation from different perspectives. It contains the concepts of many existing terminologies including fair AI, explainable AI and so on. Also, there exist huge overlaps among the proposed concept of trustworthy AI and the concepts of ethical AI, beneficial AI and responsible AI. All of them aim at building up reliable AI that can benefit human society sustainably. However, some differences also exist among these concepts since they are proposed from different perspectives. For example, beneficial AI and responsible AI also include principles and requirements for the users and governors of AI, whereas in the proposed trustworthy AI, we mainly focus on the principles for the AI technology itself. Figure 2 illustrates the relations among these concepts.

Figure 2. The relation between trustworthy AI and related concepts.

3. Safety & Robustness

A trustworthy AI system should achieve stable and sustained high accuracy under different circumstances. Specifically, a model should be robust to small perturbations, since real-world data contains diverse types of noise. In recent years, many studies have shown that machine learning (ML) models can be fooled by small, designed perturbations, namely adversarial perturbations (Szegedy et al., 2013; Madry et al., 2017). From traditional machine learning classifiers (Biggio et al., 2013) to deep learning models, like CNN (Szegedy et al., 2013), GNN (Scarselli et al., 2008) or RNN (Zaremba et al., 2014), none of the models are sufficiently robust to such perturbations. This raises huge concerns when ML models are applied to safety-critical tasks, such as authentication (Chen et al., 2017), autonomous driving (Sitawarin et al., 2018), recommendation (Fan et al., 2021; Fang et al., 2018), AI health care (Finlayson et al., 2019), etc. To build safe and reliable ML models, studying adversarial examples and the underlying reasons is urgent and essential. In this section, we aim to introduce the concepts of robustness, how attackers design threat models, and how we develop different types of defense strategies. We first introduce the concepts for adversarial robustness. Then, we provide more details by introducing the taxonomy and giving examples for each category. After that, we discuss different adversarial attacks and defense strategies and introduce representative methods. Next, we introduce how adversarial robustness issues affect real-world AI systems. We also present existing related tools and surveys to help readers get easy access. Finally, we discuss some potential future directions in adversarial robustness.

3.1. Concepts and Taxonomy

In this subsection, we briefly describe the commonly used and fundamental concepts in AI robustness to illustrate an overview of adversarial attacks and defenses and introduce the taxonomy based on these concepts.

3.1.1. Threat Models

An adversarial threat model, which can be referred to as an adversarial attack, is a process in which an attacker tries to break the performance of ML models with fake training or test examples. The existence of adversarial attacks could lead to serious security concerns in a wide range of ML applications. Attackers can pursue their goals with different strategies, so threat models can be categorized into different types. Here we introduce different categories of threat models from different aspects, including when the attack happens, what knowledge an attacker can access, and what the goal of the adversary is.

Poisoning Attacks vs. Evasion Attacks. Whether an attack is a poisoning or an evasion attack depends on whether attackers modify the training samples or the test samples. A poisoning attack happens when attackers add fake samples to the training set of a classification model. These fake samples are intentionally designed to train a bad classifier, which either achieves poor overall performance (Biggio et al., 2012) or gives wrong predictions on certain test samples (Zügner et al., 2018). This type of attack can happen when the adversary has access to the training data, which is a realistic safety concern. For example, training data for an online shopping recommendation system is often collected from web users, among whom attackers may exist. A special case of the poisoning attack is the backdoor attack, in which a trigger known only to the attacker is added to some training examples, and these examples are assigned a wrong target label at the same time. Test samples carrying the trigger are then classified as the target label. This would cause severe harm to authentication systems, such as face recognition systems (Chen et al., 2017). An evasion attack happens in the test phase. Given a well-trained classifier, attackers aim to design small perturbations for test samples in order to elicit wrong predictions from the victim model. For instance, in the well-known example of Goodfellow et al. (2014), an image of a panda is correctly classified by the model, while its perturbed version is classified as a gibbon.
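To make the backdoor setting concrete, the following is a minimal sketch of how a poisoned training set could be assembled. It is an illustration under stated assumptions, not the procedure of any cited paper: the square corner trigger, the function names `add_trigger` and `poison_dataset`, and the toy all-zero images are hypothetical choices made here.

```python
import numpy as np

def add_trigger(images, patch_value=1.0, patch_size=3):
    """Stamp a small square in the bottom-right corner (hypothetical trigger design)."""
    stamped = images.copy()
    stamped[:, -patch_size:, -patch_size:] = patch_value
    return stamped

def poison_dataset(images, labels, target_label, poison_fraction=0.1, seed=0):
    """Copy the data, then give a random fraction of samples the trigger and the target label."""
    rng = np.random.default_rng(seed)
    n_poison = int(round(poison_fraction * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    poisoned_images = images.copy()
    poisoned_labels = labels.copy()
    poisoned_images[idx] = add_trigger(images[idx])
    poisoned_labels[idx] = target_label
    return poisoned_images, poisoned_labels, idx

# Toy data: 100 blank 8x8 "images" with labels 0..9 (assumed for illustration only).
images = np.zeros((100, 8, 8))
labels = np.arange(100) % 10
p_images, p_labels, idx = poison_dataset(images, labels, target_label=7)
```

A model trained on `p_images` and `p_labels` may learn to associate the corner patch with the target label; at test time the attacker would apply `add_trigger` to an input to activate the backdoor. Whether the association is actually learned depends on the model and the poison fraction, which this sketch does not evaluate.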

White-box attacks vs. Black-box attacks. According to the adversary’s knowledge, attacking methods can be categorized into white-box and black-box attacks. A white-box attack is a setting in which the adversary can utilize all information about the target model, including its architecture, parameters, gradient information, etc. Generally, the attacking process can be formulated as an optimization problem (Goodfellow et al., 2014; Carlini and Wagner, 2017b). With access to such white-box information, this problem is often much easier to solve with gradient-based methods. White-box attacks have been extensively studied because the disclosure of the model architecture and parameters helps people understand the weaknesses of ML models clearly, and thus these weaknesses can be analyzed mathematically. Under black-box attacks, no knowledge of the ML model's internals is available to adversaries. Adversaries can only feed input data and query the outputs of the model. One of the most common ways to perform black-box attacks is to keep querying the victim model and approximate the gradient through numerical differentiation methods. Compared to white-box attacks, black-box attacks are more practical because, in reality, ML models are less likely to be white-box due to privacy issues.

Targeted Attacks vs. Non-Targeted Attacks. In the image classification problem, threat models can be categorized by whether the adversary wants to obtain a pre-set label for a certain image. Under targeted attacks, a specified target prediction label is expected for each adversarial example in the test phase. For example, an identity thief may want to fool a face recognition system and pretend to be a specific important figure. In contrast, non-targeted attacks expect an arbitrary prediction label other than the real one. Note that in other data domains, like graphs, the definition of a targeted attack can be extended to misleading certain groups of nodes, without necessarily forcing the model to give a certain prediction.
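One common way to write the two goals as optimization problems is given below. The notation is chosen here for illustration and is an assumption rather than the survey's own: $f_\theta$ is the victim classifier, $\mathcal{L}$ a classification loss, $x$ the natural input with true label $y$, $t \neq y$ the attacker's target label, and $\epsilon$ a bound on the allowed perturbation.

$$\text{Non-targeted:} \quad \max_{x'} \; \mathcal{L}(f_\theta(x'), y) \quad \text{subject to} \quad \|x' - x\|_p \le \epsilon$$

$$\text{Targeted:} \quad \min_{x'} \; \mathcal{L}(f_\theta(x'), t) \quad \text{subject to} \quad \|x' - x\|_p \le \epsilon$$

The non-targeted attacker pushes the prediction away from the true label, while the targeted attacker pulls it toward the chosen label $t$.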

3.2. Victim Models

Victim models are the models that are attacked by the attacker. Victim models range from traditional machine learning models like SVM (Biggio et al., 2012) to Deep Neural Networks (DNNs), including Convolutional Neural Networks (CNNs) (LeCun et al., 1995), Graph Neural Networks (GNNs) (Scarselli et al., 2008), Recurrent Neural Networks (RNNs) (Zaremba et al., 2014), etc. Next, we briefly introduce the victim models that have been studied and have been shown to be vulnerable to adversarial attacks.

Traditional machine learning models.

One of the earliest robustness-related works checked the security of Naive Bayes classifiers (Dalvi et al., 2004). Later on, SVMs and naive fully-connected neural networks were shown to be vulnerable to attacks (Biggio et al., 2013). Recently, the adversarial robustness of tree-based models has also been raised as an open problem (Chen et al., 2019a).

Deep learning models.

In computer vision tasks, Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012) are one of the most widely used models for image classification problems. CNN models aggregate the local features from images to learn the representations of image objects and give predictions based on the learned representations. The vulnerability of deep neural networks was first reported for CNNs (Szegedy et al., 2013), and since then much work has shown that CNNs are not robust to adversarial attacks. Graph Neural Networks (GNNs) have been developed for graph-structured data and are used in many real-world systems, such as social networks and the natural sciences. There are works testing the robustness of GNNs (Zügner et al., 2018; Chen et al., 2018c; Dai et al., 2018; Bojchevski and Günnemann, 2019; Ma et al., 2019b) and trying to build robust GNNs (Jin et al., 2020). Taking the node classification problem as an example, existing works show that performance can be reduced significantly by slightly modifying node features, adding or deleting edges, or adding fake nodes (Zügner et al., 2018). Recurrent Neural Networks (RNNs) and their variants have been proposed to handle sequence data. Attacking RNN models, especially RNN models for text data, is a challenging topic due to the special properties of text data. To be specific, when adversaries generate adversarial image samples, they can guarantee perceptual similarity simply by adding constraints on the norm of the perturbation, whereas for text data, one also needs to consider semantic or phonetic similarity. Therefore, in addition to the commonly used optimization methods for attacking seq2seq translation models (Cheng et al., 2018), some heuristic approaches have been proposed to find substitute words with which to attack RNN-based dialogue generation models (Niu and Bansal, 2018).

Defense Strategies. Under adversarial attacks, there are different types of countermeasures to prevent the adversary from creating harmful effects. During the training stage, Adversarial Training aims to train a robust model by using adversarial samples in the training process. Certified Defense tries to achieve robustness over all perturbations within a certain bound. For defenses that happen at inference time, Adversarial Example Detection tries to distinguish adversarial examples from natural ones, so that users can reject the predictions on harmful examples.
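Adversarial training is often formalized as a min-max problem, following the robust optimization view associated with Madry et al. (2017). The notation below is an assumption made here for illustration: $\theta$ denotes the model parameters, $\mathcal{D}$ the data distribution, $\mathcal{L}$ the training loss, and $\delta$ a perturbation bounded by $\epsilon$ in some $l_p$ norm.

$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{\|\delta\|_p \le \epsilon} \mathcal{L}(\theta, x + \delta, y) \right]$$

The inner maximization searches for the worst-case perturbation of each training sample, and the outer minimization fits the model to those worst-case samples. Certified defenses, by contrast, aim to bound the inner maximum for all $\delta$ in the ball rather than approximate it with an attack.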

3.3. Representative Attack Methods

In this subsection, we introduce representative attack methods from two aspects: Evasion Attacks and Poisoning Attacks.

3.3.1. Evasion Attack

Evasion attacks happen at test time. We further group the attack methods based on the type of perturbation budget, i.e., pixel-constrained adversarial examples with a fixed $l_p$ norm bound and adversarial examples under other types of constraints.

$l_p$ bound Attacks. To guarantee perceptual similarity between the adversarial example and the natural example, the perturbation is normally constrained within an $l_p$ norm bound around the natural example. To find such a perturbation, the Projected Gradient Descent (PGD) adversarial attack (Madry et al., 2017) tries to calculate the adversarial example $x'$ that maximizes the loss function:

$$\max_{x'} \; \mathcal{L}(\theta, x') \quad \text{subject to} \quad \|x' - x\|_p \le \epsilon \ \text{ and } \ x' \in [0, 1]^m$$

Here, $\theta$ denotes the model parameters, $\epsilon$ is the perturbation budget, and $m$ is the dimension of the input sample $x$. This local maximum is calculated by gradient ascent: at each time step, a small gradient step is taken in the direction of increasing loss, and the result is projected back onto the $l_p$ norm bound. Currently, a representative and popular attack method is AutoAttack (Croce and Hein, 2020b), a strong evasion attack that ensembles four attacks, including three white-box attacks and one black-box attack. It therefore provides a more reliable evaluation of adversarial robustness.
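The PGD iteration described above (a gradient ascent step followed by a projection onto the norm ball) can be sketched as follows. This is a minimal illustration, not the implementation of Madry et al.: it assumes an $l_\infty$ bound, a toy logistic-regression model with made-up weights `w` and `b`, and inputs in $[0, 1]^m$; the function names are hypothetical.

```python
import numpy as np

def logistic_loss(x, y, w, b):
    """Binary cross-entropy of a logistic-regression model at input x."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def pgd_linf(x, y, w, b, eps=0.1, step=0.02, n_steps=20):
    """PGD under an l_inf budget eps, for inputs in [0, 1]^m (assumed setting)."""
    x_adv = x.copy()
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(w @ x_adv + b)))
        grad = (p - y) * w                        # gradient of the loss w.r.t. the input
        x_adv = x_adv + step * np.sign(grad)      # ascent step toward higher loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back onto the l_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid input range
    return x_adv

# Toy example with made-up weights and input.
rng = np.random.default_rng(0)
w, b = rng.normal(size=8), 0.0
x, y = rng.uniform(0.2, 0.8, size=8), 1
x_adv = pgd_linf(x, y, w, b)
```

For other $l_p$ norms only the projection step changes; the sign-gradient step used here is the usual choice for the $l_\infty$ case.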

There are also some works with special goals. Moosavi-Dezfooli et al. (2017) devise an algorithm that successfully misleads a classifier’s decision on almost all test images. It tries to find a single perturbation $\delta$ under a constraint $\|\delta\|_p \le \epsilon$ satisfying $f(x + \delta) \neq f(x)$ for any sample $x$ from the test distribution, where $f$ denotes the classifier. Such a perturbation makes the classifier give wrong decisions on most of the samples.

Beyond $l_p$ bound attacks. Recently, researchers have realized that the $l_p$ norm perturbation measurement is neither sufficient to cover real-world noise nor a perfect measurement of perceptual similarity. Some studies seek the minimal perturbation necessary to change the class of a given input with respect to the $l_p$ norm (Croce and Hein, 2020a). Other works propose different perturbation measurements, e.g., the Wasserstein distance (Wu et al., 2020b; Wong et al., 2019), to measure the changes of pixels.

3.3.2. Poisoning Attacks

As we introduced, poisoning attacks allow adversaries to take control of the training process.

Training Time Attack. In a training time attack, the perturbation happens only at training time. For example, the ‘poisoning frog’ attack inserts an adversarial image with its true label into the training set so that the trained model misclassifies target test samples (Shafahi et al., 2018). It generates the adversarial example $x'$ by solving the following problem:

$$x' = \arg\min_{x} \; \|Z(x) - Z(x_t)\|_2^2 + \beta \|x - x_b\|_2^2$$

Here, $Z(x)$ is the logits of the model for sample $x$, $x_t$ and $x_b$ are samples from the target class and the original (base) class, respectively, and $\beta$ weights the input-space term. The result $x'$ is similar to the base class in the input space while sharing similar predictions with the target class. As a concrete example, some cat training samples are intentionally given some bird features while still being labeled as cat during training. As a consequence, the model’s predictions on other bird images that contain these bird features are misled.
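A minimal sketch of this kind of feature-collision objective is given below. It assumes a linear logit map $Z(x) = Wx$ with a made-up matrix `W`, minimizes the squared logit distance to a target sample plus `beta` times the squared input distance to a base sample, and uses plain gradient descent rather than the optimizer of the cited work. All names are hypothetical.

```python
import numpy as np

def collision_objective(x, W, x_target, x_base, beta):
    """Squared logit distance to the target plus beta * squared input distance to the base."""
    return np.sum((W @ x - W @ x_target) ** 2) + beta * np.sum((x - x_base) ** 2)

def craft_poison(W, x_target, x_base, beta=0.1, n_steps=500):
    """Gradient descent on the collision objective, starting from the base sample."""
    # Step size 1/L, where L is the largest eigenvalue of the (constant) Hessian.
    lr = 1.0 / (2.0 * (np.linalg.norm(W, 2) ** 2 + beta))
    x = x_base.copy()
    for _ in range(n_steps):
        grad = 2.0 * W.T @ (W @ x - W @ x_target) + 2.0 * beta * (x - x_base)
        x = x - lr * grad
    return x

# Toy example: 6-dimensional inputs, 3 logits, random made-up values.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 6))
x_target, x_base = rng.normal(size=6), rng.normal(size=6)
x_poison = craft_poison(W, x_target, x_base)
```

A larger `beta` keeps the poison closer to the base sample in input space at the cost of a weaker collision in logit space.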

Backdoor Attack. The backdoor attack requires perturbations in both the training and the test data. A backdoor trigger known only to the attacker is inserted into the training data to mislead the classifier into giving a target prediction on all test examples that contain the same trigger (Chen et al., 2017). This type of attack is particularly dangerous because the model behaves normally on natural samples, which makes the attack harder to notice.
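The data-poisoning step of such an attack can be illustrated with a small sketch. It assumes grayscale images stored as a NumPy array, a square patch in the bottom-right corner as the trigger, and relabeling of the poisoned samples to the target class; the patch shape, poisoning rate, and function names are illustrative assumptions rather than details from the cited work.

```python
import numpy as np

def add_trigger(images, size=3, value=1.0):
    """Stamp a size x size patch into the bottom-right corner of each image."""
    out = images.copy()
    out[:, -size:, -size:] = value
    return out

def poison_dataset(images, labels, target_label, rate=0.1, seed=0):
    """Add the trigger to a random fraction of the data and relabel it as the target class."""
    rng = np.random.default_rng(seed)
    n_poison = int(rate * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images, labels = images.copy(), labels.copy()
    images[idx] = add_trigger(images[idx])
    labels[idx] = target_label
    return images, labels, idx

# Toy data: 100 random 8x8 images with 10 classes.
rng = np.random.default_rng(1)
clean_images = 0.5 * rng.uniform(size=(100, 8, 8))
clean_labels = rng.integers(0, 10, size=100)
poisoned_images, poisoned_labels, idx = poison_dataset(clean_images, clean_labels, target_label=7)
```

At test time the attacker would apply `add_trigger` to any input that should receive the target prediction, while inputs without the trigger are left untouched.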

3.4. Representative Defense Methods

In this subsection, we introduce representative defense methods from the aforementioned categories.

3.4.1. Robust Optimization / Adversarial Training

Adversarial training aims to train models whose predictions are resistant to adversarial examples. The training objective is formulated as a min-max problem that minimizes the error risk under the maximum adversarial loss within a small region around each training sample (Wang et al., 2020c). With this bi-level optimization process, the model achieves partial robustness but still suffers from longer training time, a trade-off between natural and robust accuracy, and robust overfitting. Several works improve standard adversarial training from different perspectives. The trade-off issue is revealed in (Tsipras et al., 2018). TRADES (Zhang et al., 2019) takes a step toward balancing natural accuracy and robust accuracy by adding a regularization term that minimizes the prediction difference between adversarial samples and natural samples. Other works (Wong et al., 2020; Shafahi et al., 2019) boost the training speed by estimating the gradient of the loss for the last few layers as a constant when generating adversarial samples. They can considerably shorten the GPU training time while retaining comparable robust performance. To mitigate robust overfitting, classic techniques such as early stopping, weight decay, and data augmentation have been investigated (Wu et al., 2020a; Rice et al., 2020). Recent work (Carmon et al., 2019) indicates that data augmentation is one promising direction for further boosting adversarial training performance.
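The min-max structure of adversarial training can be made concrete on a toy model. The sketch below assumes a linear logistic-regression classifier with labels in {-1, +1} and an $l_\infty$ budget `eps`; for such a linear model the inner maximization has the closed form $x' = x - \epsilon\, y\, \mathrm{sign}(w)$, so no iterative attack is needed. Deep models would instead use an attack such as PGD in the inner loop. All names and hyperparameters are illustrative assumptions.

```python
import numpy as np

def robust_margin(X, y, w, b, eps):
    """Worst-case margin y * (w.x' + b) over all x' with ||x' - x||_inf <= eps."""
    return y * (X @ w + b) - eps * np.sum(np.abs(w))

def adv_train_logreg(X, y, eps=0.1, lr=0.1, epochs=200):
    """Adversarial training of logistic regression (labels in {-1, +1})."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        # Inner maximization: exact worst-case perturbation for a linear model.
        X_adv = X - eps * y[:, None] * np.sign(w)[None, :]
        # Outer minimization: gradient step on the logistic loss at X_adv.
        margin = y * (X_adv @ w + b)
        coef = -y / (1.0 + np.exp(margin))   # d loss / d logit
        w -= lr * (X_adv.T @ coef) / n
        b -= lr * coef.mean()
    return w, b

# Toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
y = np.repeat([-1.0, 1.0], 100)
X = 2.0 * y[:, None] + 0.3 * rng.normal(size=(200, 2))
w, b = adv_train_logreg(X, y, eps=0.1)
robust_acc = np.mean(robust_margin(X, y, w, b, eps=0.1) > 0)
```

Here `robust_acc` is the fraction of training points that remain correctly classified under every perturbation within the budget, which for a linear model can be computed exactly from the worst-case margin.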

3.4.2. Certified Defense

Certified defense tries to learn provably robust DNNs against specific norm-bounded perturbations (Raghunathan et al., 2018; Wong and Kolter, 2018). Empirical defenses achieve robustness only to a certain extent, whereas certified robust verification aims to answer exactly whether an adversarial example exists for a given example and perturbation bound. For instance, a randomized-smoothing-based classifier (Cohen et al., 2019) aims to build a smooth classifier by making decisions according to the majority of predictions on neighborhood examples. Reaching such smoothness requires considerably greater computational resources, which is a challenge in practice.
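The majority-vote prediction behind randomized smoothing can be sketched as follows. The sketch assumes Gaussian noise with standard deviation `sigma` and a made-up base classifier; it covers only the prediction step and omits the statistical test and certified radius that a full certification procedure requires. Function names are hypothetical.

```python
import numpy as np

def smoothed_predict(base_classifier, x, sigma=0.5, n_samples=1000, n_classes=2, seed=0):
    """Predict by majority vote of the base classifier over Gaussian-perturbed copies of x."""
    rng = np.random.default_rng(seed)
    noisy = x[None, :] + sigma * rng.normal(size=(n_samples, x.size))
    votes = np.bincount(base_classifier(noisy), minlength=n_classes)
    return int(np.argmax(votes)), votes

def toy_classifier(batch):
    """Hypothetical base classifier: class 1 if the first coordinate is positive."""
    return (batch[:, 0] > 0).astype(int)

label, votes = smoothed_predict(toy_classifier, np.array([5.0, 0.0]))
```

The cost noted in the text is visible here: each smoothed prediction needs `n_samples` forward passes of the base classifier.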

3.4.3. Detection

To distinguish adversarial examples from the natural data distribution and prevent their harmful effects, researchers design detection algorithms. A common approach is to build another classifier that predicts whether a sample is adversarial. Gong et al. (2017) train a binary classification model to separate adversarial examples from natural samples and then build ML models on the samples recognized as natural. Other works detect adversarial samples based on statistical differences between the adversarial and natural sample distributions. Grosse et al. (2017) use a statistical test, the Maximum Mean Discrepancy (MMD) test, which checks whether two datasets are drawn from the same distribution, to decide whether a group of data points is natural or adversarial. However, Carlini and Wagner (2017a) show that evasion adversarial examples are not easily detected. By bypassing several detection methods, they find that those defenses lack thorough security evaluations and that there is no clear evidence that adversarial samples are intrinsically different from clean samples. Recent work proposes a new direction for black-box adversarial detection (Chen et al., 2020b): it infers the attacker’s purpose from the query history, setting a threshold on the distance between two input image queries to flag suspicious attempts to generate adversarial examples.
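As an illustration of the statistic behind an MMD test, the sketch below computes a biased estimate of the squared MMD between two sample sets. The RBF kernel, its bandwidth `gamma`, and the synthetic data are assumptions made for the example; a full two-sample test would additionally need a significance threshold, for instance from a permutation procedure, which is omitted here.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """RBF kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(X, Y, gamma=0.5):
    """Biased estimate of the squared Maximum Mean Discrepancy between X and Y."""
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma).mean())

# Synthetic stand-ins: two "natural" batches and one shifted batch.
rng = np.random.default_rng(0)
natural_a = rng.normal(size=(200, 2))
natural_b = rng.normal(size=(200, 2))
shifted = rng.normal(size=(200, 2)) + 2.0

same_dist = mmd2(natural_a, natural_b)
diff_dist = mmd2(natural_a, shifted)
```

A batch would be flagged as suspicious when its statistic against a reference set of natural samples is large relative to the values observed between natural batches.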

3.5. Applications in Real Systems

When deep learning is applied to real-world safety-critical tasks, the existence of adversarial examples becomes more dangerous and may cause severe consequences. Next, we illustrate the potential threats from adversarial examples to real-world applications from different domains.

3.5.1. Image Domain

In the autonomous driving domain, road sign detection is an important task. However, with some small modifications to the signs, the road sign detection system (Eykholt et al., 2017; Sitawarin et al., 2018) in a vehicle can misread a speed limit sign as showing a different limit, or fail to detect a stop sign, as shown in Figure 3. Deep learning is also widely applied in authentication tasks. An attacker can wear special glasses to impersonate an authorized identity and mislead the face recognition model, as long as a few face samples with the glasses, labeled as the target identity, are inserted into the training set (Chen et al., 2017). A person can also evade person detection by wearing an adversarial T-shirt (Xu et al., 2020b).

Figure 3. A stop sign with modifications that machines fail to recognize.

3.5.2. Text Domain

Adversarial attacks also happen in many natural language processing tasks, including text classification, machine translation, and dialogue generation. For machine translation, sentence and word paraphrasing of input texts is used to craft adversarial examples (Lei et al., 2018). This approach first builds a paraphrasing corpus that contains many word and sentence paraphrases. To find an optimal paraphrase of an input text, a greedy method searches the corpus for valid paraphrases of each word or sentence. Moreover, a gradient-guided method is proposed to improve the efficiency of the greedy search. Liu et al. (2019) treat the neural dialogue model as a black box and adopt a reinforcement learning framework to effectively find trigger inputs for targeted responses. The black-box setting is stricter but more realistic, while the requirements on the generated responses are relaxed accordingly: the generated responses are expected to be semantically identical to the targeted ones but need not match them exactly.

3.5.3. Audio Data

State-of-the-art speech-to-text transcription networks, such as DeepSpeech (Hannun et al., 2014), can be attacked by a small perturbation (Carlini and Wagner, 2018). Given any speech waveform $x$, an inaudible sound perturbation $\delta$ is added so that the synthesized speech $x + \delta$ is recognized as any desired target phrase. Saadatpanah et al. (2020) propose an adversarial attack on the YouTube copyright detection system that prevents copyrighted music from being detected when it is uploaded. The detection system uses a neural network to extract features from a music piece and create a fingerprint, which is used to check whether the music matches existing copyrighted music. By running a gradient-based adversarial attack on the original audio to create a large difference in the output fingerprint, the modified audio can successfully evade YouTube's copyright detection.

3.5.4. Graph Data

Zügner et al. (2018) consider attacking node classification models, namely Graph Convolutional Networks (GCNs) (Kipf and Welling, 2016), by modifying the node connections or node features. In this setting, an adversary is allowed to add or remove edges between nodes, or change the node features, with a limited number of operations in order to mislead the GCN model that is trained on the perturbed graph. The work in (Zügner and Günnemann, 2019) attempts to poison the graph so that the global node classification performance of the GCN drops, to the point of making the model almost useless. They optimize the graph structure as a hyper-parameter of the GCN model using meta-learning. The attack in (Bojchevski and Günnemann, 2018) perturbs the graph structure to corrupt the quality of the node embeddings, so that the performance of downstream tasks, including node classification and link prediction, is largely degraded.

3.6. Surveys and Tools

In this subsection, we list current surveys and tools on adversarial robustness to help the readers to get easy access to the resources.

3.6.1. Surveys

Xu et al. (2020b) give a comprehensive introduction to the concepts and go through representative attack and defense algorithms in different domains, including image classification, graph classification, and natural language processing. Among the surveys for a specific domain, Akhtar and Mian (2018) and Chakraborty et al. (2018) provide comprehensive introductions to adversarial threats in the computer vision domain, Jin et al. (2020) give a thorough review of the latest adversarial robustness techniques in the graph domain, and Zhang et al. (2020b) focus on natural language processing and summarize important algorithms for adversarial robustness in the text domain.

3.6.2. Tools


is a PyTorch toolbox that contains many popular attack methods in the image domain.

DeepRobust is a comprehensive and up-to-date library of adversarial attacks and defenses based on PyTorch. It includes algorithms not only for the image domain but also for the graph domain. The platform provides convenient access to different algorithms as well as evaluation functions that illustrate the robustness of image classification models or graph properties. RobustBench provides a robustness evaluation platform based on the AutoAttack algorithm for different adversarially trained models. It also provides well-trained robust models obtained with different adversarial training methods, which can save resources for researchers.

3.7. Future Directions

For adversarial attacks, researchers are seeking more general attack methods for evaluating adversarial robustness. For black-box attacks, efficiently generating adversarial examples with fewer queries remains challenging. For adversarial defenses, an important issue for adversarial training is robust overfitting and the lack of generalization on both adversarial and natural examples. This problem remains unsolved and needs further work. Another direction for adversarial training is to build models that are robust to more general adversarial examples, including but not limited to different $l_p$ bound attacks. For certified defenses, one direction is to train models with robustness guarantees more efficiently, since current certified defense methods require a large amount of computational resources.

4. Non-discrimination & Fairness

A trustworthy AI system ought to avoid discriminatory behaviors in human-machine interaction and ensure fairness in decision-making for any individuals or groups. With the rapid spread of AI systems in our daily lives, more and more evidence demonstrates that AI systems show human-like discriminatory bias or make unfair decisions. For example, Tay, the online AI chatbot developed by Microsoft, produced many improper racist and sexist comments, which led to its closure within 24 hours of release (Wolf et al., 2017); dialogue models trained on human conversations show bias towards females and African Americans by generating more offensive and negative responses for these groups (Liu et al., 2020a). Moreover, recidivism prediction software used by US courts often assigns a higher risk score to an African American than to a Caucasian with a similar profile (Mehrabi et al., 2019); a job recommendation system promotes more STEM employment opportunities to male candidates than to females (Lambrecht and Tucker, 2019). As AI plays an increasingly irreplaceable role in automating our lives, fairness in AI is closely related to our vital interests and demands considerable attention. Recently, many works have emerged in this field to define, recognize, measure, and mitigate the bias in AI algorithms. In this section, we aim to give a comprehensive overview of the cutting-edge research progress addressing fairness issues in AI. In the following subsections, we first present concepts and definitions regarding fairness in AI. Then, we provide a detailed taxonomy to discuss different origins of algorithmic bias and different types of bias and fairness. Afterward, we review and classify popular bias mitigation technologies for building fair AI systems. Next, we introduce the specific bias issues and the applications of bias mitigation methods in real-world AI systems.
In this part, we categorize the works according to the types of data processed by the system. Finally, we discuss the current challenges and future opportunities in this field. We expect that researchers and practitioners can gain a sense of direction and understanding from a broad overview of bias and fairness issues in AI along with a deep insight into the existing solutions, so as to advance the progress of this field.

4.1. Concepts and Taxonomy

Before we go deeper into non-discrimination and fairness in AI, we first need to understand how related concepts, such as bias and fairness, are defined in this context. In this subsection, we briefly illustrate the concepts of bias and fairness, and provide a taxonomy that introduces different sources of bias and different types of bias and fairness.

4.1.1. Bias

In the machine learning field, the word “bias” is overloaded: it conveys different meanings in different contexts. We first distinguish the concept of “bias” in the context of AI non-discrimination and fairness from that in other contexts. There are three categories of bias: productive bias, erroneous bias, and discriminatory bias. Productive bias exists in all machine learning algorithms. It is beneficial and necessary for an algorithm to be able to model the data and make decisions (Hildebrandt, 2019). According to the “no free lunch” theorem (Wolpert and Macready, 1997), a predictive model can achieve better performance on certain distributions or functions only if it is biased towards them. Productive bias is the kind of bias that helps an algorithm solve certain types of problems. It is introduced through our assumptions about the problem, which are reflected in the choice of a loss function, an assumed distribution, an optimization method, etc. Erroneous bias can be viewed as a systematic error caused by faulty assumptions. For example, we typically assume that the distribution of the training data is consistent with the real data distribution. However, due to selection bias (Marlin et al., 2012) or sampling bias (Mehrabi et al., 2019), the collected training data may not represent the real data distribution. The violation of this assumption can lead to undesirable performance of the learned model on the test data. Discriminatory bias is the kind of bias we are interested in under AI non-discrimination and fairness. As the opposite of fairness, discriminatory bias reflects an algorithm’s unfair behavior towards a certain group or individual, such as producing discriminatory content for some people or performing less well for some people (Shah et al., 2020). In the rest of this paper, when we mention “bias”, we mean discriminatory bias.

Sources of Bias. The bias in an AI system can be produced by different sources, namely, the data, the algorithm, or the evaluation method. Bias within data comes from different phases of data generation, from data annotation, data collection to data processing (Olteanu et al., 2019; Shah et al., 2019). In the phase of data annotation, bias can be introduced due to a non-representative group of annotators (Joseph et al., 2017), inexperienced annotators (Plank et al., 2014), or preconceived stereotypes held by the annotators (Sap et al., 2019). In the phase of data collection, bias can emerge due to the selection of data sources or how data from several different sources are acquired and prepared (Olteanu et al., 2019). In the data processing stage, bias can be generated due to data cleaning (Denny and Spirling, 2016), data enrichment (Cohen and Ruths, 2013), and data aggregation (Tufekci, 2014).

Types of Bias. Bias can be categorized into different classes from different perspectives. Bias can be explicit or implicit. Explicit bias, also known as direct bias, occurs when the sensitive attribute explicitly causes an undesirable outcome for an individual, while implicit bias, also known as indirect bias, refers to an undesirable outcome caused by non-sensitive and seemingly neutral attributes that in fact have some association with the sensitive attributes (Zhang et al., 2017). For example, the residential address seems to be a non-sensitive attribute, but it can correlate with the race of a person according to the population distribution of different ethnic groups (Zhang et al., 2017). Moreover, language style can reflect the demographic features of a person, such as race and age (Huang et al., 2020; Liu et al., 2021a). Bias can also be acceptable or unacceptable. Acceptable bias, also known as explainable bias, describes a situation where the discrepancy in outcomes for different individuals or groups can be reasonably explained by some factors. For example, models trained on the UCI Adult dataset predict higher salaries for males than for females. This can be explained by the fact that, in this dataset, males work more hours per week than females (Kamiran and Žliobaitė, 2013). With this fact in mind, the biased outcomes are considered acceptable and reasonable. In contrast, bias that cannot be explained appropriately is treated as unacceptable bias, which should be avoided in practice.

4.1.2. Fairness

The fairness of an algorithm is defined as “the absence of any prejudice or favoritism towards an individual or a group based on their intrinsic or acquired traits in the context of decision-making” (Saxena et al., 2019; Mehrabi et al., 2019). Furthermore, according to the object of study, fairness can be divided into group fairness and individual fairness.

Group Fairness. Group fairness requires that two groups of people with different sensitive attributes receive statistically comparable treatments and outcomes. Based on this principle, various definitions have been proposed, such as Equal Opportunity (Hardt et al., 2016a), which requires people from two groups to be equally likely to get a positive outcome when they indeed belong to the positive class; Equalized Odds (Hardt et al., 2016a), which requires that the probability of being classified correctly should be the same for different groups; and Demographic Parity (Dwork et al., 2012), which requires different groups to have the same chance of getting a positive outcome.
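These group fairness notions can be turned into simple gap measurements. The sketch below assumes binary predictions, binary labels, and a binary sensitive attribute `s`, and reports absolute differences between the two groups; the function names and the choice of an absolute-difference summary are illustrative assumptions, and equalized odds is measured here as the larger of the true-positive-rate and false-positive-rate gaps.

```python
import numpy as np

def demographic_parity_diff(y_pred, s):
    """Gap in positive-prediction rates between the two groups."""
    return abs(y_pred[s == 0].mean() - y_pred[s == 1].mean())

def equal_opportunity_diff(y_true, y_pred, s):
    """Gap in true positive rates between the two groups."""
    tpr = [y_pred[(s == g) & (y_true == 1)].mean() for g in (0, 1)]
    return abs(tpr[0] - tpr[1])

def equalized_odds_diff(y_true, y_pred, s):
    """Larger of the true-positive-rate gap and the false-positive-rate gap."""
    gaps = []
    for label in (0, 1):
        rates = [y_pred[(s == g) & (y_true == label)].mean() for g in (0, 1)]
        gaps.append(abs(rates[0] - rates[1]))
    return max(gaps)

# Made-up toy data: four individuals per group.
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
s      = np.array([0, 0, 0, 0, 1, 1, 1, 1])

dp_gap = demographic_parity_diff(y_pred, s)
eo_gap = equal_opportunity_diff(y_true, y_pred, s)
eodds_gap = equalized_odds_diff(y_true, y_pred, s)
```

A gap of zero means the corresponding criterion holds exactly on this sample; in practice a small tolerance is usually allowed.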

Individual Fairness. While group fairness can maintain fair outcomes for a group of people, a model can still behave discriminatorily at the individual level (Dwork et al., 2012). Individual fairness is based on the intuition that similar individuals should be treated similarly. A model satisfies individual fairness if it gives similar predictions to similar individuals (Dwork et al., 2012; Kusner et al., 2017). Formally, if individuals $i$ and $j$ are similar under a certain metric $\delta$, the difference between the predictions given by an algorithm $M$ on them should be small enough: $|f_M(i) - f_M(j)| < \epsilon$, where $f_M(\cdot)$ is the predictive function of algorithm $M$ that maps an individual to an outcome, and $\epsilon$ is a small constant.

4.2. Methods

In this subsection, we introduce bias mitigation techniques. Based on the stage of the AI pipeline at which they intervene, debiasing methods can be categorized into three types: pre-processing, in-processing, and post-processing methods. Representative bias mitigation methods are summarized in Table 2.

| Category | Strategy | References |
|---|---|---|
| Pre-processing | Sampling | (Zhang and Neill, 2016; Adler et al., 2018; Bastani et al., 2019) |
| Pre-processing | Reweighting | (Calders and Žliobaitė, 2013; Kamiran and Calders, 2012; Zhang et al., 2020a) |
| Pre-processing | Blinding | (Hardt et al., 2016b; Chen et al., 2018b; Chouldechova and G’Sell, 2017; Zafar et al., 2017) |
| Pre-processing | Relabelling | (Hajian and Domingo-Ferrer, 2012; Kamiran and Calders, 2012; Cowgill and Tucker, 2017) |
| Pre-processing | Adversarial Learning | (Adel et al., 2019; Feng et al., 2019; Kairouz et al., 2019) |
| In-processing | Reweighting | (Krasanakis et al., 2018; Jiang and Nachum, 2020) |
| In-processing | Regularization | (Feldman et al., 2015b; Aghaei et al., 2019) |
| In-processing | Bandits | (Liu et al., 2017; Ensign et al., 2018) |
| In-processing | Adversarial Learning | (Zhang et al., 2018; Celis and Keswani, 2019; Liu et al., 2020b, 2021a) |
| Post-processing | Thresholding | (Hardt et al., 2016a; Menon and Williamson, 2017; Iosifidis et al., 2019) |
| Post-processing | Transformation | (Kilbertus et al., 2017; Nabi and Shpitser, 2018; Chiappa, 2019) |
| Post-processing | Calibration | (Hébert-Johnson et al., 2017; Kim et al., 2018) |

Table 2. Representative debiasing strategies in the three categories.

Pre-processing Methods. Pre-processing approaches try to remove the bias in the training data to ensure the fairness of an algorithm from the origin (Kamiran and Calders, 2012). This category of methods can be adopted only when we have access to the training data. Various strategies have been proposed to intervene in the training data. Specifically, Celis et al. (2016) propose to adaptively sample instances that are both diverse in features and fair with respect to sensitive attributes for training. Moreover, reweighting methods (Kamiran and Calders, 2011; Zhang et al., 2020a) try to mitigate the bias in training data by adaptively up-weighting the training instances of underrepresented groups while down-weighting those of overrepresented groups. Blinding methods try to make a classifier insensitive to a protected variable. For example, Hardt et al. (2016b) force a classifier to have the same threshold value for different race groups, to ensure that the predicted loan rate is equal for all races. Some works (Kamiran and Calders, 2011; Zemel et al., 2013) relabel the training data so that the proportion of positive instances is equal across all protected groups. In addition, Xu et al. (2018) take advantage of a generative adversarial network to produce bias-free and high-utility training data.
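The reweighting idea can be sketched with one widely used weight rule, assumed here for illustration: each instance with sensitive attribute $s$ and label $y$ receives the weight $P(s)P(y)/P(s, y)$, estimated from the data, so that the weighted data look as if $s$ and $y$ were independent. The cited works may define their weights differently, and the function name is hypothetical.

```python
import numpy as np

def reweighing_weights(s, y):
    """Instance weights P(s) * P(y) / P(s, y), estimated from the data."""
    w = np.empty(len(y), dtype=float)
    for g in np.unique(s):
        for c in np.unique(y):
            cell = (s == g) & (y == c)
            if cell.any():
                w[cell] = (s == g).mean() * (y == c).mean() / cell.mean()
    return w

def weighted_positive_rate(w, s, y, group):
    """Weighted fraction of positive labels within one group."""
    in_group = s == group
    return np.sum(w[in_group & (y == 1)]) / np.sum(w[in_group])

# Made-up toy data: group 0 has few positives, group 1 has many.
s = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y = np.array([1, 0, 0, 0, 1, 1, 1, 0])
w = reweighing_weights(s, y)
rate_0 = weighted_positive_rate(w, s, y, group=0)
rate_1 = weighted_positive_rate(w, s, y, group=1)
```

Under these weights the rare combinations (positives in group 0, negatives in group 1) are up-weighted, and the weighted positive rates of the two groups coincide; the weights can then be passed to any learner that accepts per-sample weights.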

In-processing Methods. In-processing approaches address the bias at the algorithm level and try to eliminate bias during the model training process. They often seek a balance between performance and fairness (Caton and Haas, 2020). Krasanakis et al. (2018) propose an in-processing re-weighting approach: they first train a vanilla classifier to learn the weights of samples, and then retrain the classifier using these weights. Some works (Kamishima et al., 2012; Goel et al., 2018a) take advantage of regularization methods, where one or more penalty terms are added to the objective function to penalize biased outcomes. The idea of adversarial learning is also adopted in in-processing debiasing methods. Liu et al. (2020b) design an adversarial learning framework to train neural dialogue models that are free from gender bias. Alternatively, bandits have recently emerged as a novel idea for solving fairness problems. For example, Joseph et al. (2016) propose to solve the fairness problem under a stochastic multi-armed bandit framework, with fairness metrics as the rewards and the individuals or groups under investigation as the arms.

Post-processing Methods. Post-processing approaches directly transform the model’s outputs to ensure fair final outcomes. Hardt et al. (2016b) propose approaches that determine group-specific threshold values, via measures such as equalized odds, to balance the true and false positive rates across protected groups while minimizing the expected classifier loss. Feldman et al. (2015a) propose a transformation method to learn a new fair representation of the data. Specifically, they transform the SAT score into a distribution of the rank order of the students that is independent of gender. Pleiss et al. (2017) borrow the idea of calibration to build fair classifiers. Following the traditional definition of calibration, under which the proportion of positive predictions should equal the proportion of positive examples, they force this condition to hold for each group of people. Nevertheless, they also find that there is a tension between prediction accuracy and calibration.
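Group-specific thresholding can be illustrated with a deliberately simplified sketch. It assumes real-valued classifier scores and picks, for each group, the score quantile that gives the same target true positive rate; this quantile rule is an assumption made for the example and is not the procedure of the cited works, which also account for false positive rates and the classifier loss. All names are hypothetical.

```python
import numpy as np

def equal_tpr_thresholds(scores, y_true, s, target_tpr=0.5):
    """Per-group score thresholds chosen so that each group reaches the target TPR."""
    thresholds = {}
    for g in np.unique(s):
        pos_scores = scores[(s == g) & (y_true == 1)]
        thresholds[int(g)] = float(np.quantile(pos_scores, 1.0 - target_tpr))
    return thresholds

def predict_with_thresholds(scores, s, thresholds):
    """Apply each individual's group-specific threshold to its score."""
    cut = np.array([thresholds[int(g)] for g in s])
    return (scores >= cut).astype(int)

def true_positive_rate(y_true, y_pred, s, group):
    return y_pred[(s == group) & (y_true == 1)].mean()

# Made-up toy scores: group 1 receives systematically higher scores.
scores = np.array([0.2, 0.4, 0.6, 0.8, 0.1, 0.3, 0.5, 0.6, 0.7, 0.9, 0.4, 0.8])
y_true = np.array([1,   1,   1,   1,   0,   0,   1,   1,   1,   1,   0,   0])
s      = np.array([0,   0,   0,   0,   0,   0,   1,   1,   1,   1,   1,   1])

thresholds = equal_tpr_thresholds(scores, y_true, s)
y_pred = predict_with_thresholds(scores, s, thresholds)
tpr_0 = true_positive_rate(y_true, y_pred, s, group=0)
tpr_1 = true_positive_rate(y_true, y_pred, s, group=1)
```

Because the thresholds are fitted after training, this kind of method needs access only to the model's scores and the sensitive attribute, not to the training process.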

4.3. Applications in Real Systems

In this subsection, we summarize studies of bias and fairness issues in real-world AI systems across different tasks. We introduce the works in the order of their data domains: tabular data, images, text, audio, and graphs. For each domain, we describe several representative tasks and present how AI systems can be biased on these tasks. A summary of the representative works can be found in Table 3.

| Domain | Task | References |
|---|---|---|
| Tabular Data | Classification | (Kamiran and Calders, 2009; Calders et al., 2009; Calders and Verwer, 2010; Hardt et al., 2016b; Goel et al., 2018b; Menon and Williamson, 2018) |
| Tabular Data | Regression | (Berk et al., 2017; Agarwal et al., 2019) |
| Tabular Data | Clustering | (Backurs et al., 2019; Chen et al., 2019b) |
| Image Data | Image Classification | (Pachal, 2015) |
| Image Data | Face Recognition | (Buolamwini and Gebru, 2018; Howard and Borenstein, 2018) |
| Image Data | Object Detection | (Ryu et al., 2017) |
| Text Data | Text Classification | (Kiritchenko and Mohammad, 2018; Park et al., 2018; Dixon et al., 2018; Borkan et al., 2019; Zhang et al., 2020a; Huang et al., 2020) |
| Text Data | Embedding | (Bolukbasi et al., 2016a; Brunet et al., 2019; Gonen and Goldberg, 2019; Zhao et al., 2019; Papakyriakopoulos et al., 2020; May et al., 2019b) |
| Text Data | Language Modeling | (Bordia and Bowman, 2019a; Sheng et al., 2019; Lu et al., 2020; Gehman et al., 2020; Yeo and Chen, 2020) |
| Text Data | Machine Translation | (Vanmassenhove et al., 2019; Stanovsky et al., 2019; Cho et al., 2019; Basta et al., 2020; Gonen and Webster, 2020) |
| Text Data | Dialogue Generation | (Liu et al., 2020a; Dinan et al., 2020; Curry et al., 2020) |
| Audio Data | Speech Recognition | (Rodger and Pendharkar, 2004; Carty, 2011; Tatman, 2016; Howard and Borenstein, 2018) |
| Graph Data | Node Embedding | (Bose and Hamilton, 2019) |
| Graph Data | Graph Modeling | (Dai and Wang, 2021) |

Table 3. A summary of bias detection works in different data domains.

4.3.1. Tabular Domain

Tabular data is the most common format of data in machine learning, and thus research on bias in machine learning is predominantly conducted on tabular data. In the past decade, researchers have investigated how algorithms can be biased in classification, regression, and clustering tasks. For classification, researchers find evidence that machine learning models for credit prediction (Kamiran and Calders, 2009) and recidivism prediction (Chouldechova, 2017) can show significant prejudice towards certain demographic attributes of a person, such as race and gender. Berk et al. (2017) and Agarwal et al. (2019) investigate multiple regression tasks, from salary estimation to crime rate prediction, and show unfair treatment of different races and genders. Backurs et al. (2019) and Chen et al. (2019b) evaluate fairness in clustering algorithms, based on the belief that, as data points, different groups of people are entitled to be clustered with the same accuracy.

4.3.2. Image Domain

Machine learning models in computer vision have shown unfair behaviors as well. In (Buolamwini and Gebru, 2018; Howard and Borenstein, 2018), the authors showed that face recognition systems work better for white faces than for darker faces. An image classification application developed by Google was accused of labeling black people as “gorillas” (Pachal, 2015). The work in (Howard and Borenstein, 2018) balanced the dataset for face recognition tasks to alleviate gender bias. In (Ryu et al., 2017), the authors employed a transfer learning method and improved smiling detection against gender and race discrimination. The work in (Zhao et al., 2017) tackled social bias in visual semantic role labeling, e.g., associating cooking roles with women, by introducing corpus-level constraints for calibrating existing structured prediction models. In (Wang et al., 2020c), a visual recognition benchmark is designed for studying bias mitigation.

4.3.3. Text Domain

A large number of works have shown that algorithmic bias exists in various natural language processing tasks. Word embeddings often exhibit stereotypical human biases, which poses a serious risk of perpetuating problematic biases in important societal contexts. In (Bolukbasi et al., 2016b), the authors first showed that popular state-of-the-art word embeddings regularly map men to working roles and women to traditional gender roles, leading to significant gender bias in word embeddings and even in downstream tasks. Following the research on word embeddings, the same patterns of gender bias were discovered in sentence embeddings (May et al., 2019a). In the task of coreference resolution, researchers demonstrated in (Zhao et al., 2018) that rule-based, feature-based, and neural-network-based coreference systems all show gender bias by linking gendered pronouns to pro-stereotypical entities with higher accuracy than to anti-stereotypical entities. Language models can also learn gender discrimination from human-written text data (Bordia and Bowman, 2019b): they tend to generate certain words reflecting gender stereotypes with different probabilities in the context of males and females. As for machine translation, it has been illustrated that Google’s translation system suffers from gender bias, showing favoritism toward males for stereotypical fields such as STEM jobs when translating sentences built from U.S. Bureau of Labor Statistics job positions in a dozen gender-neutral languages into English (Prates et al., 2019). Dialogue systems, including generative models and retrieval-based models, also show bias towards different genders and races by producing discriminatory responses (Liu et al., 2020a, b).

4.3.4. Audio Domain

Voice recognition systems show gender bias by performing differently when processing the voices of men and women (Howard and Borenstein, 2018). It was found that medical voice-dictation systems recognize voice inputs from males with higher accuracy than those from females (Rodger and Pendharkar, 2004). It was shown in (Carty, 2011) that voice control systems in vehicles worked better for males than for females. Google’s speech recognition software can understand queries from male voices more consistently than those from females (Tatman, 2016).

4.3.5. Graph Domain

ML applications on graph-structured data are ubiquitous in the real world. The fairness issues in these problems are drawing increasing attention from researchers. Existing graph embedding techniques can learn node representations that are correlated with protected attributes, such as age and gender. Consequently, they exhibit bias towards certain groups in real-world applications like social network analysis and recommendations (Bose and Hamilton, 2019). Graph neural networks (GNNs) also show their ability to inherit bias from training data and even magnify the bias through the graph structures and message-passing mechanism of GNNs (Dai and Wang, 2021).

4.4. Surveys and Tools

In this subsection, we gather the existing surveys, tools, and repositories on fairness in AI to help readers further explore this field.

4.4.1. Surveys

The problem of fairness has been studied in multiple disciplines other than computer science for more than half a century. In one survey (Hutchinson and Mitchell, 2019), the authors trace the evolution of the notions and measurements of fairness in different fields, such as education and hiring, over the past 50 years. They provide a comprehensive comparison between the past and current definitions to encourage a deeper understanding of modern fairness in AI. Zliobaite (2015) provides an early survey on measuring indirect discrimination in machine learning. In this survey, the author reviews early approaches for measuring bias in data and predictive models, analyzes the measurements from other fields, and explores the possibility of using them in the context of machine learning. Corbett-Davies and Goel (2018) provide a critical review of the measurements of fairness, where they show the limitations of the existing fairness criteria in classification tasks in machine learning. Mehrabi et al. (2019) contribute a comprehensive survey on bias and fairness in machine learning. In this survey, the authors provide a detailed taxonomy of the bias and fairness definitions in machine learning, and also introduce the bias observed in the data and algorithms in different domains of AI, as well as the state-of-the-art debiasing methods. Caton and Haas (2020) provide an overview of the existing debiasing approaches for building fair machine learning models. They organize the extant works into 3 categories and 11 method areas and introduce them following their taxonomy. Moreover, there are some surveys regarding bias and fairness in specific domains of AI. Blodgett et al. (2020) review the papers analyzing bias in NLP systems. They provide critical comments on such works and point out that many existing works suffer from unclear and inconsistent motivations and irrational reasoning. They also offer suggestions to normalize future studies on bias in NLP. Chen et al. (2020a) summarize and organize the works on bias and debiasing in recommender systems, and discuss future directions in this field.

4.4.2. Tools

In recent years, some organizations and individual researchers have provided multi-featured toolkits and repositories to facilitate fair AI. The repository Responsibly (Louppe et al., 2016) collects the datasets and measurements for evaluating bias and fairness in classification and NLP tasks. The project FairTest (Tramer et al., 2017) provides an unwarranted associations (UA) framework to discover unfair user treatment in data-driven algorithms. AIF360 (Bellamy et al., 2018) collects popular datasets for fairness study and provides implementations of common debiasing methods for binary classification. Aequitas (Saleiro et al., 2018) is released as an audit toolkit to test the bias and fairness of models for multiple demographic groups on different metrics. The repository Fairness Measurements provides datasets and code for quantitatively measuring discrimination in classification and ranking tasks.

4.5. Future Directions

Fairness research still possesses a number of outstanding challenges:

  • Trade-off between fairness and performance. Studies on fairness in different fields have confirmed the existence of the trade-off between fairness and performance of an algorithm (Corbett-Davies et al., 2017; Prost et al., 2019; Berk et al., 2021). The improvement of the fairness of an algorithm typically comes at the cost of performance degradation. Since both fairness and performance are indispensable, extensive research is needed to help people better understand an algorithm’s trade-off mechanism between them, so that practitioners can adjust the balance between them in practical usage based on the actual demand;

  • Precise conceptualization of fairness. Although plenty of research works have been conducted to study bias and fairness in AI, many of them formulate their problems under a vague “bias” concept that refers to any harmful system behaviors to humans, but fail to provide a precise definition of bias or fairness that is specific to their setting (Blodgett et al., 2020). In fact, different forms of bias can appear in different tasks, and even in the same task. For example, in a recommender system, popularity bias can exist towards both the users and items (Chen et al., 2020a). In a toxicity detection algorithm, race bias can exist towards both the people mentioned in texts and the authors of texts (Liu et al., 2021a). For any fairness problem to study, a precise definition of bias that describes how, to whom, and why an algorithm can be harmful needs to be articulated. In this way, we can make the research on AI fairness in the whole community more standardized and systematic;

  • From equality to equity. Fairness definitions are often associated with equality, to ensure that an individual or a conserved group, such as one defined by race or gender, is given similar amounts of resources, consideration, and results. Nonetheless, the area of equity, which pertains to the particular resources an individual or a conserved group needs to be successful (Gooden, 2015), has been heavily under-examined (Mehrabi et al., 2019). Equity remains an interesting future direction, as the exploration of this definition can extend or contradict existing definitions of fairness in machine learning.

5. Explainability

The improved predictive performance of AI systems has often been achieved through increased model complexity (Doshi-Velez and Kim, 2017; Molnar, 2020). A prime example is the paradigm of deep learning, which lies at the heart of most state-of-the-art AI systems. However, deep learning models are treated as black boxes, since most of them are too complicated and opaque to be understood and are developed without explainability in mind (Linardatos et al., 2021). More importantly, without explaining the underlying mechanisms behind the predictions, deep models cannot be fully trusted, which prevents their use in critical applications pertaining to ethics, justice, and safety, such as healthcare (Miotto et al., 2018), autonomous cars (Levinson et al., 2011), and so on. Therefore, building a trustworthy AI system requires understanding how particular decisions are made (Forum, 2020), which has led to the revival of the field of eXplainable Artificial Intelligence (XAI for short). In this section, we aim to provide intuitive understanding and high-level insights on the recent progress of explainable AI. First, we provide the concepts and taxonomy regarding explainability in AI. Second, we review representative explainable techniques for AI systems according to the aforementioned taxonomy. Afterward, we introduce the real-world applications of explainable AI techniques. Finally, we provide some surveys and tools, and discuss future opportunities on explainable AI.

5.1. Concepts and Taxonomy

In this subsection, we first introduce the concepts of explainability in AI. Afterward, we provide a taxonomy of different explanation techniques.

5.1.1. Concepts

In the machine learning and AI literature, explainability and interpretability are usually used interchangeably by researchers (Molnar, 2020). One of the most popular definitions of explainability is the one from Doshi-Velez and Kim, who define it as “the ability to explain or to present in understandable terms to a human” (Doshi-Velez and Kim, 2017). Another popular definition is from Miller, who defines explainability as “the degree to which a human can understand the cause of a decision” (Miller, 2019). In general, the higher the explainability of an AI system is, the easier it is for someone to comprehend how certain decisions or predictions have been made. Meanwhile, a model is more explainable than other models if its decisions are easier for a human to comprehend than those of the others.

While explainable AI and interpretable AI are very closely related, some subtle differences between them are discussed in some studies (Rudin, 2019; Gilpin et al., 2018; Zhang and Chen, 2018).

  • A model is “interpretable” if the model itself is capable of being understood by humans on its predictions. When looking at the model parameters or a model summary, humans can understand exactly how it made a certain prediction/decision and, given a change in input data or algorithmic parameters, can predict what is going to happen. In other words, such models are intrinsically transparent and interpretable, rather than black-box/opaque models. Examples of interpretable models include decision trees and linear regression.

  • An “explainable” model indicates that additional (post hoc) explanation techniques are adopted to help humans understand why it made a certain prediction/decision, although the model is still black-box and opaque. Note that such explanations are often not reliable and can be misleading. Examples of such models are those based on deep neural networks, which are usually too complicated for any human to comprehend.

5.1.2. Taxonomy

Techniques for AI’s explanation can be grouped according to various criteria.

  • Model usage: model-intrinsic and model-agnostic. If the application of an interpretable technique is restricted to a specific architecture of an AI model, then the technique is called a model-intrinsic explanation. In contrast, techniques that can be applied to any algorithm are called model-agnostic explanations.

  • Scope of Explanation: local and global. If the method provides an explanation only for a specific instance, then it is a local explanation, and if the method explains the whole model, then it is a global explanation.

  • Differences in the methodology: gradient-based and perturbation-based. If the techniques employ the partial derivatives on input instances to generate attributions, then these techniques are called gradient-based explanation, and if the techniques focus on the changes or modifications of input data, we name them as perturbation-based explanation.

  • Explanation via Other Approaches: Counterfactual Explanations. We present other explanation techniques that cannot be easily categorized with the previous groups. A counterfactual explanation usually refers to a causal situation in the form: “If X had not occurred, Y would not have occurred”. In general, counterfactual explanation methods are model-agnostic and can be used to explain predictions of individual instances (local) (Wachter et al., 2017).

5.2. Methods

In this subsection, we introduce some representative explanation techniques according to the aforementioned taxonomy.

Figure 4. The interpretation of a decision tree is simple: intermediate nodes in the tree represent decisions and leaf nodes can be class labels. Following the path from the root node to a leaf node explains how the decision tree model arrives at a certain label. (Image Credit: (Molnar, 2020))

5.2.1. Model usage

Any explainable algorithm that depends on the model architecture falls into the model-intrinsic category. In contrast, model-agnostic methods are generally applicable to any model. In general, there is significant research interest in developing model-agnostic methods to explain the predictions of an existing well-performing neural network model. This criterion can also be used to distinguish whether interpretability is achieved by restricting the complexity of the AI model. Intrinsic interpretability refers to AI models that are considered interpretable (white-box) due to their simple model architecture, while most model-agnostic explanations are applied to (black-box) deep neural networks, which are highly complicated and opaque due to their millions of parameters.

  • Model-intrinsic Explanations

    The models in this category are often called intrinsic, transparent, or white-box explanations. Generally, without designing an additional explanation algorithm, this type of interpretable technique cannot be re-used by other classifier architectures. Therefore, model-intrinsic explanation methods are inherently model-specific. Commonly used interpretable models include linear/logistic regression, decision trees, rule-based models, Generalized Additive Models (GAMs), Bayesian networks, etc.

    For example, the linear regression model (Bishop, 2006), as one of the most representative linear models in ML, aims to predict the target as a weighted sum of the features of an instance. With this linearity of the learned relationship, the linear regression model makes the estimation procedure simple and significantly understandable on a modular level (i.e., the weights) for humans. Mathematically, given one instance with $d$ dimensions of features $\mathbf{x}$, the linear regression model can be used to model the dependence of a predicted target $\hat{y}$ as follows:

    $$\hat{y} = \mathbf{w}^{\top}\mathbf{x} + b = w_1 x_1 + \dots + w_d x_d + b$$

    where $\mathbf{w}$ and $b$ denote the learned feature weights and the bias term, respectively. The predicted target $\hat{y}$ of linear regression is a weighted sum of its $d$ dimension features $\mathbf{x}$ for any instance, where the decision making procedure is easy for a human to comprehend by inspecting the value of the learned feature weights $\mathbf{w}$.

    Another representative method is decision tree (Quinlan, 1986), which contains a set of conditional statements arranged hierarchically. Making predictions in decision tree is also the procedure of explaining the model by seeking the path from the root node to leaf nodes (label), as illustrated in Figure 4.

    Figure 5. LIME explanations for a pre-trained image classification model. The top 3 predicted classes are “Electric Guitar” (p = 0.32), “Acoustic guitar” (p = 0.24) and “Labrador” (p = 0.21). (Image Credit: (Ribeiro et al., 2016))
    Figure 6. GNNExplainer generates an explanation by identifying a small subgraph of the input graph for graph classification task on molecule graphs dataset (MUTAG). (Image Credit:  (Ying et al., 2019))
    Figure 7. Image-specific class saliency maps were extracted using a single back-propagation pass through a DNN classification model. (Image Credit:  (Simonyan et al., 2013))
  • Model-agnostic Explanations

    The methods in this category are concerned with black-box well-trained AI models. More specifically, such methods do not try to create interpretable models, but to interpret already well-trained models. Such methods are widely used for explaining complicated models, such as deep neural networks. That is also why they sometimes are referred to as post-hoc or black-box explainability methods in the related scientific literature. The great advantage of model-agnostic explanation methods over model-intrinsic ones is their flexibility. Model-Agnostic methods are also widely applied in a variety of input modalities such as images, text, graph-structured data, etc. Note that model-agnostic methods can also be applied to intrinsically interpretable models.

    One of the most representative works in this category is Local Interpretable Model-Agnostic Explanations (LIME) (Ribeiro et al., 2016). For example, in the image domain, for any trained classifier, LIME fits a proxy model on randomly perturbed versions of a given instance to identify the importance of local contiguous super-pixels (groups of pixels) for the corresponding label. An illustrative example of LIME on a single instance for the top 3 predicted classes is shown in Figure 5.

    Additionally, to understand how any graph neural networks (GNNs) make a certain decision on graph-structured data, GNNExplainer learns soft masks for edges and node features to explain the predictions via maximizing the mutual information between the predictions of the original graph and those of the newly obtained graph (Ying et al., 2019; Luo et al., 2020). Figure 6 illustrates explanation examples generated by GNNExplainer for graph-structured data.
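The local proxy-model idea behind LIME can be sketched on tabular inputs as follows. This is a simplified illustration under stated assumptions, not the LIME library or the authors' implementation: `black_box` is a hypothetical opaque model, the Gaussian perturbation scale and the exponential proximity kernel are illustrative choices, and the surrogate is a plain weighted least-squares fit whose weights are read as attributions in the same way as the linear regression weights discussed earlier.

```python
import math
import random


def black_box(x):
    """Hypothetical opaque model: depends on x[0] and x[1], ignores x[2]."""
    return 1.0 / (1.0 + math.exp(-(3.0 * x[0] - 2.0 * x[1])))


def solve_linear_system(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - s) / m[r][r]
    return w


def lime_like_explain(model, x, n_samples=500, scale=0.3, kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear surrogate around x; return (weights, intercept)."""
    rng = random.Random(seed)
    d = len(x)
    rows, targets, proximity = [], [], []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x]           # perturbed neighbour of x
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        proximity.append(math.exp(-dist2 / kernel_width ** 2))  # closer samples weigh more
        rows.append([zi - xi for zi, xi in zip(z, x)] + [1.0])  # centred features + intercept
        targets.append(model(z))                                # query the black box
    k = d + 1
    # Weighted normal equations: (A^T P A) w = A^T P y
    ata = [[sum(p * r[i] * r[j] for p, r in zip(proximity, rows)) for j in range(k)]
           for i in range(k)]
    aty = [sum(p * r[i] * t for p, r, t in zip(proximity, rows, targets)) for i in range(k)]
    coef = solve_linear_system(ata, aty)
    return coef[:d], coef[d]


attributions, intercept = lime_like_explain(black_box, [0.2, 0.1, 0.5])
print(attributions)
```

The surrogate assigns a positive weight to the first feature, a negative weight to the second, and a weight close to zero to the third, which the hypothetical model ignores. For images, the same procedure would perturb super-pixels instead of adding noise to numeric features.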

Figure 8. Numerically computed images, illustrating the class appearance models. (Image Credit:  (Simonyan et al., 2013))

5.2.2. Scope of Explanation

One important aspect of dividing the explainability techniques is based on the scope of explanation, i.e., local or global.

  • Local Explanations

    In general, the goal of locally explainable methods is to express the individual feature attributions of a single instance of input data $x$ from the data population $X$. For example, given a text document and a model to understand the sentiment of text, a locally explainable model might generate attribution scores for individual words in the text.

    In the Saliency Map Visualization method (Simonyan et al., 2013), the authors compute the gradient of the output class category with regard to an input image. By visualizing the gradients, a fair summary of pixel importance can be achieved by studying the positive gradients, which have more influence on the output. An example of such saliency maps is shown in Figure 7.

  • Global Explanations

    The goal of global explanations is to provide insights for the decision of the model as a whole, and to have an understanding about attributions for a batch of input data or a certain label, not just for individual inputs. In general, globally explainable methods work on an array of inputs to summarize the overall behavior of the black-box model. Most of linear models, rule-based and tree-based models are inherently globally explainable. For example, conditional statements (intermediate nodes) in decision trees can give insights to how the model behaves in a global view, as shown in Figure 4.

    For DNN models, Class Model Visualization (Simonyan et al., 2013) generates a particular image visualization by maximizing the score of class probability with respect to the input image. An example of the class model is shown in Figure 8.
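The saliency map computation described under local explanations can be sketched on a tiny network. This is a minimal illustration assuming a hypothetical network with hand-picked weights (`W1`, `B1`, `W2`) and a manually written backward pass; in practice the gradient would come from a deep learning framework and the input would be an image rather than four numbers.

```python
import math

# Hypothetical fixed weights: 4 inputs -> 3 tanh units -> 1 class score.
# The third input has zero weight everywhere, so it cannot influence the score.
W1 = [[0.5, -1.0, 0.0, 0.2],
      [1.5, 0.3, 0.0, -0.4],
      [-0.7, 0.8, 0.0, 0.1]]
B1 = [0.1, -0.2, 0.0]
W2 = [1.0, -0.5, 2.0]


def forward(x):
    """Return the hidden activations of the tiny network."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, B1)]


def class_score(x):
    """Unnormalized class score for input x."""
    return sum(w * h for w, h in zip(W2, forward(x)))


def saliency(x):
    """Return |d score / d x_i| for each input, using one forward and one backward pass."""
    hidden = forward(x)
    # d score / d pre-activation_j = W2[j] * (1 - tanh^2)
    delta = [w * (1.0 - h * h) for w, h in zip(W2, hidden)]
    # d score / d x_i = sum_j delta_j * W1[j][i]
    grad = [sum(delta[j] * W1[j][i] for j in range(len(W1))) for i in range(len(x))]
    return [abs(g) for g in grad]


x = [0.3, -0.1, 0.9, 0.5]
sal = saliency(x)
print(sal)
```

Inputs with larger gradient magnitude are those for which a small change moves the class score the most; the unused third input receives a saliency of zero.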

Figure 9. Gradient-based Explanation: CAM model localizes class-specific image regions. (Image Credit:  (Zhou et al., 2016))
Figure 10. Gradient-based Explanation: Grad-CAM model highlights image regions considered to be important for producing the captions. (Image Credit:  (Selvaraju et al., 2017))

5.2.3. Differences in the methodology

This category is mainly defined by answering the questions: “What is the algorithmic approach? Does it focus on the input data instance or the model parameters?”. Based on the core algorithmic approach of the explanation method, we can categorize explanation methods into those that focus on the gradients of the target prediction with respect to the input data, and those that focus on changes or modifications of the input data.

  • Gradient-based Explanations

    In gradient-based methods, the explainable algorithm does one or more forward passes through the neural networks and generates attributions during the back-propagation stage utilizing partial derivatives of the activations. This method is the most straightforward solution and has been widely used in computer vision to generate human understandable visual explanations.

    Class activation mapping (CAM) (Zhou et al., 2016) replaces fully-connected layers with convolutional layers and global average pooling on CNN architectures, and localizes class-specific discriminative regions with a single forward pass. An illustrative example is provided in Figure 9. Afterward, Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al., 2017) generalizes the CAM model to any CNN model without requiring architectural changes or re-training, and utilizes the gradient signal flowing into the final convolutional layer of a CNN to highlight the important regions in the image. Figure 10 shows a visual explanation via Grad-CAM for an image captioning model.

  • Perturbation-based Explanations

    Perturbation-based explainable methods focus on variations in the input feature space to explain individual feature attributions towards the output class. More specifically, explanations are generated by iteratively probing a trained AI model with different variations of the inputs. These perturbations can be made on a feature level by replacing certain features with zero or random counterfactual instances, by picking one or a group of pixels (super-pixels) for explanation, or by blurring, shifting, or masking operations, etc. In general, a forward pass alone is enough to generate the attribution representations, without the need for back-propagating gradients. SHapley Additive exPlanations (SHAP) (Lundberg and Lee, 2017) visualizes feature interactions and feature importance by probing feature correlations via removing features in a game-theoretic framework.
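The game-theoretic idea of attributing a prediction by removing features can be illustrated with an exact, brute-force Shapley value computation. This is a sketch under stated assumptions rather than the SHAP library or its approximation algorithms: `model` is a hypothetical function, "removing" a feature is taken to mean replacing it with a baseline value, and enumerating all subsets is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def model(x):
    """Hypothetical model with an additive term and an interaction term."""
    return 2.0 * x[0] + x[1] * x[2]


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, with absent features set to their baseline value."""
    n = len(x)

    def value(present):
        # Evaluate the model with only the features in `present` taken from x.
        z = [x[i] if i in present else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                # Marginal contribution of feature i when added to this coalition.
                total += weight * (value(present | {i}) - value(present))
        phi.append(total)
    return phi


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
print(phi)
```

Only forward evaluations of the model are needed, and the attributions sum to the difference between the prediction at `x` and at the baseline. The interaction term is split equally between the two features that take part in it.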

5.2.4. Explanation via Other Approaches: Counterfactual Explanations

In general, counterfactual explanations have been designed to answer hypothetical questions and describe how altering the feature values of an instance would change the prediction to a predefined output (Molnar, 2020; Mittelstadt et al., 2019). Taking a credit card application as an example, Peter gets rejected by an AI banking system and wonders why his application was rejected. To answer the question of “why”, counterfactual explanations can be formulated as “What would have happened to this decision (from rejected to approved) if minimal changes were made to the feature values (e.g., income, age, race, etc.)?”. In fact, counterfactual explanations are usually human-friendly, since they are contrastive to the current instances and usually focus on a small number of features. To generate counterfactual explanations, Wachter et al. (2017) propose a Lagrangian-style constrained optimization as follows:

$$\arg\min_{\mathbf{x}'} \max_{\lambda} \; \lambda \cdot \left(f(\mathbf{x}') - y'\right)^2 + d(\mathbf{x}, \mathbf{x}')$$

where $\mathbf{x}$ is the original instance feature, and $\mathbf{x}'$ is the corresponding counterfactual input. $f(\mathbf{x}')$ is the predicted result of a classifier. The first term is the quadratic distance between the model prediction for the counterfactual $\mathbf{x}'$ and the targeted output $y'$. The second term indicates the distance $d(\cdot, \cdot)$ between the instance $\mathbf{x}$ to be explained and the counterfactual $\mathbf{x}'$. And $\lambda$ is proposed to achieve the trade-off between the distance in prediction and the distance in feature values.
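The counterfactual search described above can be sketched with a small gradient-descent loop. This is a simplified illustration under several assumptions that are not part of the original formulation: the credit model `predict_approval` and its two scaled features are hypothetical, the trade-off weight `lam` is held fixed instead of being maximized, the feature distance is the squared Euclidean distance, and gradients are estimated by finite differences.

```python
import math


def predict_approval(x):
    """Hypothetical credit model: approval probability from (income, debt) in scaled units."""
    return 1.0 / (1.0 + math.exp(-(2.0 * x[0] - 1.5 * x[1] - 1.0)))


def loss(x_cf, x, target, lam):
    """lam * (prediction - target)^2 + squared distance between x_cf and x."""
    pred_term = (predict_approval(x_cf) - target) ** 2
    dist_term = sum((a - b) ** 2 for a, b in zip(x_cf, x))
    return lam * pred_term + dist_term


def find_counterfactual(x, target=1.0, lam=10.0, lr=0.05, steps=2000, eps=1e-5):
    """Minimize the loss over the counterfactual input, starting from the original x."""
    x_cf = list(x)
    for _ in range(steps):
        grad = []
        for i in range(len(x_cf)):
            up = x_cf[:]
            up[i] += eps
            down = x_cf[:]
            down[i] -= eps
            # Central finite-difference estimate of d loss / d x_cf[i].
            grad.append((loss(up, x, target, lam) - loss(down, x, target, lam)) / (2 * eps))
        x_cf = [v - lr * g for v, g in zip(x_cf, grad)]
    return x_cf


x = [0.2, 0.8]                 # rejected applicant: low income, high debt
x_cf = find_counterfactual(x)  # nearby input that the model would approve
print(predict_approval(x), predict_approval(x_cf), x_cf)
```

The returned point raises income and lowers debt just enough to move the prediction past the approval threshold. A larger `lam` pushes the prediction closer to the target at the cost of a counterfactual that lies further from the original instance.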

| Representative Models | Model Usage | Scope | Methodology |
|---|---|---|---|
| Linear model | Intrinsic | Global | - |
| LIME (Ribeiro et al., 2016) | Agnostic | Both | Perturbation |
| CAM (Zhou et al., 2016) | Agnostic | Local | Gradient |
| Grad-CAM (Selvaraju et al., 2017) | Agnostic | Local | Gradient |
| SHAP (Lundberg and Lee, 2017) | Agnostic | Both | Perturbation |
| Saliency Map Visualization (Simonyan et al., 2013) | Agnostic | Local | Gradient |
| GNNExplainer (Ying et al., 2019) | Agnostic | Local | Gradient |
| Class Model Visualization (Simonyan et al., 2013) | Agnostic | Local | Gradient |

Surveys: (Doshi-Velez and Kim, 2017; Guidotti et al., 2018; Zhang and Chen, 2018; Miller, 2019; Du et al., 2019; Molnar, 2020; Belle and Papantonis, 2020; Tjoa and Guan, 2020; Arrieta et al., 2020; Yuan et al., 2020b; Jiménez-Luna et al., 2020; Danilevsky et al., 2020; Linardatos et al., 2021)

Table 4. Summary of Published Research in Explainability of AI Systems.

5.3. Applications in Real Systems

In this subsection, we discuss representative real-world applications where explainability is crucial.

5.3.1. Recommender Systems

Recommender systems (RecSys) have become increasingly important in our daily lives since they play an important role in mitigating the information overload problem (Fan et al., 2019c, 2020). These systems provide personalized information to help human decisions and have been widely used in various user-oriented online services (Fan et al., 2021), such as e-commerce item recommendations for everyday shopping (e.g., Amazon, Taobao), job recommendations for employment markets (e.g., LinkedIn), and friend recommendations to make people better connected (e.g., Facebook, Weibo) (Fan et al., 2019a, b). Recent years have witnessed the great development of deep learning based recommendation models, in terms of improved accuracy and broader application scenarios (Fan et al., 2019a). Thus, increasing attention has been paid to understanding why certain items have been recommended to end users by deep learning based recommender systems, because providing good explanations of personalized recommendations can motivate users to interact with items, help users make better and/or faster decisions, and increase users’ trust in intelligent recommender systems (Ma et al., 2019a; Zhang et al., 2014). For example, to achieve explainability in recommender systems, RuleRec (Ma et al., 2019a) proposes a joint learning framework for accurate and explainable recommendations by integrating the induction of several explainable rules from item associations, such as Also view, Buy after view, Also buy, and Buy together. The work (Wang et al., 2018) proposes a tree-enhanced embedding method that seamlessly combines embedding-based methods with decision tree-based approaches, where gradient boosting decision trees (GBDT) and an easy-to-interpret attention network are introduced to make the recommendation process fully transparent and explainable from the side information of users and items.

5.3.2. Drug discovery

In the past few years, explainable AI has been proven to significantly accelerate the process of computer-assisted drug discovery (Vamathevan et al., 2019; Jiménez-Luna et al., 2020), such as molecular design, chemical synthesis planning, protein structure prediction, and macromolecular target identification. For example, explanations of graph neural networks have been conducted on a set of molecular graphs labeled for their mutagenic effect on the Gram-negative bacterium Salmonella typhimurium, with the goal of identifying several known mutagenic functional groups such as $\mathrm{NH}_2$ and $\mathrm{NO}_2$ (Ying et al., 2019; Luo et al., 2020; Yuan et al., 2020a). A recent work (Preuer et al., 2019) studies how the interpretation of filters within message-passing networks can lead to the identification of relevant toxicophore- and pharmacophore-like sub-structures, so as to help increase the reliability of such models and foster their acceptance and usage in drug discovery and medicinal chemistry projects.

5.3.3. Natural Language Processing (NLP)

As one of the most broadly applied areas of AI, Natural Language Processing (NLP) investigates the use of computers to process or to understand human (i.e., natural) languages (Deng and Liu, 2018). Applications of NLP are everywhere, including dialogue systems, text summarization, machine translation, question answering, sentiment analysis, information retrieval, etc. Recently, deep learning approaches have obtained very promising performance across many different NLP tasks, which comes at the expense of models becoming less explainable (Danilevsky et al., 2020; Mullenbach et al., 2018). To address the issue, CAML (Mullenbach et al., 2018) employs an attention mechanism to select the segments of clinical text that are most relevant for medical codes (ICD). LIME (Ribeiro et al., 2016) learns surrogate models by generating random perturbations of an input x, so as to explain the output for x in text classification with SVM models.
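Perturbing a text input to obtain word-level attributions can be illustrated with a leave-one-word-out sketch. This is a simpler relative of the perturbation idea used by LIME, not LIME itself, and it rests on assumptions: `sentiment_score` is a hypothetical stand-in for a black-box sentiment model, and `LEXICON` is a toy word list chosen for illustration.

```python
# Toy sentiment lexicon (hypothetical values, for illustration only).
LEXICON = {"great": 2.0, "good": 1.0, "bad": -1.0, "terrible": -2.0}


def sentiment_score(tokens):
    """Hypothetical black-box model: lexicon sum where 'not' flips the next word."""
    score, negate = 0.0, False
    for tok in tokens:
        weight = LEXICON.get(tok, 0.0)
        score += -weight if negate else weight
        negate = (tok == "not")
    return score


def word_attributions(tokens):
    """Attribution of each word = score of full text minus score with that word removed."""
    base = sentiment_score(tokens)
    return [
        (tok, base - sentiment_score(tokens[:i] + tokens[i + 1:]))
        for i, tok in enumerate(tokens)
    ]


tokens = "the plot was not good but the acting was great".split()
attrs = word_attributions(tokens)
for tok, score in attrs:
    print(f"{tok:>7s} {score:+.1f}")
```

Words whose removal changes the score receive non-zero attributions. Here "great" contributes positively, while "not" and the negated "good" contribute negatively, and function words receive zero.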

5.4. Surveys and Tools

In this subsection, we introduce existing surveys, tools, and repositories on explainability in AI to help readers further explore this field.

5.4.1. Surveys

In the book (Molnar, 2020), the author focuses on interpretable machine learning, covering topics from fundamental concepts to advanced interpretable models. For example, it first details related concepts of interpretability, followed by intrinsically interpretable models, such as linear regression, decision trees, rule-based methods, etc. Afterward, the book provides general model-agnostic tools for interpreting black-box models and explaining individual predictions. Doshi-Velez and Kim (2017) raise the importance of interpretability in machine learning and provide a comprehensive survey of this field. There are surveys (Gilpin et al., 2018; Belle and Papantonis, 2020; Guidotti et al., 2018; Linardatos et al., 2021; Du et al., 2019; Arrieta et al., 2020) summarizing explanation approaches in machine learning. In addition, there also exist comprehensive surveys for specific applications, such as recommender systems (Zhang and Chen, 2018), medical information systems (Tjoa and Guan, 2020), natural language processing (Danilevsky et al., 2020), graph neural networks (Yuan et al., 2020b), drug discovery (Jiménez-Luna et al., 2020), etc.

5.4.2. Tools

In this subsection, we introduce several popular toolkits for explainable AI that are open-sourced on GitHub.

AIX360 (AI Explainability 360) (Arya et al., 2020) is an open-source Python toolkit featuring state-of-the-art explainability methods and some evaluation metrics. Meanwhile, AIX360 also provides educational materials for non-technical stakeholders to quickly get familiar with interpretation and explanation methods.

InterpretML (Nori et al., 2019) is also an open-source Python toolkit which exposes machine learning interpretability algorithms to practitioners and researchers. InterpretML exposes two types of interpretability: glassbox, for machine learning models with model-intrinsic explanations, and black-box explainability techniques for explaining any existing AI systems. The package DeepExplain (Ancona et al., 2018) mainly supports various gradient-based techniques and perturbation-based methods.

5.5. Future Directions

In this subsection, we discuss potential directions for future research in explainable AI. Since the interpretability of AI is a relatively new and still developing area, many open problems need to be considered.

  • Security of interpretable AI. Recent studies have demonstrated that, due to their data-driven nature, explanations of AI models are vulnerable to malicious manipulation. Attackers attempt to generate adversarial examples that not only mislead a target classifier but also deceive its corresponding interpreter (Zhang et al., 2020c; Ghorbani et al., 2019). This naturally raises security concerns about interpretations. Therefore, how to defend against adversarial attacks on interpretation is an important future direction.

  • Evaluation Methodologies. Evaluation metrics are crucial for studying explanation methods. However, due to the lack of ground truths and the subjectivity of human understanding, evaluating whether the explanations for certain predictions are reasonable and correct is becoming intractable. The widely used evaluation methodology is based on human evaluations through visualizing the explanations, which is time-consuming and biased by subjective human understanding. Although there are some initial studies on the evaluation of interpretability (Doshi-Velez and Kim, 2017), it is still unclear how to answer the question “What is a good explanation?”. It is crucial to investigate qualitative and quantitative evaluations of interpretability.

  • Knowledge of the target model: from white-box to black-box. Most existing explanation techniques require full knowledge of the explained AI system (denoted as white-box). However, knowledge regarding target AI systems is often limited in many scenarios because of privacy and security concerns. Therefore, an important direction is to understand how explanations can be generated for the decisions of black-box systems.

6. Privacy

The success of modern AI systems is built upon data, and data might contain private and sensitive information, from credit card data to medical records and from social relations to family trees. To establish trustworthy AI systems, we must guarantee the safety of private and sensitive information carried by the data and models, which could potentially be exposed throughout the AI system. Therefore, increasing attention has been paid to the protection and regulation of data privacy. From a legal perspective, laws from the state level to the global level have started to provide mandatory regulations for data privacy. For instance, the California Consumer Privacy Act (CCPA) was signed into law in 2018 to enhance privacy rights and consumer protection in California by giving consumers more control over the personal information that businesses collect; the Health Insurance Portability and Accountability Act (HIPAA) was created in 1996 to protect individual healthcare information by requiring authorization before disclosing personal healthcare information; and the European Union announced the General Data Protection Regulation (GDPR) to protect data privacy by giving individuals control over the collection and usage of their personal data.

From the perspective of science and technology, most AI technologies did not treat privacy as a fundamental requirement when they were first developed. To make modern AI systems trustworthy in privacy protection, a subfield of AI, privacy-preserving machine learning (PPML), has set privacy protection as the priority and has started to pioneer principled approaches for preserving privacy in machine learning. Specifically, researchers uncover the vulnerabilities of existing AI systems through comprehensive studies and develop promising technologies to mitigate these vulnerabilities. In this section, we provide a summary of this promising and important field. The basic concepts and taxonomy are discussed first, and the risk of privacy breaches is explained through various privacy attack methods. Mainstream privacy-preserving technologies such as confidential computing, federated learning, and differential privacy are then covered, followed by discussions on applications in real systems, existing surveys and tools, and future directions.

6.1. Concepts and Taxonomy

In the context of privacy protection, the adversarial goal of an attacker is to extract information about the data or machine learning models. According to the information accessible to the adversary, the attacker can be categorized as white-box or black-box. In a white-box setting, we assume that the attacker has all information except the data that we try to protect and that the attacker aims to obtain. In a black-box setting, the attacker has very limited information, such as the query results returned by the model. According to when the attack might happen, the privacy breach can happen in the training phase or the inference phase. In the training phase, the adversary might be able to directly access or infer information about the training data when she can inspect or even tamper with the training process. In the inference phase, the adversary might infer the input data of the model by inspecting the output characteristics. According to the capability of the adversary, the attacker can be honest-but-curious or fully malicious. An honest-but-curious attacker can inspect and monitor the training process, while a fully malicious attacker can further tamper with the training process. These taxonomies are not mutually exclusive since they view the attacker from different perspectives.

6.2. Methods

We first highlight the risk of privacy leakage by introducing some representative privacy attack methods. Then, we introduce some mainstream privacy-preserving techniques.

6.2.1. Privacy Attack

Privacy attacks can target training data, input data, properties of the data population, and even the machine learning model itself. We introduce some representative privacy attacks to reveal the risk of privacy breaches.

Membership Inference Attack. To investigate how machine learning models leak information about individual records within the training data, the membership inference attack aims to identify whether a data record was used in the training of a machine learning model. For instance, given black-box access to the model, an inference model can be trained to recognize the differences between a target model’s predictions on inputs that were used in its training and on inputs that were not (Shokri et al., 2017). Empirically, it is shown that commonly used classification models can be vulnerable to membership inference attacks. Therefore, private information can be inferred if some user data (e.g., medical records and credit card data) is used to train the model. Please refer to the survey (Hu et al., 2021) for a comprehensive summary of membership inference attacks.
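As a rough illustration of why overfitted models leak membership, the following sketch uses a confidence-threshold heuristic, which is simpler than the trained inference model of (Shokri et al., 2017) and is not their method. The toy data, the threshold of 0.9, and the names `guess_member` and `confidence_on_true_label` are assumptions made for this example; the labels are random on purpose, so that the target model can only memorize its training records.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 200  # few records, many features: the model can memorize

# Random labels carry no signal, so any fit is pure memorization.
X_in, y_in = rng.normal(size=(n, d)), rng.integers(0, 2, size=n)    # members
X_out, y_out = rng.normal(size=(n, d)), rng.integers(0, 2, size=n)  # non-members

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Target model: logistic regression trained by gradient descent on members only.
w = np.zeros(d)
for _ in range(500):
    w -= 0.1 * X_in.T @ (sigmoid(X_in @ w) - y_in) / n

def confidence_on_true_label(X, y):
    """Confidence that the target model assigns to each record's true label."""
    p = sigmoid(X @ w)
    return np.where(y == 1, p, 1.0 - p)

def guess_member(X, y, threshold=0.9):
    """Attacker's rule: very confident predictions suggest a training record."""
    return confidence_on_true_label(X, y) > threshold

hits = guess_member(X_in, y_in).sum() + (~guess_member(X_out, y_out)).sum()
attack_accuracy = hits / (2 * n)
print(f"attack accuracy: {attack_accuracy:.2f} (0.50 = random guessing)")
```

The attacker here only needs the model's output confidence and the record's label, which matches the black-box setting described above; a well-regularized model would narrow the confidence gap that the heuristic exploits.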

Model Inversion Attack. Model inversion attack (Fredrikson et al., 2014, 2015) aims to use the model’s output to infer the information of the input data which often contains sensitive and private information. For instance, in pharmacogenetics, machine learning models are used to guide medical treatments given the patient’s genotype and demographic information. However, it is revealed that there exists severe privacy risk because the patient’s genetic information can be disclosed given the model and the patient’s demographic information (Fredrikson et al., 2014). In facial recognition with neural networks, the images of people’s faces can be recovered given their names, the prediction confidence values, and the access to the model (Fredrikson et al., 2015). In (Zhang et al., 2020d), generative adversarial networks (GANs) are used to guide the inversion process of neural networks and reconstruct high-quality face images from face recognition classifiers. In a recent study, researchers found that the input data can be perfectly recovered through the gradient information of neural networks (Zhu and Han, 2020). It highlights the privacy risk in distributed learning where gradient information needs to be transmitted while people used to believe that it could preserve data privacy.
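To make the gradient-leakage risk concrete, the sketch below shows a simple analytic special case rather than the optimization-based attack of (Zhu and Han, 2020): for a single example passing through a fully connected layer with a bias term, the weight gradient is the outer product of the bias gradient and the input, so the input can be read off directly. The layer sizes, the squared-error loss, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_private = rng.normal(size=5)            # the client's private input
W, b = rng.normal(size=(3, 5)), rng.normal(size=3)
target = rng.normal(size=3)

# Forward pass with squared-error loss L = 0.5 * ||W x + b - target||^2.
residual = W @ x_private + b - target

# Gradients that the client would transmit in distributed training.
grad_W = np.outer(residual, x_private)    # dL/dW = residual * x^T
grad_b = residual                         # dL/db = residual

# Attacker: any row i with grad_b[i] != 0 reveals x = grad_W[i] / grad_b[i].
i = int(np.argmax(np.abs(grad_b)))
x_recovered = grad_W[i] / grad_b[i]
print(np.allclose(x_recovered, x_private))
```

With mini-batches or deeper networks the recovery is no longer a one-line division, which is where iterative reconstruction methods come in.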

Property Inference Attack. Given the machine learning model, the property inference attack aims to extract global properties of the training dataset or training algorithm that the model owner does not intend to share. One example is to infer properties that only hold for a subset of the training data or a specific class of the training data. This type of attack might leak private statistical information about the population, and the learned property can be used to exploit the vulnerability of an AI system.

Model Extraction. An adversary aims to extract the model information by querying the machine learning model in a black-box setting such that he can potentially fully reconstruct the model or create a substitute model that closely approximates the target model (Tramèr et al., 2016). Once the model has been extracted, the black-box setting translates to the white-box setting, where other types of privacy attacks become much easier. Moreover, the model information typically constitutes intellectual property that should be kept confidential. For instance, ML-as-a-service (MLaaS) systems, such as Amazon AWS Machine Learning, Microsoft Azure Machine Learning Studio, and Google Cloud Machine Learning Engine, allow users to train models on their data and provide publicly accessible query interfaces on a pay-per-query basis. The confidential model contains users’ intellectual property but suffers from the risk of functionality stealing.
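A minimal sketch of model extraction, in the spirit of the equation-solving attacks discussed in (Tramèr et al., 2016), is given below for a logistic regression service that returns confidence scores. The `query` function stands in for a hypothetical prediction API, and the dimensions and secret parameters are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w_secret, b_secret = rng.normal(size=d), 0.3   # hidden inside the service

def query(X):
    """Hypothetical black-box API: returns only the confidence P(y=1 | x)."""
    return 1.0 / (1.0 + np.exp(-(X @ w_secret + b_secret)))

# Attacker: d + 1 queries give d + 1 linear equations in (w, b),
# because logit(p) = log(p / (1 - p)) = w . x + b.
X_q = rng.normal(size=(d + 1, d))
p = query(X_q)
logits = np.log(p / (1.0 - p))
A = np.hstack([X_q, np.ones((d + 1, 1))])
solution = np.linalg.solve(A, logits)
w_stolen, b_stolen = solution[:d], solution[d]
print(np.allclose(w_stolen, w_secret), np.isclose(b_stolen, b_secret))
```

Returning only the predicted label, or rounding the confidence values, would remove the exact linear system used here, although it would not rule out other extraction strategies.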

6.2.2. Privacy Preservation

The privacy-preserving countermeasures can be roughly categorized into three mainstream and promising directions including confidential computing, federated learning, and differential privacy as shown in Figure 11. Confidential computing attempts to ensure data safety during transmission and computing. Federated learning provides a new machine learning framework which allows data to be local and decentralized and avoid raw data transmission. Differential privacy aims to utilize the information about a whole dataset without exposing individual information in the dataset. Next, we will review these techniques and discuss how they preserve privacy.

Figure 11. An Overview of Privacy Preserving Techniques

Confidential Computing. There are mainly three types of techniques for achieving confidential computing: Trusted Execution Environments (TEE) (Sabt et al., 2015), Homomorphic Encryption (HE) (Acar et al., 2018), and Secure Multi-party Computation (MPC) (Evans et al., 2018).

Trusted Execution Environments. Trusted Execution Environments focus on developing hardware and software techniques to provide an environment that isolates data and programs from the operating system, virtual machine manager, and other privileged processes. The data is stored in the trusted execution environment (TEE) such that it is impossible to disclose or operate on the data from outside. The TEE guarantees that only authorized code can access the protected data, and the TEE will deny the operation if the code is altered. As defined by the Confidential Computing Consortium (4), the TEE provides a level of assurance of data confidentiality, data integrity, and code integrity, which essentially states that unauthorized entities cannot view, add, remove, or alter the data while it is in use within the TEE and cannot add, remove, or alter code executing in the TEE.

Secure Multi-party Computation. Secure multi-party computation (MPC) protocols aim to enable a group of data owners who might not trust each other to jointly compute a function that depends on all of their private inputs without disclosing any participant’s private data. Although the concept of secure computation was primarily of theoretical interest when it was first proposed (Yao, 1982), it has now become a practical tool to enable privacy-preserving applications where multiple distrusting data owners seek to compute a function cooperatively (Evans et al., 2018).
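One of the simplest building blocks behind such protocols is additive secret sharing. The sketch below computes a joint sum under the assumption that all parties are honest-but-curious and communicate over private channels; the modulus, the function names `share` and `secure_sum`, and the salary example are illustrative choices, not part of any specific protocol cited above.

```python
import secrets

PRIME = 2**61 - 1  # public modulus; all arithmetic is done mod PRIME

def share(value, n_parties):
    """Split value into n_parties random shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def secure_sum(private_inputs):
    n = len(private_inputs)
    # Party i splits its input and sends its j-th share to party j.
    all_shares = [share(v, n) for v in private_inputs]
    # Each party j adds the shares it received and publishes only that partial sum.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)]
    return sum(partial_sums) % PRIME

salaries = [52_000, 61_500, 48_250]
total = secure_sum(salaries)
print(total == sum(salaries))
```

Each individual share is uniformly random, so a party learns nothing about another party's input from the shares it receives; only the final sum is revealed.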

Homomorphic Encryption. Homomorphic Encryption (HE) enables computing functions on data without accessing the plaintext by allowing mathematical operations to be performed on ciphertext without decryption. It returns the computation result in encrypted form, and decrypting that result yields the same value as if the computation had been performed on the plaintext data. With partially homomorphic encryption schemes, only certain operations can be performed, which limits them to specialized problems that can be reduced to the supported operations. Fully homomorphic encryption (FHE) schemes aim to provide support for a universal set of operations so that any finite function can be computed. The first FHE scheme was proposed by Gentry (Gentry, 2009) and is based on lattice-based cryptography. There has been considerable recent interest in implementing FHE schemes (Gentry and Halevi, 2011; Chillotti et al., 2016), but building a secure, deployable, and scalable system using FHE is still challenging.
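To illustrate what a partially homomorphic scheme offers, the following toy sketch implements the Paillier cryptosystem, which supports addition of plaintexts and multiplication by a known constant but not multiplication of two ciphertexts. Paillier is our choice of example and is not discussed in the works cited above; the primes are tiny and the code is insecure by design, so it should be read only as a demonstration of the homomorphic property.

```python
import math
import secrets

# Toy primes for illustration only; real deployments use much larger primes.
P, Q = 104_729, 1_299_709
N, N2 = P * Q, (P * Q) ** 2
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)                               # modular inverse (Python 3.8+)

def encrypt(m):
    """Encrypt an integer 0 <= m < N with fresh randomness r."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    # With generator g = N + 1, g^m mod N^2 simplifies to 1 + m * N.
    return (1 + m * N) * pow(r, N, N2) % N2

def decrypt(c):
    return (pow(c, LAM, N2) - 1) // N * MU % N

def add_encrypted(c1, c2):
    return c1 * c2 % N2   # multiplying ciphertexts adds the plaintexts

def scale_encrypted(c, k):
    return pow(c, k, N2)  # exponentiation multiplies the plaintext by k

c_sum = add_encrypted(encrypt(20), encrypt(22))
print(decrypt(c_sum), decrypt(scale_encrypted(encrypt(7), 6)))
```

A server holding only ciphertexts can therefore compute weighted sums, such as the aggregation of model updates, without ever seeing the underlying values.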

Federated Learning. Federated learning (FL), as shown in Figure 12, is a popular machine learning paradigm where many clients, such as mobile devices or sensors, collaboratively train machine learning models under the coordination of a central server, while keeping the training data from the clients decentralized (McMahan and others, 2021). This is in contrast with traditional machine learning settings where the data is first collected and transmitted to the central server for further processing. In federated learning, the machine learning models move between the server and the clients, while the private data stays local within the clients. Therefore, it essentially avoids the transmission of private data and significantly reduces the risk of privacy breaches.

Figure 12. Federated Learning

Next, we briefly describe a typical workflow for a federated learning system (McMahan and others, 2021):

  • Client selection: The server samples a subset of clients from those active clients according to some eligibility requirements.

  • Broadcast: The server broadcasts the current model and the training program to the selected clients.

  • Local computation: The selected clients locally compute the update to the received model based on the local private data. For instance, stochastic gradient descent (SGD) update can be run with the stochastic gradient computed based on local data and the model.

  • Aggregation: The server collects the updated local models from the selected clients and aggregates them as an updated global model.

This workflow represents one round of the federated learning algorithm and it will repeat until reaching specific requirements such as the convergence accuracy or performance certificates.
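The four steps above can be sketched as a small simulation of federated averaging on a linear regression task. This is a minimal sketch under simplifying assumptions: the clients are simulated in one process, local computation uses full-batch gradient steps rather than SGD, and the names `local_update`, `clients`, and `w_global` as well as all hyperparameters are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])

# Ten clients, each holding a private local dataset that never leaves the client.
clients = []
for _ in range(10):
    X = rng.normal(size=(50, 3))
    y = X @ w_true + 0.01 * rng.normal(size=50)
    clients.append((X, y))

def local_update(w, X, y, lr=0.1, steps=5):
    """Local computation: a few gradient steps on the client's own data."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

w_global = np.zeros(3)
for _ in range(30):
    # Client selection: sample a subset of the available clients.
    selected = rng.choice(len(clients), size=4, replace=False)
    # Broadcast and local computation: each selected client starts from w_global.
    local_models = [local_update(w_global, *clients[k]) for k in selected]
    sizes = [len(clients[k][1]) for k in selected]
    # Aggregation: average the local models, weighted by local dataset size.
    w_global = np.average(local_models, axis=0, weights=sizes)

print(np.round(w_global, 2))
```

Only model parameters cross the client boundary in this loop; the arrays `X` and `y` are read exclusively inside `local_update`.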

In addition to protecting data privacy by keeping the data local, there are many other techniques to further secure data privacy. For instance, we can apply lossy compression before transferring the models between server and clients such that it is not easy for the adversary to infer accurate information from the model update (Zhu and Han, 2020). We can also apply secure aggregation through secure multi-party computation such that no participant knows the local model information of other participants, but the global model can still be computed (Mohassel and Zhang, 2017; Agrawal et al., 2019). Additionally, we can apply noisy perturbation to provide differential privacy (Wei et al., 2020).

Federated learning is becoming an increasingly popular paradigm for privacy protection and has been studied, developed, and deployed in many applications. However, federated learning still faces many challenges, such as the efficiency and effectiveness of learning especially with non-IID data distributions (Li et al., 2020b, a; Khaled et al., 2020; Karimireddy et al., 2020; Liu et al., 2021b).

Differential Privacy. Differential Privacy (DP) is an area of research which aims to provide rigorous statistical guarantees for reducing the disclosure about individual information in a dataset (Dwork, 2008; Dwork et al., 2014). The major idea is to introduce some level of uncertainty through randomization or noise into the data such that the contribution of individual information is hidden while the algorithm can still leverage valuable information from the dataset as a whole. According to the definition (Dwork et al., 2014), let’s first define that the datasets $D$ and $D'$ are adjacent if $D'$ can be obtained from $D$ by altering the record of a single individual. A randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private if, for all $\mathcal{S} \subseteq \mathrm{Range}(\mathcal{A})$ and for all adjacent datasets $D$ and $D'$,

$$\Pr[\mathcal{A}(D) \in \mathcal{S}] \le e^{\epsilon} \Pr[\mathcal{A}(D') \in \mathcal{S}] + \delta.$$

$(\epsilon, \delta)$ quantifies how much information can be inferred about an individual from the output of the algorithm on the dataset. For instance, if $\epsilon$ and $\delta$ are sufficiently small, the output of the algorithm will be almost identical, i.e., $\Pr[\mathcal{A}(D) \in \mathcal{S}] \approx \Pr[\mathcal{A}(D') \in \mathcal{S}]$, such that it is difficult for the adversary to infer the information of any individual since the individual’s contribution on the output of the algorithm is nearly masked. The privacy loss incurred by the observation $\xi$ is defined as

$$\mathcal{L}^{\xi}_{\mathcal{A}, D, D'} = \ln\left(\frac{\Pr[\mathcal{A}(D) = \xi]}{\Pr[\mathcal{A}(D') = \xi]}\right).$$

$(\epsilon, \delta)$-differential privacy ensures that for all adjacent datasets $D$ and $D'$, the absolute value of the privacy loss is bounded by $\epsilon$ with probability at least $1 - \delta$. Some common methods to provide differential privacy include random response (Warner, 1965), Gaussian mechanism (Dwork et al., 2014), Laplace mechanism (Dwork et al., 2006), exponential mechanism (McSherry and Talwar, 2007), etc.
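As a small illustration of the Laplace mechanism mentioned above, the sketch below releases a noisy count. Altering one individual's record changes a count by at most 1, so adding Laplace noise with scale 1/epsilon yields epsilon-differential privacy with delta equal to 0. The function name `laplace_count`, the toy ages, and the choice of epsilon are assumptions made for this example.

```python
import numpy as np

def laplace_count(data, predicate, epsilon, rng):
    """Release a count with epsilon-DP; a counting query has sensitivity 1."""
    true_count = sum(1 for record in data if predicate(record))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(0)
ages = [23, 35, 41, 29, 62, 57, 38, 44]
noisy = laplace_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
print(f"noisy count of people aged 40+: {noisy:.2f} (true count is 4)")
```

A smaller epsilon gives a stronger privacy guarantee but a noisier answer, which is the utility and privacy trade-off that recurs throughout this section. Note that every additional release consumes more of the privacy budget.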

6.3. Applications in Real Systems

Privacy-preserving techniques have been widely used to protect sensitive information in real systems. In this subsection, we discuss some representative examples.

6.3.1. Healthcare

Healthcare data can be available from patients, clinical institutions, insurance companies, pharmacies, and so on. However, the privacy concern of personal healthcare information makes it difficult to fully exploit the large-scale and diverse healthcare data to develop effective predictive models for healthcare applications. Federated learning provides an effective privacy-preserving solution for such scenarios since the data across the population can be utilized while not being shared (120; Kaissis et al., 2020; Xu et al., 2021; Rieke et al., 2020; Zerka et al., 2020; Sheller et al., 2020). Differential privacy has also gained significant attention as a general way for protecting healthcare data (Dankar and El Emam, 2013).

6.3.2. Biometric Data Analysis

Biometric data is mostly non-revocable and can be used for identification and authentication. Therefore, it is critical to protect private biometric data. To this end, confidential computing, federated learning, and differential privacy techniques have been widely applied to protect people’s biometric data, such as face images, medical images, and fingerprint patterns (Bringer et al., 2013; Kaissis et al., 2020; Sadeghi et al., 2009; Wang et al., 2020a).

6.3.3. Recommender Systems

Recommender systems utilize users’ interactions with products such as movies, music, and goods to provide relevant recommendations. The rating information has been shown to expose users to inference attacks, leaking private user attributes such as age and gender (Shyong et al., 2006; Aïmeur et al., 2008; McSherry and Mironov, 2009). To protect user privacy, recent works (McSherry and Mironov, 2009; Kapralov and Talwar, 2013; Nikolaenko et al., 2013; Zhang et al., 2021) have developed privacy-preserving recommender systems via differential privacy.

6.3.4. Distributed Learning

In distributed learning, it is possible to recover a client’s original data from their gradient information or model update (Zhu and Han, 2020; Zhao et al., 2020). Differentially private SGD (Song et al., 2013; Abadi et al., 2016; Jagielski et al., 2020) provides an effective way to protect the input data by adding noise and has been popular in the training of deep learning models. Secure multi-party computing is applied to protect the information of locally trained models during aggregation (Mohassel and Zhang, 2017; Agrawal et al., 2019; Rouhani et al., 2018).
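The core update of differentially private SGD can be sketched as follows: per-example gradients are clipped to a fixed norm and Gaussian noise proportional to that norm is added before averaging. This is a minimal sketch for linear regression under our own naming (`dp_sgd_step`, `clip_norm`, `noise_multiplier`); it omits the privacy accounting that is needed to translate the noise level and the number of steps into a concrete privacy guarantee.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr, clip_norm, noise_multiplier, rng):
    """One noisy gradient step for linear regression with squared loss."""
    # Per-example gradients of 0.5 * (x . w - y)^2, shape (batch, dim).
    per_example = (X @ w - y)[:, None] * X
    # Clip each example's gradient to L2 norm at most clip_norm.
    norms = np.linalg.norm(per_example, axis=1, keepdims=True)
    clipped = per_example * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # Add Gaussian noise scaled to the clipping bound, then average.
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=w.shape)
    noisy_grad = (clipped.sum(axis=0) + noise) / len(y)
    return w - lr * noisy_grad

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for _ in range(200):
    w = dp_sgd_step(w, X, y, lr=0.1, clip_norm=1.0, noise_multiplier=0.5, rng=rng)
print(np.round(w, 2))
```

Clipping bounds how much any single record can move the model in one step, which is what allows the added noise to mask an individual's contribution.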

6.4. Surveys and Tools

We collect some surveys and tools relevant to privacy in AI systems for further exploration.

6.4.1. Surveys

The general concepts, threats, attack and defense methods in privacy-preserving machine learning are summarized in several surveys (Rigaki and Garcia, 2020; Al-Rubaie and Chang, 2019; De Cristofaro, 2020). Federated learning is comprehensively introduced in the papers (Yang et al., 2019; McMahan and others, 2021). Differential privacy is reviewed in the surveys (Dwork, 2008; Dwork et al., 2014; Ji et al., 2014).

6.4.2. Tools

Popular tools and repositories in federated learning include TensorFlow Federated (TFF) (310), FATE (119), FedML (He et al., 2020), PaddleFL (251) and LEAF (194). Popular tools in differential privacy include Facebook Opacus (249), TensorFlow-Privacy (311), OpenDP (246) and Diffpriv (Rubinstein and Alda, 2017). Keystone Enclave (Lee et al., 2020) is an open framework for architecting Trusted Execution Environments. Popular tools in Secure Multi-party Computing and Homomorphic Encryption are summarized in the lists (3; 2).

6.5. Future Directions

Confidential computing, federated learning, and differential privacy are three effective ways to improve privacy protection. However, they are still far from being extensively used and require more development. For instance, the computation efficiency and flexibility of confidential computing are not mature enough to support the applications of AI systems in our society. There are also great challenges in improving the efficiency and effectiveness of federated learning when deploying it in large-scale and heterogeneous environments. It is also desirable to achieve a better trade-off between utility and privacy loss in differential privacy. Most importantly, a versatile and reliable system design is needed for achieving privacy protection, and different techniques should be integrated to enhance the trustworthiness of AI systems.

7. Accountability & Auditability

In general, accountability for AI indicates how much we can trust these AI technologies and who or what we should blame if any part of the AI technologies performs below expectation. It is about a declaration of responsibility. It is not trivial to explicitly determine the accountability for AI. On the one hand, most AI-based systems act as “black boxes” due to the lack of explainability and transparency. On the other hand, real-world AI-based systems are very complex and involve numerous key components, including input data, algorithm theory, implementation details, real-time human control, and so on. These factors further complicate the determination of accountability for AI. Although difficult and complex, it is necessary to guarantee accountability for AI. Auditability is one of the most important methodologies in guaranteeing accountability, and it refers to a set of principled evaluations of the algorithm theories and implementation processes.

It is very important to achieve a balance between the accountability and innovation of AI. The overall aim is to let humans enjoy the benefits and conveniences of AI with a reliable and secure guarantee. We neither want to heavily burden the algorithm designer nor want to put too many restrictions on end-users of AI-based systems. In this section, we discuss the accountability and auditability of AI. First, we introduce the basic concept of accountability and some key roles in accountability. Next, we describe the definition of auditability for AI and two kinds of audits. Finally, we summarize existing surveys and tools, and discuss some future directions to enhance the accountability and auditability of AI.

7.1. Concepts and Taxonomy

In this subsection, we will introduce the key concepts and taxonomies of accountability and auditability for AI.

7.1.1. Accountability

Accountability for AI has a broad definition. On the one hand, accountability can be interpreted as a property of AI. From this perspective, accountability can be improved if breakthroughs can be made in the explainability of AI algorithms. On the other hand, accountability can refer to a clear distribution of responsibility, which focuses on who should take responsibility for which impact of AI-based systems. Here we mainly focus on discussing the second notion. As indicated above, it is not trivial to give a clear specification of responsibility, since the operation of an AI-based system involves many different parties, such as the system designer, the system deployer, and the end-user. Any improper operation by any party may result in system failure or potential risk. Also, all kinds of possible cases should be taken into consideration to make a fair distribution of responsibility. For instance, the case when the AI system does harm while working correctly and the case when the AI system does harm while working incorrectly should be considered differently (Martin, 2019; Yu et al., 2018). To better specify accountability, it is necessary to determine the roles and the corresponding responsibilities of different parties in the functioning of an AI system. In (Wieringa, 2020), three roles are proposed: decision-makers, developers, and users. By refining these three roles, we propose five roles and introduce their responsibilities and obligations as follows:

  • System Designers: system designers are the designers of the AI system. They are supposed to design an AI system that meets the user requirements and is transparent and explainable to the greatest extent. It is their responsibility to offer deployment instructions and user guidelines, and to disclose potential risks.

  • Decision Makers: decision-makers have the right to determine whether to build an AI system and which AI system should be adopted. Decision-makers should be fully aware of the benefits and risks of the candidate AI system, and take all the relevant requirements and regulations into overall consideration.

  • System Deployers: system deployers are in charge of deploying an AI system. They are supposed to follow the deployment instructions carefully and ensure that the system has been deployed appropriately. Also, they are expected to offer some hands-on tutorials to the end-users.

  • System Auditors: system auditors are responsible for system auditing. They are expected to provide comprehensive and objective assessments of the AI system.

  • End Users: end-users are the practical operators of an AI system. They are supposed to follow the user guidelines carefully and report emerging issues to system deployers and system designers in a timely manner.

7.1.2. Auditability

Auditability is one of the most important methodologies in ensuring accountability, which refers to a set of principled assessments from various aspects. In the IEEE standard for software development (1), an audit is defined as “an independent evaluation of conformance of software products and processes to applicable regulations, standards, guidelines, plans, specifications, and procedures”. Typically, audits can be divided into two categories as follows:

External audits: external audits (Green and Chen, 2019; Sandvig et al., 2014) refer to audits conducted by a third party that is independent of system designers and system deployers. External auditors are expected to share no common interest with the internal workers and are likely to provide some novel perspectives for auditing the AI system. Therefore, it is expected that external audits can offer a comprehensive and objective audit report. However, there exist obvious limitations in external audits. First, external auditors typically cannot access some important internal data of the AI system, such as the model training data and model implementation details (Burrell, 2016), which increases the auditing difficulty. Also, external audits are always conducted after an AI system is deployed, so it may be costly to adjust the system, and sometimes the system may have already done some harm (Moy, 2019).

Internal audits: internal audits refer to audits conducted by a group of people inside the system designer or system deployer organizations. SMACTR (Raji et al., 2020) is a recent internal auditing framework proposed by researchers from Google and Partnership on AI. It consists of five stages: scoping, mapping, artifact collection, testing, and reflection. Compared with external audits, internal audits can have access to a large amount of internal data, including the model training data and model implementation details, which makes internal audits much more convenient. Furthermore, internal audits can be conducted before an AI system is deployed, so they can prevent some potential harm after the system deployment. Also, the internal audit report can serve as an important reference for the decision-maker. However, an unavoidable shortcoming of internal audits is that the auditors share the same interest as the audited party, which makes it challenging to give an objective audit report.

7.2. Surveys and Tools

In this subsection, we summarize existing surveys and tools on the accountability and auditability of AI to help readers further explore this field.

7.2.1. Surveys

A recent work on algorithmic accountability is presented in (Wieringa, 2020). It takes Bovens’s definition of accountability (Bovens, 2007) as the basic concept and combines it with an extensive body of literature on algorithmic accountability to build the definition of algorithmic accountability.

7.2.2. Tools

The other five dimensions (safety & robustness, non-discrimination & fairness, explainability, privacy, environmental well-being) discussed in this survey are also important aspects to be evaluated during algorithm auditing. Therefore, most tools introduced in Sections 3.6, 4.4.2, 5.4.2, 6.4, and 8.3 can also be used for the purpose of auditing.

7.3. Future Directions

For accountability, it is important to further enhance the explainability of the AI system. Only when we have a deep and thorough understanding of its theory and mechanism, can we fully rely on it or make a well-recognized responsibility distribution scheme. For auditability, it is always a good option to conduct both external audits and internal audits, so that we can have a comprehensive and objective overview of an AI system. Furthermore, we need to be aware that an AI system is constantly dynamic. It can change with input data and environment. Thus, to make an effective and timely audit, it is necessary to audit the system periodically and update auditing principles with the system changes (Lins et al., 2019).

8. Environmental Well-being

A trustworthy AI system should be sustainable and environmentally friendly (Smuha, 2019). In fact, the large-scale development and deployment of AI systems brings a huge burden of energy consumption, which inevitably affects the environment. For example, Table 5 shows the carbon emissions (as an indicator of energy consumption) of training NLP models and those of daily consumption (Strubell et al., 2019). We find that training a common NLP pipeline with tuning and experimentation produces about the same carbon emissions as a human produces in 7 years. Training a large Transformer model with neural architecture search emits about 5 times as much carbon as a car does over its lifetime. Besides model development, other areas, such as data center cooling, also incur a huge energy cost. The rapid development of AI technology further challenges the tense global situation of energy shortage and environmental deterioration. Hence, environmental friendliness becomes an important issue to consider in building trustworthy AI. In this section, we review the existing works regarding the environmental impacts of AI technologies. Existing works mainly focus on the impact of the energy consumption of AI systems on the environment. We first present an overview of the strategies for reducing energy consumption, e.g., model compression, and then we introduce the works on estimating the energy consumption and evaluating the environmental impacts of real-world AI systems in different domains. Finally, we summarize the existing surveys and tools on this dimension.

Consumption                                  CO₂e (lbs)
Air travel, 1 passenger, NY↔SF                    1,984
Human life, avg, 1 year                          11,023
American life, avg, 1 year                       36,156
Car, avg incl. fuel, 1 lifetime                 126,000

Training one model (GPU)
NLP pipeline (parsing, SRL)                          39
    w/ tuning & experimentation                  78,468
Transformer (big)                                   192
    w/ neural architecture search               626,155

Table 5. Comparisons between estimated CO₂ emissions produced by daily lives and training NLP models. (Table Credit: (Strubell et al., 2019))
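The two comparisons quoted in the text (about 7 human-years and about 5 car lifetimes) can be checked directly against the rows of Table 5. The short sketch below only redoes that arithmetic; the dictionary keys are names chosen here for illustration, and the numbers are copied from the table.

```python
# CO2e values in lbs, copied from Table 5 (Strubell et al., 2019).
# The dictionary keys are illustrative names, not taken from the source.
TABLE_5_LBS = {
    "human_life_1yr": 11_023,
    "car_lifetime": 126_000,
    "nlp_pipeline_tuned": 78_468,     # NLP pipeline w/ tuning & experimentation
    "transformer_big_nas": 626_155,   # Transformer (big) w/ neural architecture search
}


def ratio(numerator: str, denominator: str) -> float:
    """Return how many times larger one Table 5 entry is than another."""
    return TABLE_5_LBS[numerator] / TABLE_5_LBS[denominator]


pipeline_in_human_years = ratio("nlp_pipeline_tuned", "human_life_1yr")
transformer_in_car_lifetimes = ratio("transformer_big_nas", "car_lifetime")

print(f"Tuned NLP pipeline  = {pipeline_in_human_years:.1f} human-years of emissions")
print(f"Transformer w/ NAS  = {transformer_in_car_lifetimes:.1f} car lifetimes of emissions")
# Prints 7.1 and 5.0 respectively.
```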

8.1. Methods

In this subsection, we summarize the techniques developed for reducing the energy use of AI algorithms. Improving the energy efficiency of AI systems involves algorithm-level and hardware-level solutions. We introduce two common classes of algorithm-level approaches, model compression and adaptive design, as well as hardware-level energy-saving methods.

8.1.1. Model Compression

Model compression is an active topic in deep learning that receives continuous attention from both academia and industry. It studies how to reduce the size of a deep model to save storage space and the energy consumed in training and deploying models, with an acceptable sacrifice in model performance. For CNN models in the image domain, parameter pruning and quantization (Han et al., 2015; Choi et al., 2016), low-rank factorization (Rigamonti et al., 2013; Jaderberg et al., 2014), transferred/compact convolutional filters (Cohen and Welling, 2016; Shang et al., 2016), and knowledge distillation (Hinton et al., 2015; Romero et al., 2014) have been proposed. Similarly, in the text domain, researchers borrow and extend these methods, including pruning (Cao et al., 2019; Michel et al., 2019), quantization (Cheong and Daniel, 2019; Hou and Kwok, 2018), knowledge distillation (Kim and Rush, 2016; Sun et al., 2019; Tang et al., 2019), and parameter sharing (Dehghani et al., 2018; Lan et al., 2019), to compress popular NLP models such as Transformer and BERT.
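To make pruning and quantization concrete, the following is a minimal sketch on a flat list of weights. It uses generic magnitude pruning and symmetric uniform quantization, which are common textbook variants and not necessarily the exact procedures of the works cited above. The function names, the 50% sparsity, and the 8-bit width are assumptions made for illustration.

```python
import random


def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(sparsity * len(weights))
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    # Ties at the threshold are pruned too, so slightly more than k may be zeroed.
    return [0.0 if abs(w) <= threshold else w for w in weights]


def quantize_uniform(weights, num_bits=8):
    """Map floats to signed integer codes in [-qmax, qmax] plus one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for one layer

pruned = magnitude_prune(weights, sparsity=0.5)
codes, scale = quantize_uniform(pruned, num_bits=8)
restored = dequantize(codes, scale)

num_zero = sum(1 for w in pruned if w == 0.0)
max_error = max(abs(a - b) for a, b in zip(pruned, restored))
print(f"zeroed weights: {num_zero} / {len(weights)}")
print(f"max quantization error: {max_error:.5f} (bound: scale / 2 = {scale / 2:.5f})")
```

Pruned weights need not be stored, and each remaining weight is stored as a small integer code plus one shared scale, which is where the savings in storage and memory traffic come from in this toy setting.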

8.1.2. Adaptive Design

Another line of research focuses on adaptively designing a model architecture to optimize the energy efficiency of a model. Yang et al. (2017) propose a pruning approach to design CNN architectures to achieve an energy-saving goal. In their method, the model is pruned in a layer-by-layer manner, where the layer that consumes the most energy is pruned first. Stamoulis et al. (2018) propose a framework to adaptively design CNN models for image classification under energy consumption restrictions. They formulate the design of a CNN architecture as a hyperparameter optimization problem and solve it by Bayesian optimization.
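The layer-by-layer ordering described for Yang et al. (2017) can be sketched as follows. This is only an illustration of the "most energy-hungry layer first" idea: the layer names, the per-layer energy estimates, and the fixed pruning ratio are hypothetical, and the energy model and pruning criterion of the original method are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    energy: float      # hypothetical estimated energy per inference (arbitrary units)
    num_weights: int


def pruning_order(layers):
    """Order layers so that the one consuming the most energy comes first."""
    return sorted(layers, key=lambda layer: layer.energy, reverse=True)


def prune_in_energy_order(layers, ratio=0.5):
    """Visit layers in energy order and record how many weights each one keeps.

    A real method would choose which weights to remove and fine-tune afterwards;
    this sketch only tracks the visiting order and the remaining weight counts.
    """
    plan = []
    for layer in pruning_order(layers):
        kept = layer.num_weights - int(ratio * layer.num_weights)
        plan.append((layer.name, kept))
    return plan


# Hypothetical network: the numbers are made up for illustration.
network = [
    Layer("conv1", energy=4.0, num_weights=1_000),
    Layer("conv2", energy=9.5, num_weights=8_000),
    Layer("fc", energy=2.1, num_weights=20_000),
]

plan = prune_in_energy_order(network, ratio=0.5)
for name, kept in plan:
    print(f"{name}: keep {kept} weights")
```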

8.1.3. Hardware

In addition to the algorithm level, efforts have also been made to improve the energy efficiency of AI through hardware design. Computing devices and platforms specially designed for AI applications have been proposed to maximize the training and inference efficiency of AI algorithms. Specifically, hardware designed for DNN models is called a DNN accelerator (Chen et al., 2020c). Esmaeilzadeh et al. (2012) design a neural processing unit (NPU) to execute the fixed computations of a neuron, such as multiplication, accumulation, and sigmoid, on chips. Later, Liu et al. (2015) proposed RENO, which is a more advanced on-chip architecture for neural network acceleration. There is also hardware designed for specific NN structures. Han et al. (2016) investigate how to design an efficient computation device for a sparse neural network, where weight matrices and feature maps are sparse. They (Han et al., 2017) also devise an efficient speech recognition engine that is dedicated to RNN models. Furthermore, ReGAN (Chen et al., 2018a) is developed to accelerate generative adversarial networks (GANs).

8.2. Applications in Real Systems

As described before, the environmental impacts of AI systems mainly come from energy consumption. In this subsection, we introduce the research works on evaluating and estimating the energy consumption of real-world AI systems in different domains.

In the field of computer vision, Li et al. (2016) first investigate the energy use of CNNs on image classification tasks. They provide a detailed comparison among different types of CNN layers, and also analyze the impact of hardware on energy consumption. Cai et al. (2017) introduce the framework NeuralPower which can estimate the power and runtime across different layers in a CNN, to help developers to understand the energy efficiency of their models before deployment. They also propose to evaluate CNN models with a novel metric “energy-precision ratio”. Based on it, developers can trade off energy consumption and model performance according to their own needs, and choose the appropriate CNN architecture. As for the field of NLP, Strubell et al. (2019) examine the carbon emissions of training popular NLP models, namely, Transformer, ELMo, and GPT-2, on different types of hardware, and shed light on the potential environmental impacts of NLP research and applications.

8.3. Surveys and Tools

In this subsection, we collect surveys and tools related to the dimension of environmental well-being.

8.3.1. Surveys

From the algorithm-level perspective, García-Martín et al. (2019) present a comprehensive survey on energy consumption estimation methods from both the computer architecture and machine learning communities. Mainly, they provide a taxonomy for the works in computer architecture and analyze the strengths and weaknesses of the methods in various categories. Cheng et al. (2017) summarize the common model compression techniques and organize them into four categories, and present detailed analysis on the performance, application scenarios, advantages and disadvantages of each category. As for the hardware-level perspective, Wang et al. (2020b) compare the performance and energy consumption of the processors from different vendors for AI training. Mittal and Vetter (2014) review the approaches for analyzing and improving GPU energy efficiency. The survey (Chen et al., 2020c) summarizes the latest progress on DNN accelerator design.

8.3.2. Tools

SyNERGY (Rodrigues et al., 2018) is a framework integrated with Caffe for measuring and predicting the energy consumption of CNNs. Lacoste et al. (2019) develop a Machine Learning Emissions Calculator as a tool to quantitatively estimate the carbon emissions of training an ML model, which can enable researchers and practitioners to better understand the environmental impacts caused by their models. Accelergy (Wu et al., 2019) and Timeloop (Parashar et al., 2019) are two representative energy estimation tools for DNN accelerators.
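As a rough illustration of what a carbon emissions estimate for model training involves, the sketch below multiplies the energy drawn by the hardware by a data center overhead factor and by the carbon intensity of the local electricity grid. This is a generic back-of-the-envelope formula, not the implementation of any of the tools above; the function name, parameter names, and all input values are assumptions chosen for illustration.

```python
def training_emissions_kg(power_watts, hours, pue, carbon_intensity_kg_per_kwh):
    """Estimate CO2e (kg) of a training run.

    power_watts: average power draw of the training hardware (assumed known)
    hours: wall-clock training time
    pue: power usage effectiveness, i.e. data center overhead factor (>= 1)
    carbon_intensity_kg_per_kwh: kg of CO2e emitted per kWh of electricity
    """
    energy_kwh = power_watts / 1000.0 * hours * pue
    return energy_kwh * carbon_intensity_kg_per_kwh


# Hypothetical run: one 300 W accelerator for 100 hours, with illustrative
# overhead and grid values that are not taken from the source.
example_kg = training_emissions_kg(
    power_watts=300.0,
    hours=100.0,
    pue=1.5,
    carbon_intensity_kg_per_kwh=0.4,
)
print(f"Estimated emissions: {example_kg:.1f} kg CO2e")
```

Because the estimate is a simple product, halving the training time or moving to a grid with half the carbon intensity halves the result, which is why such estimates are sensitive to where and on what hardware a model is trained.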

8.4. Future Directions

Research on reducing the energy consumption of AI systems for environmental well-being is on the rise. At the algorithmic level, automated machine learning (AutoML), which aims to automatically design effective and efficient model architectures for certain tasks, emerges as a novel direction in the AI community. Existing works in AutoML focus more on designing an algorithm to improve its performance, but do not usually treat saving energy as the highest priority. Using AutoML technologies to design energy-saving models needs further exploration in the future. At the hardware level, current research on DNN accelerators pays more attention to devising efficient deployment devices to facilitate model inference, while the procedure of model training is overlooked. The design of efficient customized training devices for various DNN models is a practical and promising direction to investigate in the future.

9. Interactions among Different Dimensions

An ideal trustworthy AI system should simultaneously satisfy the six dimensions discussed above. In reality, the six dimensions are not independent of each other. The satisfaction of one dimension can promote the pursuit of another dimension. Meanwhile, there exist conflicts among different dimensions. The realization of one dimension could violate another dimension, which makes it impossible for two or more dimensions to be met simultaneously in some scenarios. Researchers and practitioners should be aware of the complicated interactions among different dimensions. Knowing the accordance between two dimensions brings us an alternative idea to achieve one dimension: we can try to satisfy one dimension by realizing the other. Moreover, when two dimensions are contradictory, we can make a trade-off between them according to our needs. In this section, we discuss some known accordance and conflict interactions among different dimensions.

9.1. Accordance

Two dimensions are accordant when the satisfaction of one dimension can facilitate the achievement of the other, or the two dimensions promote each other. Next, we show two examples of accordance interactions among dimensions.

Robustness & Explainability. Studies show that deep learning models’ robustness towards adversarial attacks positively correlates with their explainability. Etmann et al. (2019) find that models trained with robustness objectives show more interpretable saliency maps. Specifically, they rigorously prove that Lipschitz regularization, which is commonly used for robust training, forces the gradients to align with the inputs. Noack et al. (2021) further investigate the opposite question: will an interpretable model be more robust? They propose Interpretation Regularization (IR) to train models with explainable gradients, and empirically show that a model can be more robust to adversarial attacks if it is trained to produce explainable gradients.

Fairness & Environmental Well-being. Fairness in the field of AI is a broad topic, which involves not only the fairness of AI service providers and users, but also the equality of AI researchers. As mentioned in Section 8, the development trend of deep learning models towards larger models and more computing resource consumption not only causes adverse environmental impact, but also aggravates the inequality of research (Strubell et al., 2019), since most researchers cannot afford high-performance computing devices. Hence, the efforts for ensuring the environmental well-being of AI techniques, such as reducing the cost of training large AI models, are in accordance with the fairness principle of trustworthy AI.

9.2. Conflict

Two dimensions are conflicting when the satisfaction of one dimension hinders the realization of the other. Next, we show three examples of the conflicting interactions among dimensions.

Robustness & Privacy. Recent studies find tensions between the robustness and the privacy requirements of trustworthy AI. Song et al. (2019) examine how robust training against adversarial attacks influences a model's vulnerability to membership inference attacks. They find that models trained with adversarial defense approaches are more likely to expose sensitive information in training data via membership inference attacks. The reason is that models trained to be robust to adversarial examples typically overfit to training data, which makes training data easier to detect from models’ outputs.

Robustness & Fairness. Robustness and fairness can also conflict with each other in particular scenarios. As discussed in Section 3, adversarial training is one of the mainstream approaches for improving the robustness of a deep learning model. Recent research (Xu et al., 2020a) indicates that adversarial training can introduce a significant disparity of performance and robustness among different groups, even if the datasets are balanced. Thus, the adversarial training algorithm improves the robustness of a model at the expense of its fairness. Accordingly, the work (Xu et al., 2020a) proposes a framework called Fair-Robust-Learning (FRL) to ensure fairness while improving a model’s robustness.

Fairness & Privacy. Cummings et al. (2019) investigate the compatibility of fairness and privacy of classification models, and theoretically prove that differential privacy and exact fairness in terms of equal opportunity are unlikely to be achieved simultaneously. By relaxing the condition, this work further shows that it is possible to find a classifier that satisfies both differential privacy and approximate fairness.
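For readers unfamiliar with the fairness notion involved, equal opportunity is commonly defined as equal true positive rates across groups. The sketch below measures the gap between group-wise true positive rates on a toy dataset. Reading "exact fairness" as a gap of zero and "approximate fairness" as a gap below a small tolerance is an assumption made here for illustration, as are the function names, the tolerance, and the data.

```python
def true_positive_rate(y_true, y_pred):
    """Fraction of actual positives that the classifier predicts as positive."""
    predictions_on_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not predictions_on_positives:
        raise ValueError("group has no positive examples")
    return sum(predictions_on_positives) / len(predictions_on_positives)


def equal_opportunity_gap(y_true, y_pred, group):
    """Largest difference in true positive rate between any two groups."""
    rates = {}
    for g in set(group):
        idx = [i for i, gi in enumerate(group) if gi == g]
        rates[g] = true_positive_rate(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
    return max(rates.values()) - min(rates.values())


# Toy data: labels, predictions, and group membership are made up.
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0]
group = ["a"] * 6 + ["b"] * 6

gap = equal_opportunity_gap(y_true, y_pred, group)
tolerance = 0.05  # hypothetical threshold for "approximate" fairness

print(f"equal opportunity gap: {gap:.2f}")              # 0.75 - 0.50 = 0.25
print(f"exactly fair: {gap == 0.0}")
print(f"approximately fair (gap <= {tolerance}): {gap <= tolerance}")
```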

10. Future Directions

In this survey, we elaborated on six of the most concerning and crucial dimensions an AI system needs to meet to be trustworthy. Beyond that, some dimensions have not received extensive attention, but are worth exploring in the future. In this section, we will discuss several other potential dimensions of trustworthy AI.

10.1. Human agency and oversight

The ethical guidelines for trustworthy AI proposed by different countries and regions all emphasize the human autonomy principle of AI technology (Smuha, 2019). Human autonomy prohibits AI agents from subordinating, coercing, or manipulating humans, and requires humans to keep self-determination over themselves. To achieve the principle of human autonomy, the design of AI systems should be human-centered. Specifically, human agency and oversight should be guaranteed in the development and deployment of AI systems. Human agency enables humans to make decisions independently based on the outputs of an AI system, instead of being totally subject to AI’s decisions. A desirable human agency technology encourages users to understand the mechanism of an AI system and enables users to evaluate and challenge the decisions of an AI system, and make better choices by themselves. Human oversight enables humans to oversee AI systems in their whole life cycle, from design to usage. It can be achieved through human-in-the-loop, human-on-the-loop, and human-in-command governance strategies.

10.2. Creditability

With the wide deployment of AI systems, people increasingly rely on content produced or screened by AI, such as an answer to a question given by a question-answering (QA) agent or a piece of news delivered by a recommender system. However, the integrity of such content is not always guaranteed. For example, an AI system that exposes users to misinformation should not be considered trustworthy. Hence, additional mechanisms and approaches should be incorporated in AI systems to ensure their creditability.

10.3. Interactions among Different Dimensions

As discussed in Section 9, different dimensions of trustworthy AI can interact with each other in an accordant or conflicting manner. However, the research on the interactions among different dimensions is still in an early stage. Besides the several instances shown in this paper, there are potential interactions between other dimension pairs remaining to be investigated. For example, people may be interested in the relationship between fairness and interpretability. In addition, the interaction formed between two dimensions can be different in different scenarios, which needs more exploration. For example, an interpretable model may promote its fairness by making its decision process transparent. On the contrary, techniques to improve the interpretability of a model may introduce a disparity of interpretability among different groups, which leads to a fairness problem. Although there are lots of problems to study, understanding the interactions among different dimensions is very important for us to build a trustworthy AI system.

11. Conclusion

In this paper, we present a comprehensive survey of trustworthy AI from a computational perspective. We clarify the definition of trustworthy AI from multiple perspectives and distinguish it from similar concepts. We introduce six of the most crucial dimensions that make an AI system trustworthy, namely, Safety & Robustness, Non-discrimination & Fairness, Explainability, Accountability & Auditability, Privacy, and Environmental Well-Being. For each dimension, we present an overview of related concepts and a taxonomy to help readers understand from a wide view how each dimension is studied, and summarize the representative technologies to enable readers to follow the latest research progress in each dimension. To further deepen the understanding of each dimension, we provide plenty of examples of applications in real-world systems and summarize existing related surveys and tools. We also discuss potential future research directions within each dimension. Afterwards, we analyze the accordant and conflicting interactions among different dimensions. Finally, it is important to mention that beyond the six dimensions elaborated in this survey, there still exist other potential issues that may undermine our trust in AI systems. Hence, we discuss several possible dimensions of trustworthy AI as future research directions.


  • [1] (2008) IEEE Std 1028-2008: IEEE standard for software reviews and audits. Cited by: §7.1.2.
  • [2] (2021) A list of homomorphic encryption libraries, software or resources. Cited by: §6.4.2.
  • [3] (2021) A list of mpc software or resources. Cited by: §6.4.2.
  • [4] (2021) A technical analysis of confidential computing. Note: Accessed Jan, 2021 Cited by: §6.2.2.
  • M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp. 308–318. Cited by: §6.3.4.
  • A. Acar, H. Aksu, A. S. Uluagac, and M. Conti (2018) A survey on homomorphic encryption schemes: theory and implementation. ACM Computing Surveys (CSUR) 51 (4), pp. 1–35. Cited by: §6.2.2.
  • A. Adadi and M. Berrada (2018) Peeking inside the black-box: a survey on explainable artificial intelligence (xai). IEEE Access 6, pp. 52138–52160. Cited by: 4th item.
  • T. Adel, I. Valera, Z. Ghahramani, and A. Weller (2019) One-network adversarial fairness. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 2412–2420. Cited by: Table 2.
  • P. Adler, C. Falk, S. A. Friedler, T. Nix, G. Rybeck, C. Scheidegger, B. Smith, and S. Venkatasubramanian (2018) Auditing black-box models for indirect influence. Knowledge and Information Systems 54 (1), pp. 95–122. Cited by: Table 2.
  • A. Agarwal, M. Dudik, and Z. S. Wu (2019) Fair regression: quantitative definitions and reduction-based algorithms. In International Conference on Machine Learning, pp. 120–129. Cited by: §4.3.1, Table 3.
  • S. Aghaei, M. J. Azizi, and P. Vayanos (2019) Learning optimal and fair decision trees for non-discriminative decision-making. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1418–1426. Cited by: Table 2.
  • N. Agrawal, A. Shahin Shamsabadi, M. J. Kusner, and A. Gascón (2019) QUOTIENT: two-party secure neural network training and prediction. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 1231–1247. Cited by: §6.2.2, §6.3.4.
  • E. Aïmeur, G. Brassard, J. M. Fernandez, and F. S. M. Onana (2008) Alambic: a privacy-preserving recommender system for electronic commerce. International Journal of Information Security 7 (5), pp. 307–334. Cited by: §6.3.3.
  • N. Akhtar and A. Mian (2018) Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access 6, pp. 14410–14430. Cited by: §3.6.1.
  • M. Al-Rubaie and J. M. Chang (2019) Privacy-preserving machine learning: threats and solutions. IEEE Security & Privacy 17 (2), pp. 49–58. Cited by: §1, §6.4.1.
  • M. Ancona, E. Ceolini, C. Öztireli, and M. Gross (2018) Towards better understanding of gradient-based attribution methods for deep neural networks. In International Conference on Learning Representations, Cited by: §5.4.2.
  • A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al. (2020) Explainable artificial intelligence (xai): concepts, taxonomies, opportunities and challenges toward responsible ai. Information Fusion 58, pp. 82–115. Cited by: §5.4.1, Table 4.
  • V. Arya, R. K. Bellamy, P. Chen, A. Dhurandhar, M. Hind, S. C. Hoffman, S. Houde, Q. V. Liao, R. Luss, A. Mojsilovic, et al. (2020) Ai explainability 360: an extensible toolkit for understanding data and machine learning models. Journal of Machine Learning Research 21 (130), pp. 1–6. Cited by: §5.4.2.
  • A. Backurs, P. Indyk, K. Onak, B. Schieber, A. Vakilian, and T. Wagner (2019) Scalable fair clustering. In International Conference on Machine Learning, pp. 405–413. Cited by: §4.3.1, Table 3.
  • M. Barreno, B. Nelson, A. D. Joseph, and J. D. Tygar (2010) The security of machine learning. Machine Learning 81 (2), pp. 121–148. Cited by: §1.
  • C. R. S. Basta, M. Ruiz Costa-Jussà, and J. A. Rodríguez Fonollosa (2020) Towards mitigating gender bias in a decoder-based neural machine translation model by adding contextual information. In Proceedings of the Fourth Widening Natural Language Processing Workshop, pp. 99–102. Cited by: Table 3.
  • O. Bastani, X. Zhang, and A. Solar-Lezama (2019) Probabilistic verification of fairness properties via concentration. Proceedings of the ACM on Programming Languages 3 (OOPSLA), pp. 1–27. Cited by: Table 2.
  • R. K. Bellamy, K. Dey, M. Hind, S. C. Hoffman, S. Houde, K. Kannan, P. Lohia, J. Martino, S. Mehta, A. Mojsilovic, et al. (2018) AI fairness 360: an extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. arXiv preprint arXiv:1810.01943. Cited by: §4.4.2.
  • V. Belle and I. Papantonis (2020) Principles and practice of explainable machine learning. arXiv preprint arXiv:2009.11698. Cited by: §5.4.1, Table 4.
  • R. Berk, H. Heidari, S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, S. Neel, and A. Roth (2017) A convex framework for fair regression. arXiv preprint arXiv:1706.02409. Cited by: §4.3.1, Table 3.
  • R. Berk, H. Heidari, S. Jabbari, M. Kearns, and A. Roth (2021) Fairness in criminal justice risk assessments: the state of the art. Sociological Methods & Research 50 (1), pp. 3–44. Cited by: 1st item.
  • B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli (2013) Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pp. 387–402. Cited by: §3.2, §3.
  • B. Biggio, B. Nelson, and P. Laskov (2012) Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389. Cited by: §3.1.1, §3.2.
  • C. M. Bishop (2006) Pattern recognition and machine learning. springer. Cited by: 1st item.
  • S. L. Blodgett, S. Barocas, H. Daumé III, and H. Wallach (2020) Language (technology) is power: a critical survey of “bias” in nlp. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5454–5476. Cited by: 2nd item, §4.4.1.
  • A. Bojchevski and S. Günnemann (2018) Adversarial attacks on node embeddings. arXiv preprint arXiv:1809.01093. Cited by: §3.5.4.
  • A. Bojchevski and S. Günnemann (2019) Adversarial attacks on node embeddings via graph poisoning. External Links: 1809.01093 Cited by: §3.2.
  • T. Bolukbasi, K. Chang, J. Zou, V. Saligrama, and A. Kalai (2016a) Man is to computer programmer as woman is to homemaker? debiasing word embeddings. arXiv preprint arXiv:1607.06520. Cited by: Table 3.
  • T. Bolukbasi, K. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai (2016b) Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances in neural information processing systems, pp. 4349–4357. Cited by: §4.3.3.
  • S. Bordia and S. R. Bowman (2019a) Identifying and reducing gender bias in word-level language models. arXiv preprint arXiv:1904.03035. Cited by: Table 3.
  • S. Bordia and S. R. Bowman (2019b) Identifying and reducing gender bias in word-level language models. arXiv preprint arXiv:1904.03035. Cited by: §4.3.3.
  • D. Borkan, L. Dixon, J. Sorensen, N. Thain, and L. Vasserman (2019) Nuanced metrics for measuring unintended bias with real data for text classification. In Companion Proceedings of The 2019 World Wide Web Conference, pp. 491–500. Cited by: Table 3.
  • A. J. Bose and W. L. Hamilton (2019) Compositional fairness constraints for graph embeddings. arXiv preprint arXiv:1905.10674. Cited by: §4.3.5, Table 3.
  • M. Bovens (2007) Analysing and assessing accountability: a conceptual framework 1. European law journal 13 (4), pp. 447–468. Cited by: §7.2.1.
  • J. Bringer, H. Chabanne, and A. Patey (2013) Privacy-preserving biometric identification using secure multiparty computation: an overview and recent trends. IEEE Signal Processing Magazine 30 (2), pp. 42–52. Cited by: §6.3.2.
  • M. Brundage, S. Avin, J. Wang, H. Belfield, G. Krueger, G. Hadfield, H. Khlaaf, J. Yang, H. Toner, R. Fong, et al. (2020) Toward trustworthy ai development: mechanisms for supporting verifiable claims. arXiv preprint arXiv:2004.07213. Cited by: §1, §1.
  • M. Brunet, C. Alkalay-Houlihan, A. Anderson, and R. Zemel (2019) Understanding the origins of bias in word embeddings. In International Conference on Machine Learning, pp. 803–811. Cited by: Table 3.
  • B. G. Buchanan (2005) A (very) brief history of artificial intelligence. Ai Magazine 26 (4), pp. 53–53. Cited by: §2.
  • J. Bughin, J. Seong, J. Manyika, M. Chui, and R. Joshi (2018) Notes from the ai frontier: modeling the impact of ai on the world economy. McKinsey Global Institute. Cited by: §1.
  • J. Buolamwini and T. Gebru (2018) Gender shades: intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pp. 77–91. Cited by: §4.3.2, Table 3.
  • J. Burrell (2016) How the machine ‘thinks’: understanding opacity in machine learning algorithms. Big Data & Society 3 (1), pp. 2053951715622512. Cited by: §7.1.2.
  • E. Cai, D. Juan, D. Stamoulis, and D. Marculescu (2017) Neuralpower: predict and deploy energy-efficient convolutional neural networks. In Asian Conference on Machine Learning, pp. 622–637. Cited by: §8.2.
  • T. Calders, F. Kamiran, and M. Pechenizkiy (2009) Building classifiers with independency constraints. In 2009 IEEE International Conference on Data Mining Workshops, pp. 13–18. Cited by: Table 3.
  • T. Calders and S. Verwer (2010) Three naive bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery 21 (2), pp. 277–292. Cited by: Table 3.
  • T. Calders and I. Žliobaitė (2013) Why unbiased computational processes can lead to discriminative decision procedures. In Discrimination and privacy in the information society, pp. 43–57. Cited by: Table 2.
  • S. Cao, C. Zhang, Z. Yao, W. Xiao, L. Nie, D. Zhan, Y. Liu, M. Wu, and L. Zhang (2019) Efficient and effective sparse lstm on fpga with bank-balanced sparsity. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 63–72. Cited by: §8.1.1.
  • N. Carlini and D. Wagner (2017a) Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14. Cited by: §3.4.3.
  • N. Carlini and D. Wagner (2017b) Towards evaluating the robustness of neural networks. External Links: 1608.04644 Cited by: §3.1.1.
  • N. Carlini and D. Wagner (2018) Audio adversarial examples: targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 1–7. Cited by: §3.5.3.
  • Y. Carmon, A. Raghunathan, L. Schmidt, P. Liang, and J. C. Duchi (2019) Unlabeled data improves adversarial robustness. External Links: 1905.13736 Cited by: §3.4.1.
  • S. Carty (2011) Many cars tone deaf to women’s voices. AOL Autos. Cited by: §4.3.4, Table 3.
  • S. Caton and C. Haas (2020) Fairness in machine learning: a survey. arXiv preprint arXiv:2010.04053. Cited by: §4.2, §4.4.1.
  • L. E. Celis and V. Keswani (2019) Improved adversarial learning for fair classification. arXiv preprint arXiv:1901.10443. Cited by: Table 2.
  • L. Celis, A. Deshpande, T. Kathuria, and N. Vishnoi (2016) How to be fair and diverse?. ArXiv abs/1610.07183. Cited by: §4.2.
  • A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, and D. Mukhopadhyay (2018) Adversarial attacks and defences: a survey. arXiv preprint arXiv:1810.00069. Cited by: §3.6.1.
  • F. Chen, L. Song, and Y. Chen (2018a) Regan: a pipelined reram-based accelerator for generative adversarial networks. In 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 178–183. Cited by: §8.1.3.
  • H. Chen, H. Zhang, D. Boning, and C. Hsieh (2019a) Robust decision trees against adversarial examples. In International Conference on Machine Learning, pp. 1122–1131. Cited by: §3.2.
  • I. Chen, F. D. Johansson, and D. Sontag (2018b) Why is my classifier discriminatory?. arXiv preprint arXiv:1805.12002. Cited by: Table 2.
  • J. Chen, H. Dong, X. Wang, F. Feng, M. Wang, and X. He (2020a) Bias and debias in recommender system: a survey and future directions. arXiv preprint arXiv:2010.03240. Cited by: 2nd item, §4.4.1.
  • J. Chen, Y. Wu, X. Xu, Y. Chen, H. Zheng, and Q. Xuan (2018c) Fast gradient attack on network embedding. External Links: 1809.02797 Cited by: §3.2.
  • S. Chen, N. Carlini, and D. Wagner (2020b) Stateful detection of black-box adversarial attacks. In Proceedings of the 1st ACM Workshop on Security and Privacy on Artificial Intelligence, pp. 30–39. Cited by: §3.4.3.
  • X. Chen, B. Fain, L. Lyu, and K. Munagala (2019b) Proportionally fair clustering. In International Conference on Machine Learning, pp. 1032–1041. Cited by: §4.3.1, Table 3.
  • X. Chen, C. Liu, B. Li, K. Lu, and D. Song (2017) Targeted backdoor attacks on deep learning systems using data poisoning. External Links: 1712.05526 Cited by: §3.1.1, §3.3.2, §3.5.1, §3.
  • Y. Chen, Y. Xie, L. Song, F. Chen, and T. Tang (2020c) A survey of accelerator architectures for deep neural networks. Engineering 6 (3), pp. 264–274. Cited by: §8.1.3, §8.3.1.
  • M. Cheng, J. Yi, H. Zhang, P. Chen, and C. Hsieh (2018) Seq2sick: evaluating the robustness of sequence-to-sequence models with adversarial examples. arXiv preprint arXiv:1803.01128. Cited by: §3.2.
  • Y. Cheng, D. Wang, P. Zhou, and T. Zhang (2017) A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282. Cited by: §8.3.1.
  • R. Cheong and R. Daniel (2019) Transformers.zip: compressing transformers with pruning and quantization. Technical report, Stanford University, Stanford, California. Cited by: §8.1.1.
  • S. Chiappa (2019) Path-specific counterfactual fairness. In AAAI, Cited by: Table 2.
  • I. Chillotti, N. Gama, M. Georgieva, and M. Izabachene (2016) Faster fully homomorphic encryption: bootstrapping in less than 0.1 seconds. In international conference on the theory and application of cryptology and information security, pp. 3–33. Cited by: §6.2.2.
  • W. I. Cho, J. W. Kim, S. M. Kim, and N. S. Kim (2019) On measuring gender bias in translation of gender-neutral pronouns. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pp. 173–181. Cited by: Table 3.
  • Y. Choi, M. El-Khamy, and J. Lee (2016) Towards the limit of network quantization. arXiv preprint arXiv:1612.01543. Cited by: §8.1.1.
  • A. Chouldechova (2017) Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big Data 5 (2), pp. 153–163. Cited by: §4.3.1.
  • A. Chouldechova and M. G’Sell (2017) Fairer and more accurate, but for whom?. arXiv preprint arXiv:1707.00046. Cited by: Table 2.
  • J. Cohen, E. Rosenfeld, and Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310–1320. Cited by: §3.4.2.
  • R. Cohen and D. Ruths (2013) Classifying political orientation on twitter: it’s not easy!. In Seventh international AAAI conference on weblogs and social media, Cited by: §4.1.1.
  • T. Cohen and M. Welling (2016) Group equivariant convolutional networks. In International conference on machine learning, pp. 2990–2999. Cited by: §8.1.1.
  • European Commission, Independent High-Level Expert Group on Artificial Intelligence (2019) Ethics guidelines for trustworthy AI. Cited by: §2, §2.
  • S. Corbett-Davies and S. Goel (2018) The measure and mismeasure of fairness: a critical review of fair machine learning. arXiv preprint arXiv:1808.00023. Cited by: §4.4.1.
  • S. Corbett-Davies, E. Pierson, A. Feller, S. Goel, and A. Huq (2017) Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd acm sigkdd international conference on knowledge discovery and data mining, pp. 797–806. Cited by: 1st item.
  • B. Cowgill and C. Tucker (2017) Algorithmic bias: a counterfactual perspective. NSF Trustworthy Algorithms. Cited by: Table 2.
  • F. Croce and M. Hein (2020a) Minimally distorted adversarial examples with a fast adaptive boundary attack. arXiv preprint arXiv:1907.02044. Cited by: §3.3.1.
  • F. Croce and M. Hein (2020b) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. arXiv preprint arXiv:2003.01690. Cited by: §3.3.1.
  • R. Cummings, V. Gupta, D. Kimpara, and J. Morgenstern (2019) On the compatibility of privacy and fairness. In Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization, pp. 309–315. Cited by: §9.2.
  • A. C. Curry, J. Robertson, and V. Rieser (2020) Conversational assistants and gender stereotypes: public perceptions and desiderata for voice personas. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pp. 72–78. Cited by: Table 3.
  • E. Dai and S. Wang (2021) Say no to the discrimination: learning fair graph neural networks with limited sensitive attribute information. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 680–688. Cited by: §4.3.5, Table 3.
  • H. Dai, H. Li, T. Tian, X. Huang, L. Wang, J. Zhu, and L. Song (2018) Adversarial attack on graph structured data. arXiv preprint arXiv:1806.02371. Cited by: §3.2.
  • N. Dalvi, P. Domingos, S. Sanghai, D. Verma, et al. (2004) Adversarial classification. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 99–108. Cited by: §3.2.
  • M. Danilevsky, K. Qian, R. Aharonov, Y. Katsis, B. Kawas, and P. Sen (2020) A survey of the state of explainable ai for natural language processing. arXiv preprint arXiv:2010.00711. Cited by: §5.3.3, §5.4.1, Table 4.
  • F. K. Dankar and K. El Emam (2013) Practicing differential privacy in health care: a review.. Trans. Data Priv. 6 (1), pp. 35–67. Cited by: §6.3.1.
  • E. De Cristofaro (2020) An overview of privacy in machine learning. arXiv preprint arXiv:2005.08679. Cited by: §6.4.1.
  • M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser (2018) Universal transformers. arXiv preprint arXiv:1807.03819. Cited by: §8.1.1.
  • L. Deng and Y. Liu (2018) Deep learning in natural language processing. Springer. Cited by: §5.3.3.
  • M. J. Denny and A. Spirling (2016) Assessing the consequences of text preprocessing decisions. Available at SSRN. Cited by: §4.1.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1.
  • E. Dinan, A. Fan, A. Williams, J. Urbanek, D. Kiela, and J. Weston (2020) Queens are powerful too: mitigating gender bias in dialogue generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 8173–8188. Cited by: Table 3.
  • L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman (2018) Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 67–73. Cited by: Table 3.
  • F. Doshi-Velez and B. Kim (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608. Cited by: 2nd item, §5.1.1, §5.4.1, Table 4, §5.
  • M. Du, N. Liu, and X. Hu (2019) Techniques for interpretable machine learning. Communications of the ACM 63 (1), pp. 68–77. Cited by: §5.4.1, Table 4.
  • C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012) Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pp. 214–226. Cited by: §4.1.2, §4.1.2.
  • C. Dwork, F. McSherry, K. Nissim, and A. Smith (2006) Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pp. 265–284. Cited by: §6.2.2.
  • C. Dwork, A. Roth, et al. (2014) The algorithmic foundations of differential privacy.. Foundations and Trends in Theoretical Computer Science 9 (3-4), pp. 211–407. Cited by: §6.2.2, §6.4.1.
  • C. Dwork (2008) Differential privacy: a survey of results. In International conference on theory and applications of models of computation, pp. 1–19. Cited by: §6.2.2, §6.4.1.
  • D. Ensign, S. A. Friedler, S. Neville, C. Scheidegger, and S. Venkatasubramanian (2018) Decision making with limited feedback: error bounds for predictive policing and recidivism prediction. In Proceedings of Algorithmic Learning Theory, Vol. 83. Cited by: Table 2.
  • H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger (2012) Neural acceleration for general-purpose approximate programs. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 449–460. Cited by: §8.1.3.
  • C. Etmann, S. Lunz, P. Maass, and C. Schönlieb (2019) On the connection between adversarial robustness and saliency map interpretability. arXiv preprint arXiv:1905.04172. Cited by: §1, §9.1.
  • D. Evans, V. Kolesnikov, and M. Rosulek (2018) Cited by: §6.2.2, §6.2.2.
  • K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song (2017) Robust physical-world attacks on deep learning models. arXiv preprint arXiv:1707.08945. Cited by: §3.5.1.
  • W. Fan, T. Derr, Y. Ma, J. Wang, J. Tang, and Q. Li (2019a) Deep adversarial social recommendation. In 28th International Joint Conference on Artificial Intelligence (IJCAI-19), pp. 1351–1357. Cited by: §5.3.1.
  • W. Fan, T. Derr, Y. Ma, J. Wang, J. Tang, and Q. Li (2019b) Deep adversarial social recommendation. In 28th International Joint Conference on Artificial Intelligence (IJCAI-19), pp. 1351–1357. Cited by: §5.3.1.
  • W. Fan, T. Derr, X. Zhao, Y. Ma, H. Liu, J. Wang, J. Tang, and Q. Li (2021) Attacking black-box recommendations via copying cross-domain user profiles. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pp. 1583–1594. Cited by: §3, §5.3.1.
  • W. Fan, Y. Ma, Q. Li, Y. He, E. Zhao, J. Tang, and D. Yin (2019c) Graph neural networks for social recommendation. In The World Wide Web Conference, pp. 417–426. Cited by: §5.3.1.
  • W. Fan, Y. Ma, Q. Li, J. Wang, G. Cai, J. Tang, and D. Yin (2020) A graph neural network framework for social recommendations. IEEE Transactions on Knowledge and Data Engineering. Cited by: §5.3.1.
  • M. Fang, G. Yang, N. Z. Gong, and J. Liu (2018) Poisoning attacks to graph-based recommender systems. In Proceedings of the 34th Annual Computer Security Applications Conference, pp. 381–392. Cited by: §3.
  • [119] (2021) Federated AI technology enabler. Cited by: §6.4.2.
  • [120] (2018) Federated learning of predictive models from federated electronic health records. International Journal of Medical Informatics 112, pp. 59–67. ISSN 1386-5056. Cited by: §6.3.1.
  • M. Feldman, S. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian (2015a) Certifying and removing disparate impact. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Cited by: §4.2.
  • M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian (2015b) Certifying and removing disparate impact. In proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 259–268. Cited by: Table 2.
  • R. Feng, Y. Yang, Y. Lyu, C. Tan, Y. Sun, and C. Wang (2019) Learning fair representations via an adversarial framework. arXiv preprint arXiv:1904.13341. Cited by: Table 2.
  • S. G. Finlayson, J. D. Bowers, J. Ito, J. L. Zittrain, A. L. Beam, and I. S. Kohane (2019) Adversarial attacks on medical machine learning. Science 363 (6433), pp. 1287–1289. Cited by: §3.
  • L. Floridi, J. Cowls, M. Beltrametti, R. Chatila, P. Chazerand, V. Dignum, C. Luetge, R. Madelin, U. Pagallo, F. Rossi, et al. (2018) AI4People—an ethical framework for a good ai society: opportunities, risks, principles, and recommendations. Minds and Machines 28 (4), pp. 689–707. Cited by: 1st item.
  • World Economic Forum (2020) The future of jobs report 2020. Cited by: §1, §5.
  • M. Fredrikson, S. Jha, and T. Ristenpart (2015) Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1322–1333. Cited by: §6.2.1.
  • M. Fredrikson, E. Lantz, S. Jha, S. Lin, D. Page, and T. Ristenpart (2014) Privacy in pharmacogenetics: an end-to-end case study of personalized warfarin dosing. In 23rd USENIX Security Symposium (USENIX Security 14), pp. 17–32. Cited by: §6.2.1.
  • E. García-Martín, C. F. Rodrigues, G. Riley, and H. Grahn (2019) Estimation of energy consumption in machine learning. Journal of Parallel and Distributed Computing 134, pp. 75–88. Cited by: §8.3.1.
  • S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith (2020) RealToxicityPrompts: evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3356–3369. Cited by: Table 3.
  • C. Gentry and S. Halevi (2011) Implementing gentry’s fully-homomorphic encryption scheme. In Annual international conference on the theory and applications of cryptographic techniques, pp. 129–148. Cited by: §6.2.2.
  • C. Gentry (2009) Fully homomorphic encryption using ideal lattices. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pp. 169–178. Cited by: §6.2.2.
  • A. Ghorbani, A. Abid, and J. Zou (2019) Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3681–3688. Cited by: 1st item.
  • L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal (2018) Explaining explanations: an overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on data science and advanced analytics (DSAA), pp. 80–89. Cited by: §5.1.1, §5.4.1.
  • N. Goel, M. Yaghini, and B. Faltings (2018a) Non-discriminatory machine learning through convex fairness criteria. Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. Cited by: §4.2.
  • N. Goel, M. Yaghini, and B. Faltings (2018b) Non-discriminatory machine learning through convex fairness criteria. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: Table 3.
  • H. Gonen and Y. Goldberg (2019) Lipstick on a pig: debiasing methods cover up systematic gender biases in word embeddings but do not remove them. arXiv preprint arXiv:1903.03862. Cited by: Table 3.
  • H. Gonen and K. Webster (2020) Automatically identifying gender issues in machine translation using perturbations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 1991–1995. Cited by: Table 3.
  • Z. Gong, W. Wang, and W. Ku (2017) Adversarial and clean data are not twins. arXiv preprint arXiv:1704.04960. Cited by: §3.4.3.
  • S. T. Gooden (2015) Race and social equity: a nervous area of government. Taylor & Francis. ISBN 9781317461456. Cited by: 3rd item.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §3.1.1, §3.1.1.
  • [142] (2019) Governance principles for the new generation artificial intelligence–developing responsible artificial intelligence. Note: Accessed March 18, 2021 Cited by: 3rd item.
  • B. Green and Y. Chen (2019) Disparate interactions: an algorithm-in-the-loop analysis of fairness in risk assessments. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 90–99. Cited by: §7.1.2.
  • K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. McDaniel (2017) On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280. Cited by: §3.4.3.
  • R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi (2018) A survey of methods for explaining black box models. ACM computing surveys (CSUR) 51 (5), pp. 1–42. Cited by: §5.4.1, Table 4.
  • S. Hajian and J. Domingo-Ferrer (2012) A methodology for direct and indirect discrimination prevention in data mining. IEEE transactions on knowledge and data engineering 25 (7), pp. 1445–1459. Cited by: Table 2.
  • S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, et al. (2017) Ese: efficient speech recognition engine with sparse lstm on fpga. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 75–84. Cited by: §8.1.3.
  • S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally (2016) EIE: efficient inference engine on compressed deep neural network. ACM SIGARCH Computer Architecture News 44 (3), pp. 243–254. Cited by: §8.1.3.
  • S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §8.1.1.
  • A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. (2014) Deep speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567. Cited by: §3.5.3.
  • M. Hardt, E. Price, and N. Srebro (2016a) Equality of opportunity in supervised learning. In NIPS. Cited by: §4.1.2, Table 2.
  • M. Hardt, E. Price, and N. Srebro (2016b) Equality of opportunity in supervised learning. arXiv preprint arXiv:1610.02413. Cited by: §4.2, §4.2, Table 2, Table 3.
  • C. He, S. Li, J. So, M. Zhang, H. Wang, X. Wang, P. Vepakomma, A. Singh, H. Qiu, L. Shen, P. Zhao, Y. Kang, Y. Liu, R. Raskar, Q. Yang, M. Annavaram, and S. Avestimehr (2020) FedML: a research library and benchmark for federated machine learning. arXiv preprint arXiv:2007.13518. Cited by: §6.4.2.
  • Ú. Hébert-Johnson, M. P. Kim, O. Reingold, and G. N. Rothblum (2017) Calibration for the (computationally-identifiable) masses. ArXiv abs/1711.08513. Cited by: Table 2.
  • P. Henderson, K. Sinha, N. Angelard-Gontier, N. R. Ke, G. Fried, R. Lowe, and J. Pineau (2018) Ethical challenges in data-driven dialogue systems. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 123–129. Cited by: §1.
  • M. Hildebrandt (2019) Privacy as protection of the incomputable self: from agnostic to agonistic machine learning. Theoretical Inquiries in Law 20 (1), pp. 83–121. Cited by: §4.1.1.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §8.1.1.
  • L. Hou and J. T. Kwok (2018) Loss-aware weight quantization of deep networks. arXiv preprint arXiv:1802.08635. Cited by: §8.1.1.
  • A. Howard and J. Borenstein (2018) The ugly truth about ourselves and our robot creations: the problem of bias and social inequity. Science and engineering ethics 24 (5), pp. 1521–1536. Cited by: §1, §4.3.2, §4.3.4, Table 3.
  • H. Hu, Z. Salcic, G. Dobbie, and X. Zhang (2021) Membership inference attacks on machine learning: a survey. arXiv preprint arXiv:2103.07853. Cited by: §6.2.1.
  • X. Huang, L. Xing, F. Dernoncourt, and M. J. Paul (2020) Multilingual twitter corpus and baselines for evaluating demographic bias in hate speech recognition. arXiv preprint arXiv:2002.10361. Cited by: §4.1.1, Table 3.
  • B. Hutchinson and M. Mitchell (2019) 50 years of test (un)fairness: lessons for machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 49–58. Cited by: §4.4.1.
  • V. Iosifidis, B. Fetahu, and E. Ntoutsi (2019) FAE: a fairness-aware ensemble framework. 2019 IEEE International Conference on Big Data (Big Data), pp. 1375–1380. Cited by: Table 2.
  • M. Jaderberg, A. Vedaldi, and A. Zisserman (2014) Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866. Cited by: §8.1.1.
  • M. Jagielski, J. Ullman, and A. Oprea (2020) Auditing differentially private machine learning: how private is private sgd?. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 22205–22216. Cited by: §6.3.4.
  • Z. Ji, Z. C. Lipton, and C. Elkan (2014) Differential privacy and machine learning: a survey and review. arXiv preprint arXiv:1412.7584. Cited by: §6.4.1.
  • H. Jiang and O. Nachum (2020) Identifying and correcting label bias in machine learning. In International Conference on Artificial Intelligence and Statistics, pp. 702–712. Cited by: Table 2.
  • J. Jiménez-Luna, F. Grisoni, and G. Schneider (2020) Drug discovery with explainable artificial intelligence. Nature Machine Intelligence 2 (10), pp. 573–584. Cited by: §5.3.2, §5.4.1, Table 4.
  • W. Jin, Y. Li, H. Xu, Y. Wang, and J. Tang (2020) Adversarial attacks and defenses on graphs: a review and empirical study. arXiv preprint arXiv:2003.00653. Cited by: §3.2, §3.6.1.
  • K. Joseph, L. Friedland, W. Hobbs, O. Tsur, and D. Lazer (2017) Constance: modeling annotation contexts to improve stance classification. arXiv preprint arXiv:1708.06309. Cited by: §4.1.1.
  • M. Joseph, M. Kearns, J. H. Morgenstern, S. Neel, and A. Roth (2016) Fair algorithms for infinite and contextual bandits. arXiv: Learning. Cited by: §4.2.
  • P. Kairouz, J. Liao, C. Huang, and L. Sankar (2019) Censored and fair universal representations using generative adversarial models. arXiv preprint arXiv:1910.00411. Cited by: Table 2.
  • G. A. Kaissis, M. R. Makowski, D. Rückert, and R. F. Braren (2020) Secure, privacy-preserving and federated machine learning in medical imaging. Nature Machine Intelligence 2 (6), pp. 305–311. Cited by: §6.3.1, §6.3.2.
  • F. Kamiran and T. Calders (2011) Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems 33, pp. 1–33. Cited by: §4.2.
  • F. Kamiran and I. Žliobaitė (2013) Explainable and non-explainable discrimination in classification. In Discrimination and Privacy in the Information Society, Cited by: §4.1.1.
  • F. Kamiran and T. Calders (2009) Classifying without discriminating. In 2009 2nd International Conference on Computer, Control and Communication, pp. 1–6. Cited by: §4.3.1, Table 3.
  • F. Kamiran and T. Calders (2012) Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems 33 (1), pp. 1–33. Cited by: §4.2, Table 2.
  • T. Kamishima, S. Akaho, H. Asoh, and J. Sakuma (2012) Fairness-aware classifier with prejudice remover regularizer. In ECML/PKDD, Cited by: §4.2.
  • M. Kapralov and K. Talwar (2013) On differentially private low rank approximation. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pp. 1395–1414. Cited by: §6.3.3.
  • S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh (2020) SCAFFOLD: stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132–5143. Cited by: §6.2.2.
  • A. Khaled, K. Mishchenko, and P. Richtárik (2020) Tighter theory for local sgd on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pp. 4519–4529. Cited by: §6.2.2.
  • N. Kilbertus, M. Rojas-Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf (2017) Avoiding discrimination through causal reasoning. In NIPS, Cited by: Table 2.
  • M. P. Kim, O. Reingold, and G. N. Rothblum (2018) Fairness through computationally-bounded awareness. In NeurIPS, Cited by: Table 2.
  • Y. Kim and A. M. Rush (2016) Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947. Cited by: §8.1.1.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §3.5.4.
  • S. Kiritchenko and S. M. Mohammad (2018) Examining gender and race bias in two hundred sentiment analysis systems. arXiv preprint arXiv:1805.04508. Cited by: Table 3.
  • J. N. Kok, E. J. Boers, W. A. Kosters, P. Van der Putten, and M. Poel (2009) Artificial intelligence: definition, trends, techniques, and cases. Artificial intelligence 1, pp. 270–299. Cited by: §2.
  • E. Krasanakis, E. Spyromitros-Xioufis, S. Papadopoulos, and Y. Kompatsiaris (2018) Adaptive sensitive reweighting to mitigate bias in fairness-aware classification. In Proceedings of the 2018 World Wide Web Conference, pp. 853–862. Cited by: §4.2, Table 2.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §3.2.
  • M. J. Kusner, J. R. Loftus, C. Russell, and R. Silva (2017) Counterfactual fairness. In NIPS, Cited by: §4.1.2.
  • A. Lacoste, A. Luccioni, V. Schmidt, and T. Dandres (2019) Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700. Cited by: §8.3.2.
  • A. Lambrecht and C. Tucker (2019) Algorithmic bias? an empirical study of apparent gender-based discrimination in the display of stem career ads. Management Science 65 (7), pp. 2966–2981. Cited by: §4.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §8.1.1.
  • [194] (2021) LEAF: a benchmark for federated settings. Cited by: §6.4.2.
  • Y. LeCun, Y. Bengio, et al. (1995) Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361 (10), pp. 1995. Cited by: §3.2.
  • D. Lee, D. Kohlbrenner, S. Shinde, K. Asanovic, and D. Song (2020) Keystone: an open framework for architecting trusted execution environments. In Proceedings of the Fifteenth European Conference on Computer Systems, EuroSys ’20. Cited by: §6.4.2.
  • J. D. Lee and K. A. See (2004) Trust in automation: designing for appropriate reliance. Human factors 46 (1), pp. 50–80. Cited by: §2.
  • Q. Lei, L. Wu, P. Chen, A. G. Dimakis, I. S. Dhillon, and M. Witbrock (2018) Discrete attacks and submodular optimization with applications to text classification. CoRR abs/1812.00151. Cited by: §3.5.2.
  • J. Levinson, J. Askeland, J. Becker, J. Dolson, D. Held, S. Kammel, J. Z. Kolter, D. Langer, O. Pink, V. Pratt, et al. (2011) Towards fully autonomous driving: systems and algorithms. In 2011 IEEE Intelligent Vehicles Symposium (IV), pp. 163–168. Cited by: §5.
  • D. Li, X. Chen, M. Becchi, and Z. Zong (2016) Evaluating the energy efficiency of deep convolutional neural networks on cpus and gpus. In 2016 IEEE international conferences on big data and cloud computing (BDCloud), social computing and networking (SocialCom), sustainable computing and communications (SustainCom)(BDCloud-SocialCom-SustainCom), pp. 477–484. Cited by: §8.2.
  • T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith (2020a) Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2-4, 2020, I. S. Dhillon, D. S. Papailiopoulos, and V. Sze (Eds.). Cited by: §6.2.2.
  • X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang (2020b) On the convergence of fedavg on non-iid data. In International Conference on Learning Representations. Cited by: §6.2.2.
  • P. Linardatos, V. Papastefanopoulos, and S. Kotsiantis (2021) Explainable ai: a review of machine learning interpretability methods. Entropy 23 (1), pp. 18. Cited by: §5.4.1, Table 4, §5.
  • S. Lins, S. Schneider, J. Szefer, S. Ibraheem, and A. Sunyaev (2019) Designing monitoring systems for continuous certification of cloud services: deriving meta-requirements and design guidelines. Communications of the Association for Information Systems 44 (1), pp. 25. Cited by: §7.3.
  • H. Liu, J. Dacon, W. Fan, H. Liu, Z. Liu, and J. Tang (2020a) Does gender matter? towards fairness in dialogue systems. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 4403–4416. Cited by: §4.3.3, Table 3, §4.
  • H. Liu, T. Derr, Z. Liu, and J. Tang (2019) Say what i want: towards the dark side of neural dialogue models. arXiv preprint arXiv:1909.06044. Cited by: §3.5.2.
  • H. Liu, W. Jin, H. Karimi, Z. Liu, and J. Tang (2021a) The authors matter: understanding and mitigating implicit bias in deep text classification. arXiv preprint arXiv:2105.02778. Cited by: 2nd item, §4.1.1, Table 2.
  • H. Liu, W. Wang, Y. Wang, H. Liu, Z. Liu, and J. Tang (2020b) Mitigating gender bias for neural dialogue generation with adversarial learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 893–903. Cited by: §4.2, §4.3.3, Table 2.
  • X. Liu, Y. Li, R. Wang, J. Tang, and M. Yan (2021b) Linear convergent decentralized optimization with compression. In International Conference on Learning Representations. Cited by: §6.2.2.
  • X. Liu, M. Mao, B. Liu, H. Li, Y. Chen, B. Li, Y. Wang, H. Jiang, M. Barnell, Q. Wu, et al. (2015) RENO: a high-efficient reconfigurable neuromorphic computing accelerator design. In Proceedings of the 52nd Annual Design Automation Conference, pp. 1–6. Cited by: §8.1.3.
  • Y. Liu, G. Radanovic, C. Dimitrakakis, D. Mandal, and D. C. Parkes (2017) Calibrated fairness in bandits. arXiv preprint arXiv:1707.01875. Cited by: Table 2.
  • G. Louppe, M. Kagan, and K. Cranmer (2016) Learning to pivot with adversarial networks. arXiv preprint arXiv:1611.01046. Cited by: §4.4.2.
  • K. Lu, P. Mardziel, F. Wu, P. Amancharla, and A. Datta (2020) Gender bias in neural natural language processing. In Logic, Language, and Security, pp. 189–202. Cited by: Table 3.
  • S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems 30, pp. 4765–4774. Cited by: 2nd item, Table 4.
  • D. Luo, W. Cheng, D. Xu, W. Yu, B. Zong, H. Chen, and X. Zhang (2020) Parameterized explainer for graph neural network. arXiv preprint arXiv:2011.04573. Cited by: 2nd item, §5.3.2.
  • W. Ma, M. Zhang, Y. Cao, W. Jin, C. Wang, Y. Liu, S. Ma, and X. Ren (2019a) Jointly learning explainable rules for recommendation with knowledge graph. In The World Wide Web Conference, pp. 1210–1221. Cited by: §5.3.1.
  • Y. Ma, S. Wang, T. Derr, L. Wu, and J. Tang (2019b) Attacking graph convolutional networks via rewiring. arXiv preprint arXiv:1906.03750. Cited by: §3.2.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §3.3.1, §3.
  • B. Marlin, R. S. Zemel, S. Roweis, and M. Slaney (2012) Collaborative filtering and the missing at random assumption. arXiv preprint arXiv:1206.5267. Cited by: §4.1.1.
  • K. Martin (2019) Ethical implications and accountability of algorithms. Journal of Business Ethics 160 (4), pp. 835–850. Cited by: §7.1.1.
  • C. May, A. Wang, S. Bordia, S. R. Bowman, and R. Rudinger (2019a) On measuring social biases in sentence encoders. arXiv preprint arXiv:1903.10561. Cited by: §4.3.3.
  • C. May, A. Wang, S. Bordia, S. R. Bowman, and R. Rudinger (2019b) On measuring social biases in sentence encoders. arXiv preprint arXiv:1903.10561. Cited by: Table 3.
  • R. C. Mayer, J. H. Davis, and F. D. Schoorman (1995) An integrative model of organizational trust. Academy of management review 20 (3), pp. 709–734. Cited by: §2.
  • J. McCarthy, M. L. Minsky, N. Rochester, and C. E. Shannon (2006) A proposal for the dartmouth summer research project on artificial intelligence, august 31, 1955. AI magazine 27 (4), pp. 12–12. Cited by: §2.
  • H. B. McMahan et al. (2021) Advances and open problems in federated learning. Foundations and Trends in Machine Learning 14 (1). Cited by: §6.2.2, §6.2.2, §6.4.1.
  • F. McSherry and I. Mironov (2009) Differentially private recommender systems: building privacy into the netflix prize contenders. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 627–636. Cited by: §6.3.3.
  • F. McSherry and K. Talwar (2007) Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pp. 94–103. Cited by: §6.2.2.
  • N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan (2019) A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635. Cited by: 3rd item, §4.1.1, §4.1.2, §4.4.1, §4.
  • A. Menon and R. Williamson (2017) The cost of fairness in classification. ArXiv abs/1705.09055. Cited by: Table 2.
  • A. K. Menon and R. C. Williamson (2018) The cost of fairness in binary classification. In Conference on Fairness, Accountability and Transparency, pp. 107–118. Cited by: Table 3.
  • P. Michel, O. Levy, and G. Neubig (2019) Are sixteen heads really better than one?. arXiv preprint arXiv:1905.10650. Cited by: §8.1.1.
  • T. Miller (2019) Explanation in artificial intelligence: insights from the social sciences. Artificial intelligence 267, pp. 1–38. Cited by: §5.1.1, Table 4.
  • R. Miotto, F. Wang, S. Wang, X. Jiang, and J. T. Dudley (2018) Deep learning for healthcare: review, opportunities and challenges. Briefings in bioinformatics 19 (6), pp. 1236–1246. Cited by: §5.
  • S. Mittal and J. S. Vetter (2014) A survey of methods for analyzing and improving gpu energy efficiency. ACM Computing Surveys (CSUR) 47 (2), pp. 1–23. Cited by: §8.3.1.
  • B. Mittelstadt, C. Russell, and S. Wachter (2019) Explaining explanations in ai. In Proceedings of the conference on fairness, accountability, and transparency, pp. 279–288. Cited by: §5.2.4.
  • P. Mohassel and Y. Zhang (2017) Secureml: a system for scalable privacy-preserving machine learning. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 19–38. Cited by: §6.2.2, §6.3.4.
  • C. Molnar (2020) Interpretable machine learning. Lulu.com. Cited by: Figure 4, §5.1.1, §5.2.4, §5.4.1, Table 4, §5.
  • S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard (2017) Universal adversarial perturbations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1765–1773. Cited by: §3.3.1.
  • L. Moy (2019) How police technology aggravates racial inequity: a taxonomy of problems and a path forward. Available at SSRN 3340898. Cited by: §7.1.2.
  • J. Mullenbach, S. Wiegreffe, J. Duke, J. Sun, and J. Eisenstein (2018) Explainable prediction of medical codes from clinical text. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1101–1111. Cited by: §5.3.3.
  • R. Nabi and I. Shpitser (2018) Fair inference on outcomes. Proceedings of the … AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence 2018, pp. 1931–1940. Cited by: Table 2.
  • V. Nikolaenko, S. Ioannidis, U. Weinsberg, M. Joye, N. Taft, and D. Boneh (2013) Privacy-preserving matrix factorization. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pp. 801–812. Cited by: §6.3.3.
  • T. Niu and M. Bansal (2018) Adversarial over-sensitivity and over-stability strategies for dialogue models. arXiv preprint arXiv:1809.02079. Cited by: §3.2.
  • A. Noack, I. Ahern, D. Dou, and B. Li (2021) An empirical study on the relation between network interpretability and adversarial robustness. SN Computer Science 2 (1), pp. 1–13. Cited by: §1, §9.1.
  • H. Nori, S. Jenkins, P. Koch, and R. Caruana (2019) InterpretML: a unified framework for machine learning interpretability. arXiv preprint arXiv:1909.09223. Cited by: §5.4.2.
  • [246] (2021) OpenDP: open source tools for differential privacy. Cited by: §6.4.2.
  • Future of Life Institute (2017) Asilomar ai principles. Future of Life Institute. Note: Accessed March 18, 2021. Cited by: 2nd item.
  • A. Olteanu, C. Castillo, F. Diaz, and E. Kiciman (2019) Social data: biases, methodological pitfalls, and ethical boundaries. Frontiers in Big Data 2, pp. 13. Cited by: §4.1.1.
  • [249] (2021) Opacus: train pytorch models with differential privacy. Cited by: §6.4.2.
  • P. Pachal (2015) Google photos identified two black people as ‘gorillas’. Mashable, July 1. Cited by: §4.3.2, Table 3.
  • [251] (2021) Paddle federated learning. Cited by: §6.4.2.
  • O. Papakyriakopoulos, S. Hegelich, J. C. M. Serrano, and F. Marco (2020) Bias in word embeddings. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 446–457. Cited by: Table 3.
  • A. Parashar, P. Raina, Y. S. Shao, Y. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer (2019) Timeloop: a systematic approach to dnn accelerator evaluation. In 2019 IEEE international symposium on performance analysis of systems and software (ISPASS), pp. 304–315. Cited by: §8.3.2.
  • J. H. Park, J. Shin, and P. Fung (2018) Reducing gender bias in abusive language detection. arXiv preprint arXiv:1808.07231. Cited by: Table 3.
  • B. Plank, D. Hovy, and A. Søgaard (2014) Learning part-of-speech taggers with inter-annotator agreement loss. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 742–751. Cited by: §4.1.1.
  • G. Pleiss, M. Raghavan, F. Wu, J. Kleinberg, and K. Q. Weinberger (2017) On fairness and calibration. In NIPS, Cited by: §4.2.
  • M. O. Prates, P. H. Avelar, and L. C. Lamb (2019) Assessing gender bias in machine translation: a case study with google translate. Neural Computing and Applications, pp. 1–19. Cited by: §4.3.3.
  • K. Preuer, G. Klambauer, F. Rippmann, S. Hochreiter, and T. Unterthiner (2019) Interpretable deep learning in drug discovery. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 331–345. Cited by: §5.3.2.
  • F. Prost, H. Qian, Q. Chen, E. H. Chi, J. Chen, and A. Beutel (2019) Toward a better trade-off between performance and fairness with kernel-based distribution matching. arXiv preprint arXiv:1910.11779. Cited by: 1st item.
  • J. R. Quinlan (1986) Induction of decision trees. Machine learning 1 (1), pp. 81–106. Cited by: 1st item.
  • A. Raghunathan, J. Steinhardt, and P. Liang (2018) Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344. Cited by: §3.4.2.
  • I. D. Raji, A. Smart, R. N. White, M. Mitchell, T. Gebru, B. Hutchinson, J. Smith-Loud, D. Theron, and P. Barnes (2020) Closing the ai accountability gap: defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 33–44. Cited by: §7.1.2.
  • W. Rawat and Z. Wang (2017) Deep convolutional neural networks for image classification: a comprehensive review. Neural computation 29 (9), pp. 2352–2449. Cited by: §2.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016) “Why should i trust you?” explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: Figure 5, 2nd item, §5.3.3, Table 4.
  • L. Rice, E. Wong, and Z. Kolter (2020) Overfitting in adversarially robust deep learning. In International Conference on Machine Learning, pp. 8093–8104. Cited by: §3.4.1.
  • N. Rieke, J. Hancox, W. Li, F. Milletari, H. R. Roth, S. Albarqouni, S. Bakas, M. N. Galtier, B. A. Landman, K. Maier-Hein, et al. (2020) The future of digital health with federated learning. NPJ digital medicine 3 (1), pp. 1–7. Cited by: §6.3.1.
  • M. Rigaki and S. Garcia (2020) A survey of privacy attacks in machine learning. arXiv preprint arXiv:2007.07646. Cited by: §6.4.1.
  • R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua (2013) Learning separable filters. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2754–2761. Cited by: §8.1.1.
  • J. A. Rodger and P. C. Pendharkar (2004) A field study of the impact of gender and user’s technical experience on the performance of voice-activated medical tracking application. International Journal of Human-Computer Studies 60 (5-6), pp. 529–544. Cited by: §1, §4.3.4, Table 3.
  • C. F. Rodrigues, G. Riley, and M. Luján (2018) SyNERGY: an energy measurement and prediction framework for convolutional neural networks on jetson tx1. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), pp. 375–382. Cited by: §8.3.2.
  • A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2014) Fitnets: hints for thin deep nets. arXiv preprint arXiv:1412.6550. Cited by: §8.1.1.
  • A. Rose (2010) Are face-detection cameras racist. Time Business 1. Cited by: §1.
  • B. D. Rouhani, M. S. Riazi, and F. Koushanfar (2018) Deepsecure: scalable provably-secure deep learning. In Proceedings of the 55th Annual Design Automation Conference, pp. 1–6. Cited by: §6.3.4.
  • B. I. Rubinstein and F. Alda (2017) Diffpriv: an r package for easy differential privacy. Cited by: §6.4.2.
  • C. Rudin (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1 (5), pp. 206–215. Cited by: §5.1.1.
  • S. Russell and P. Norvig (2002) Artificial intelligence: a modern approach. Cited by: §2.
  • H. J. Ryu, M. Mitchell, and H. Adam (2017) Improving smiling detection with race and gender diversity. arXiv preprint arXiv:1712.00193 1 (2), pp. 7. Cited by: §4.3.2, Table 3.
  • P. Saadatpanah, A. Shafahi, and T. Goldstein (2020) Adversarial attacks on copyright detection systems. In International Conference on Machine Learning, pp. 8307–8315. Cited by: §3.5.3.
  • M. Sabt, M. Achemlal, and A. Bouabdallah (2015) Trusted execution environment: what it is, and what it is not. In 2015 IEEE Trustcom/BigDataSE/ISPA, Vol. 1, pp. 57–64. Cited by: §6.2.2.
  • A. Sadeghi, T. Schneider, and I. Wehrenberg (2009) Efficient privacy-preserving face recognition. In International Conference on Information Security and Cryptology, pp. 229–244. Cited by: §6.3.2.
  • P. Sajda (2006) Machine learning for detection and diagnosis of disease. Annu. Rev. Biomed. Eng. 8, pp. 537–565. Cited by: §1.
  • P. Saleiro, B. Kuester, L. Hinkson, J. London, A. Stevens, A. Anisfeld, K. T. Rodolfa, and R. Ghani (2018) Aequitas: a bias and fairness audit toolkit. arXiv preprint arXiv:1811.05577. Cited by: §4.4.2.
  • C. Sandvig, K. Hamilton, K. Karahalios, and C. Langbort (2014) Auditing algorithms: research methods for detecting discrimination on internet platforms. Data and discrimination: converting critical concerns into productive inquiry 22, pp. 4349–4357. Cited by: §7.1.2.
  • M. Sap, D. Card, S. Gabriel, Y. Choi, and N. A. Smith (2019) The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1668–1678. Cited by: §4.1.1.
  • N. A. Saxena, K. Huang, E. DeFilippis, G. Radanovic, D. C. Parkes, and Y. Liu (2019) How do fairness definitions fare? examining public attitudes towards algorithmic definitions of fairness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 99–106. Cited by: §4.1.2.
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE transactions on neural networks 20 (1), pp. 61–80. Cited by: §3.2, §3.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: Figure 10, 1st item, Table 4.
  • A. Shafahi, W. R. Huang, M. Najibi, O. Suciu, C. Studer, T. Dumitras, and T. Goldstein (2018) Poison frogs! targeted clean-label poisoning attacks on neural networks. In Advances in Neural Information Processing Systems, pp. 6103–6113. Cited by: §3.3.2.
  • A. Shafahi, M. Najibi, A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein (2019) Adversarial training for free! arXiv preprint arXiv:1904.12843. Cited by: §3.4.1.
  • D. S. Shah, H. A. Schwartz, and D. Hovy (2020) Predictive biases in natural language processing models: a conceptual framework and overview. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5248–5264. Cited by: §4.1.1.
  • D. Shah, H. A. Schwartz, and D. Hovy (2019) Predictive biases in natural language processing models: a conceptual framework and overview. arXiv preprint arXiv:1912.11078. Cited by: §4.1.1.
  • W. Shang, K. Sohn, D. Almeida, and H. Lee (2016) Understanding and improving convolutional neural networks via concatenated rectified linear units. In international conference on machine learning, pp. 2217–2225. Cited by: §8.1.1.
  • M. J. Sheller, B. Edwards, G. A. Reina, J. Martin, S. Pati, A. Kotrotsou, M. Milchenko, W. Xu, D. Marcus, R. R. Colen, et al. (2020) Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Scientific reports 10 (1), pp. 1–12. Cited by: §6.3.1.
  • E. Sheng, K. Chang, P. Natarajan, and N. Peng (2019) The woman worked as a babysitter: on biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3407–3412. Cited by: Table 3.
  • R. Shokri, M. Stronati, C. Song, and V. Shmatikov (2017) Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18. Cited by: §6.2.1.
  • S. K. Lam, D. Frankowski, and J. Riedl (2006) Do you trust your recommendations? an exploration of security and privacy issues in recommender systems. In International Conference on Emerging Trends in Information and Communication Security, pp. 14–29. Cited by: §6.3.3.
  • K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: Figure 7, Figure 8, 1st item, 2nd item, Table 4.
  • C. Sitawarin, A. N. Bhagoji, A. Mosenia, M. Chiang, and P. Mittal (2018) Darts: deceiving autonomous cars with toxic signs. arXiv preprint arXiv:1802.06430. Cited by: §3.5.1, §3.
  • N. A. Smuha (2019) The eu approach to ethics guidelines for trustworthy artificial intelligence. Computer Law Review International 20 (4), pp. 97–106. Cited by: §1, §1, §1, §1, §10.1, §8.
  • L. Song, R. Shokri, and P. Mittal (2019) Privacy risks of securing machine learning models against adversarial examples. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 241–257. Cited by: §1, §9.2.
  • S. Song, K. Chaudhuri, and A. D. Sarwate (2013) Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing, pp. 245–248. Cited by: §6.3.4.
  • D. Stamoulis, T. Chin, A. K. Prakash, H. Fang, S. Sajja, M. Bognar, and D. Marculescu (2018) Designing adaptive neural networks for energy-constrained image classification. In Proceedings of the International Conference on Computer-Aided Design, pp. 1–8. Cited by: §8.1.2.
  • G. Stanovsky, N. A. Smith, and L. Zettlemoyer (2019) Evaluating gender bias in machine translation. arXiv preprint arXiv:1906.00591. Cited by: Table 3.
  • E. Strubell, A. Ganesh, and A. McCallum (2019) Energy and policy considerations for deep learning in nlp. arXiv preprint arXiv:1906.02243. Cited by: §1, §8.2, Table 5, §8, §9.1.
  • S. Sun, Y. Cheng, Z. Gan, and J. Liu (2019) Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355. Cited by: §8.1.1.
  • V. Sze, Y. Chen, T. Yang, and J. S. Emer (2017) Efficient processing of deep neural networks: a tutorial and survey. Proceedings of the IEEE 105 (12), pp. 2295–2329. Cited by: §2.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §3.2, §3.
  • R. Tang, Y. Lu, L. Liu, L. Mou, O. Vechtomova, and J. Lin (2019) Distilling task-specific knowledge from bert into simple neural networks. arXiv preprint arXiv:1903.12136. Cited by: §8.1.1.
  • R. Tatman (2016) Google’s speech recognition has a gender bias. Making Noise and Hearing Things 12. Cited by: §4.3.4, Table 3.
  • [310] (2021) TensorFlow federated. Cited by: §6.4.2.
  • [311] (2021) TensorFlow privacy. Cited by: §6.4.2.
  • [312] (2017) The montreal declaration of responsible ai. Note: Accessed March 18, 2021. Cited by: 3rd item.
  • S. Thiebes, S. Lins, and A. Sunyaev (2020) Trustworthy artificial intelligence. Electronic Markets, pp. 1–18. Cited by: §1, §1, §2.
  • E. Tjoa and C. Guan (2020) A survey on explainable artificial intelligence (xai): toward medical xai. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §5.4.1, Table 4.
  • F. Tramer, V. Atlidakis, R. Geambasu, D. Hsu, J. Hubaux, M. Humbert, A. Juels, and H. Lin (2017) Fairtest: discovering unwarranted associations in data-driven applications. In 2017 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 401–416. Cited by: §4.4.2.
  • F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart (2016) Stealing machine learning models via prediction apis. In 25th USENIX Security Symposium (USENIX Security 16), pp. 601–618. Cited by: §6.2.1.
  • D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry (2018) Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152. Cited by: §1, §3.4.1.
  • Z. Tufekci (2014) Big questions for social media big data: representativeness, validity and other methodological pitfalls. arXiv preprint arXiv:1403.7400. Cited by: §4.1.1.
  • J. Vamathevan, D. Clark, P. Czodrowski, I. Dunham, E. Ferran, G. Lee, B. Li, A. Madabhushi, P. Shah, M. Spitzer, et al. (2019) Applications of machine learning in drug discovery and development. Nature Reviews Drug Discovery 18 (6), pp. 463–477. Cited by: §5.3.2.
  • E. Vanmassenhove, C. Hardmeier, and A. Way (2019) Getting gender right in neural machine translation. arXiv preprint arXiv:1909.05088. Cited by: Table 3.
  • S. Wachter, B. Mittelstadt, and C. Russell (2017) Counterfactual explanations without opening the black box: automated decisions and the gdpr. Harv. JL & Tech. 31, pp. 841. Cited by: 4th item, §5.2.4.
  • T. Wang, Z. Zheng, A. Bashir, A. Jolfaei, and Y. Xu (2020a) Finprivacy: a privacy-preserving mechanism for fingerprint identification. ACM Transactions on Internet Technology. Cited by: §6.3.2.
  • X. Wang, X. He, F. Feng, L. Nie, and T. Chua (2018) Tem: tree-enhanced embedding model for explainable recommendation. In Proceedings of the 2018 World Wide Web Conference, pp. 1543–1552. Cited by: §5.3.1.
  • Y. Wang, Q. Wang, S. Shi, X. He, Z. Tang, K. Zhao, and X. Chu (2020b) Benchmarking the performance and energy efficiency of ai accelerators for ai training. In 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pp. 744–751. Cited by: §8.3.1.
  • Z. Wang, K. Qinami, I. C. Karakozis, K. Genova, P. Nair, K. Hata, and O. Russakovsky (2020c) Towards fairness in visual recognition: effective strategies for bias mitigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8919–8928. Cited by: §3.4.1, §4.3.2.
  • S. L. Warner (1965) Randomized response: a survey technique for eliminating evasive answer bias. Journal of the American Statistical Association 60 (309), pp. 63–69. Cited by: §6.2.2.
  • K. Wei, J. Li, M. Ding, C. Ma, H. H. Yang, F. Farokhi, S. Jin, T. Q. Quek, and H. V. Poor (2020) Federated learning with differential privacy: algorithms and performance analysis. IEEE Transactions on Information Forensics and Security 15, pp. 3454–3469. Cited by: §6.2.2.
  • J. Whittlestone, R. Nyrup, A. Alexandrova, and S. Cave (2019) The role and limits of principles in ai ethics: towards a focus on tensions. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 195–200. Cited by: §1.
  • M. Wieringa (2020) What to account for when accounting for algorithms: a systematic literature review on algorithmic accountability. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 1–18. Cited by: §7.1.1, §7.2.1.
  • M. J. Wolf, K. W. Miller, and F. S. Grodzinsky (2017) Why we should have seen that coming: comments on microsoft’s tay “experiment,” and wider implications. The ORBIT Journal 1 (2), pp. 1–12. Cited by: §1, §4.
  • D. H. Wolpert and W. G. Macready (1997) No free lunch theorems for optimization. IEEE transactions on evolutionary computation 1 (1), pp. 67–82. Cited by: §4.1.1.
  • E. Wong and Z. Kolter (2018) Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5286–5295. Cited by: §3.4.2.
  • E. Wong, L. Rice, and J. Z. Kolter (2020) Fast is better than free: revisiting adversarial training. arXiv preprint arXiv:2001.03994. Cited by: §3.4.1.
  • E. Wong, F. Schmidt, and Z. Kolter (2019) Wasserstein adversarial examples via projected sinkhorn iterations. In International Conference on Machine Learning, pp. 6808–6817. Cited by: §3.3.1.
  • D. Wu, S. Xia, and Y. Wang (2020a) Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems 33. Cited by: §3.4.1.
  • K. Wu, A. Wang, and Y. Yu (2020b) Stronger and faster wasserstein adversarial attacks. In International Conference on Machine Learning, pp. 10377–10387. Cited by: §3.3.1.
  • Y. N. Wu, J. S. Emer, and V. Sze (2019) Accelergy: an architecture-level energy estimation methodology for accelerator designs. In 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–8. Cited by: §8.3.2.
  • D. Xu, S. Yuan, L. Zhang, and X. Wu (2018) FairGAN: fairness-aware generative adversarial networks. 2018 IEEE International Conference on Big Data (Big Data), pp. 570–575. Cited by: §4.2.
  • H. Xu, X. Liu, Y. Li, and J. Tang (2020a) To be robust or to be fair: towards fairness in adversarial training. arXiv preprint arXiv:2010.06121. Cited by: §9.2.
  • H. Xu, Y. Ma, H. Liu, D. Deb, H. Liu, J. Tang, and A. K. Jain (2020b) Adversarial attacks and defenses in images, graphs and text: a review. International Journal of Automation and Computing 17 (2), pp. 151–178. Cited by: §1, §3.5.1, §3.6.1.
  • J. Xu, B. S. Glicksberg, C. Su, P. Walker, J. Bian, and F. Wang (2021) Federated learning for healthcare informatics. Journal of Healthcare Informatics Research 5 (1), pp. 1–19. Cited by: §6.3.1.
  • Q. Yang, Y. Liu, T. Chen, and Y. Tong (2019) Federated machine learning: concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10 (2), pp. 1–19. Cited by: §6.4.1.
  • T. Yang, Y. Chen, and V. Sze (2017) Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5687–5695. Cited by: §8.1.2.
  • A. C. Yao (1982) Protocols for secure computations. In 23rd annual symposium on foundations of computer science (sfcs 1982), pp. 160–164. Cited by: §6.2.2.
  • C. Yeo and A. Chen (2020) Defining and evaluating fair natural language generation. In Proceedings of the Fourth Widening Natural Language Processing Workshop, pp. 107–109. Cited by: Table 3.
  • R. Ying, D. Bourgeois, J. You, M. Zitnik, and J. Leskovec (2019) Gnnexplainer: generating explanations for graph neural networks. Advances in neural information processing systems 32, pp. 9240. Cited by: Figure 6, 2nd item, §5.3.2, Table 4.
  • H. Yu, Z. Shen, C. Miao, C. Leung, V. R. Lesser, and Q. Yang (2018) Building ethics into artificial intelligence. arXiv preprint arXiv:1812.02953. Cited by: §7.1.1.
  • H. Yuan, J. Tang, X. Hu, and S. Ji (2020a) Xgnn: towards model-level explanations of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 430–438. Cited by: §5.3.2.
  • H. Yuan, H. Yu, S. Gui, and S. Ji (2020b) Explainability in graph neural networks: a taxonomic survey. arXiv preprint arXiv:2012.15445. Cited by: §5.4.1, Table 4.
  • M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi (2017) Fairness beyond disparate treatment & disparate impact: learning classification without disparate mistreatment. In Proceedings of the 26th international conference on world wide web, pp. 1171–1180. Cited by: Table 2.
  • W. Zaremba, I. Sutskever, and O. Vinyals (2014) Recurrent neural network regularization. arXiv preprint arXiv:1409.2329. Cited by: §3.2, §3.
  • R. Zemel, L. Y. Wu, K. Swersky, T. Pitassi, and C. Dwork (2013) Learning fair representations. In ICML, Cited by: §4.2.
  • F. Zerka, S. Barakat, S. Walsh, M. Bogowicz, R. T. Leijenaar, A. Jochems, B. Miraglio, D. Townend, and P. Lambin (2020) Systematic review of privacy-preserving distributed machine learning from federated databases in health care. JCO clinical cancer informatics 4, pp. 184–200. Cited by: §6.3.1.
  • B. H. Zhang, B. Lemoine, and M. Mitchell (2018) Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 335–340. Cited by: Table 2.
  • G. Zhang, B. Bai, J. Zhang, K. Bai, C. Zhu, and T. Zhao (2020a) Demographics should not be the reason of toxicity: mitigating discrimination in text classifications with instance weighting. arXiv preprint arXiv:2004.14088. Cited by: §4.2, Table 2, Table 3.
  • H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, and M. Jordan (2019) Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pp. 7472–7482. Cited by: §3.4.1.
  • L. Zhang, Y. Wu, and X. Wu (2017) A causal framework for discovering and removing direct and indirect discrimination. In IJCAI, Cited by: §4.1.1.
  • S. Zhang, H. Yin, T. Chen, Z. Huang, L. Cui, and X. Zhang (2021) Graph embedding for recommendation against attribute inference attacks. arXiv preprint arXiv:2101.12549. Cited by: §6.3.3.
  • W. E. Zhang, Q. Z. Sheng, A. Alhazmi, and C. Li (2020b) Adversarial attacks on deep-learning models in natural language processing: a survey. ACM Transactions on Intelligent Systems and Technology (TIST) 11 (3), pp. 1–41. Cited by: §3.6.1.
  • X. Zhang, N. Wang, H. Shen, S. Ji, X. Luo, and T. Wang (2020c) Interpretable deep learning under fire. In 29th USENIX Security Symposium (USENIX Security 20), Cited by: 1st item.
  • Y. Zhang and X. Chen (2018) Explainable recommendation: a survey and new perspectives. arXiv preprint arXiv:1804.11192. Cited by: §5.1.1, §5.4.1, Table 4.
  • Y. Zhang, G. Lai, M. Zhang, Y. Zhang, Y. Liu, and S. Ma (2014) Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pp. 83–92. Cited by: §5.3.1.
  • Y. Zhang, R. Jia, H. Pei, W. Wang, B. Li, and D. Song (2020d) The secret revealer: generative model-inversion attacks against deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 253–261. Cited by: §6.2.1.
  • Z. Zhang and D. B. Neill (2016) Identifying significant predictive bias in classifiers. arXiv preprint arXiv:1611.08292. Cited by: Table 2.
  • B. Zhao, K. R. Mopuri, and H. Bilen (2020) Idlg: improved deep leakage from gradients. arXiv preprint arXiv:2001.02610. Cited by: §6.3.4.
  • J. Zhao, T. Wang, M. Yatskar, R. Cotterell, V. Ordonez, and K. Chang (2019) Gender bias in contextualized word embeddings. arXiv preprint arXiv:1904.03310. Cited by: Table 3.
  • J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K. Chang (2017) Men also like shopping: reducing gender bias amplification using corpus-level constraints. arXiv preprint arXiv:1707.09457. Cited by: §4.3.2.
  • J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K. Chang (2018) Gender bias in coreference resolution: evaluation and debiasing methods. arXiv preprint arXiv:1804.06876. Cited by: §4.3.3.
  • B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929. Cited by: Figure 9, 1st item, Table 4.
  • L. Zhu and S. Han (2020) Deep leakage from gradients. In Federated Learning, pp. 17–31. Cited by: §6.2.1, §6.2.2, §6.3.4.
  • I. Zliobaite (2015) A survey on measuring indirect discrimination in machine learning. arXiv preprint arXiv:1511.00148. Cited by: §4.4.1.
  • J. Zou and L. Schiebinger (2018) AI can be sexist and racist—it’s time to make it fair. Nature Publishing Group. Cited by: 5th item.
  • D. Zügner, A. Akbarnejad, and S. Günnemann (2018) Adversarial attacks on neural networks for graph data. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ISBN 9781450355520. Cited by: §3.2.
  • D. Zügner, A. Akbarnejad, and S. Günnemann (2018) Adversarial attacks on neural networks for graph data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2847–2856. Cited by: §3.1.1, §3.2, §3.5.4.
  • D. Zügner and S. Günnemann (2019) Adversarial attacks on graph neural networks via meta learning. arXiv preprint arXiv:1902.08412. Cited by: §3.5.4.