Building Legal Datasets

by Jerrold Soh et al.
Singapore Management University

Data-centric AI calls for better, not just bigger, datasets. As data protection laws with extra-territorial reach proliferate worldwide, ensuring datasets are legal is an increasingly crucial yet overlooked component of “better”. To help dataset builders become more willing and able to navigate this complex legal space, this paper reviews key legal obligations surrounding ML datasets, examines the practical impact of data laws on ML pipelines, and offers a framework for building legal datasets.




1 Introduction

Data-centric AI is about making better datasets. But what does “better” mean? Conventionally it has meant cheaper: that is, easier to crowdsource Irani and Silberman (2013), generate Collins et al. (2008), augment Dao et al. (2019), or broadly to collect Paullada et al. (2020). Bigger is also often better, as the rise of large language models suggests Hendrycks et al. (2020); Zhong et al. (2021). To statisticians, better typically means unbiased, though “bias” is used differently there than in the bias-variance tradeoff Geman et al. (1992) or in algorithmic bias Friedman and Nissenbaum (1996). The growing “responsible AI” literature emphasizes that datasets are better when they are ethically and fairly sourced Paullada et al. (2020); Hutchinson et al. (2021); Rogers et al. (2021). This paper underscores legality as one desideratum for “better”. To this end, it reviews key legal obligations on data collection and use, examines the practical impact of data laws on ML pipelines, and offers a framework for thinking about data legality.

2 When are datasets legal?

Legal datasets may be understood broadly as datasets which are legally collected, retained, processed, and disseminated. This fourfold categorization builds off Solove’s classic taxonomy of privacy Solove (2006), and finds expression in a range of relatively new legislation worldwide. This notably includes the European Union’s (EU’s) General Data Protection Regulation (GDPR) which came into force in 2018. Parallel to the GDPR are national data laws, such as South Korea’s Personal Information Protection Act (passed in 2011), Singapore’s Personal Data Protection Act (passed in 2014) and, most recently, China’s Personal Information Protection Law (PIPL, August 2021). While the US does not presently have data legislation at the federal level, states like California, New York, and Massachusetts have passed data privacy acts. Further, legal scholars and courts have increasingly considered how pre-existing laws, such as copyright and anti-discrimination law, affect ML datasets Sag (2019); Mayson (2019); Gillis and Spiess (2019). As there are too many countries and variations to cover, I use the GDPR, PIPL, and California’s Consumer Privacy Act (CCPA, 2018) as case studies.

Although one jurisdiction’s laws generally do not apply in another, modern data laws tend to have extra-territorial effect. Both the GDPR and PIPL apply as long as any personal data about persons in the EU/China is processed for any commercial or behavioral monitoring purposes (GDPR, Art 3; PIPL, Art 3). Likewise, Art 2 of the EU’s proposed Artificial Intelligence Act (AIA) expressly covers AI systems deployed in, or whose outputs are used in, the EU, regardless of where the providers and users of the system are. By contrast, the CCPA applies primarily to large businesses which “do business in” the state (CCPA, §1798.140). Thus, ML researchers and practitioners worldwide are now subject to foreign, and increasingly complex Koops (2014), data laws. Below I non-exhaustively review key legal obligations they impose. Note that “legal” here refers only to formal law. This distinguishes my scope from (no less important) work on “ethical”, “fair” or “responsible” AI Paullada et al. (2020); Hutchinson et al. (2021); Rogers et al. (2021). Despite clear overlaps, neither is a subset of the other. To illustrate, for some in certain states abortion is ethical yet illegal; for others elsewhere it is unethical yet legal.

2.1 Collection

Most centrally, data protection laws require informed consent before “personal” data may be obtained (GDPR, Arts 6–11; PIPL, Arts 13–17). The CCPA does not expressly require “consent”, but businesses must inform consumers of the scope and purposes of data collected before collection (CCPA, §1798.100). The legality of numerous facial recognition datasets has been challenged for lack of consent noa (2021); Paullada et al. (2020); O’Brien and Ortutay (2021). Facial recognition clearly involves personal data because the task is to identify. “Personal data” is, however, wider. Article 4 GDPR defines it as “any information relating to an identified or identifiable natural person”. A person is “identifiable” when they may be identified directly (i.e. by name) or indirectly. Names are not necessary; zip codes, gender, race, etc., could collectively identify. Indeed, Wong suggests that the EU’s definition of personal data “appears to be capable of encompassing all information in its ambit”, as EU courts have taken “personal” to include not only data about a person, but also data which affects them Wong (2019).

The breadth of data laws explains why, although most of ImageNet’s Deng et al. (2009) label classes do not target persons, its caretakers recently blurred out all human faces in the data, citing privacy concerns Knight (2021). A similar fate appears to have befallen Meta’s facial recognition systems O’Brien and Ortutay (2021). While most legal scrutiny has been on images, text, sound, and other modalities can also be “personal”. One’s forum posts, even if pseudonymous, could reveal much of their background. As such, dataset builders should be deliberate about obtaining consent even (or especially) when it is not obvious whether the data is “personal”.
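In practice, a dataset builder might run a crude pre-collection scan for fields that look personal, so consent questions surface before data is gathered. The sketch below is a minimal illustration using regex heuristics for a few common identifier types; the patterns, field names, and function name are illustrative assumptions, not a legal test for “personal data” (which, as discussed, is far wider than direct identifiers).

```python
import re

# Illustrative heuristics for direct identifiers and quasi-identifiers.
# Matching none of these does NOT make data non-personal under the GDPR.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\+?\d[\d\s-]{7,}\d\b"),
    "zip_code": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}

def flag_personal_data(records):
    """Return (record_index, field, kind) triples for values matching a
    personal-data heuristic, i.e. values likely requiring informed consent."""
    flags = []
    for i, record in enumerate(records):
        for field, value in record.items():
            for kind, pattern in PATTERNS.items():
                if pattern.search(str(value)):
                    flags.append((i, field, kind))
    return flags

records = [
    {"bio": "Cat lover, reach me at jane@example.com", "city": "Austin"},
    {"bio": "No contact info here", "city": "Lyon"},
]
print(flag_personal_data(records))  # [(0, 'bio', 'email')]
```

A flagged field would then trigger a human review rather than automatic exclusion, since quasi-identifiers (gender, race, zip code) may identify only in combination.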

2.2 Retention

A standard feature of legal consent is that consent may be withdrawn at any time. Data subjects may request to correct or erase their data (GDPR, Arts 16–17; PIPL, Arts 15, 16, 44–47; CCPA, §1798.105–106). Beyond consent, data controllers are also obliged to keep data in personally-identifiable form for no longer than necessary for its stated purposes (GDPR, Art 5(1)(e); CCPA, §1798.100(3)). Data that has served these purposes (say, the model has been trained) must be deleted or anonymized. However, given that data previously collected for one purpose can turn out useful for another, deletion may be quite undesirable for ML engineers. Anonymization is not much better, since preventing re-identification may require destroying most of a dataset’s informative signals Rocher et al. (2019); Xu and Zhang (2021).

As such, prior thought should be given to delineating, and communicating, what the data will be used for. Conveying a specific purpose such as “training ML models” may not cover maintaining or updating the model post-deployment. Too general a purpose, such as “for ML processing”, invites user suspicion and may fall outside the legal requirement that consent be given in respect of “specific” purposes (GDPR, Art 6(1)(a); PIPL, Art 6).
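One way to make a pipeline robust to consent withdrawal is to key every example to its data subject and record the consented purposes alongside it. The sketch below illustrates this under assumed names (`ConsentAwareDataset`, `withdraw`); it is a toy, and a production pipeline would also re-generate downstream artefacts such as shards, caches, and trained checkpoints after each erasure.

```python
from dataclasses import dataclass, field

@dataclass
class ConsentAwareDataset:
    """Toy dataset keyed by subject ID, supporting consent withdrawal."""
    records: dict = field(default_factory=dict)   # subject_id -> record
    stale_artefacts: bool = False                 # True after any erasure

    def add(self, subject_id, example, consented_purposes):
        self.records[subject_id] = {"x": example,
                                    "purposes": set(consented_purposes)}

    def withdraw(self, subject_id):
        # Erasure request (GDPR Art 17; CCPA §1798.105): drop the record
        # and mark derived artefacts for re-generation.
        if self.records.pop(subject_id, None) is not None:
            self.stale_artefacts = True

    def training_view(self, purpose):
        # Expose only examples whose stated purposes cover this use.
        return [r["x"] for r in self.records.values()
                if purpose in r["purposes"]]

ds = ConsentAwareDataset()
ds.add("u1", [0.1, 0.2], {"model_training"})
ds.add("u2", [0.3, 0.4], {"analytics"})
ds.withdraw("u1")
print(ds.training_view("model_training"))  # [] -- u1 erased, u2 not consented
```

Keeping purposes machine-readable also makes the “specific purposes” question auditable: a new use of old data fails loudly instead of silently.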

2.3 Processing

Consent obligations surrounding data collection apply equally to data use. Further, data subjects have a right to be informed of, and to object to, decisions “based solely on automated processing” (GDPR, Art 22; see also PIPL, Art 24). This legally advantages human-in-the-loop systems. Beyond data protection laws, anti-discrimination laws in certain jurisdictions (e.g. US disparate treatment/impact laws; the UK’s Equality Act 2010) may prohibit the use of protected attributes like race and gender for profiling Hellman (2020). This restricts the feature set which can legally be used for training ML models. Features highly collinear with protected attributes may be indirectly prohibited as well.
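A rough pre-training screen might drop protected attributes along with any feature strongly correlated with them. The sketch below uses plain Pearson correlation with an illustrative 0.8 threshold; both the threshold and the toy numbers are assumptions, not a legal standard, and real proxy detection (e.g. for nonlinear or multi-feature proxies) is considerably harder.

```python
import statistics

def drop_protected_and_correlates(features, protected, threshold=0.8):
    """Remove protected columns and any feature whose absolute Pearson
    correlation with a protected column exceeds `threshold`.

    `features` maps column name -> list of numeric values.
    """
    def pearson(xs, ys):
        mx, my = statistics.fmean(xs), statistics.fmean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy) if sx and sy else 0.0

    kept = {}
    for name, values in features.items():
        if name in protected:
            continue  # protected attribute itself
        if any(abs(pearson(values, features[p])) > threshold
               for p in protected):
            continue  # strong proxy for a protected attribute
        kept[name] = values
    return kept

cols = {
    "gender": [0, 0, 1, 1],
    "height": [160, 165, 178, 182],  # tracks gender in this toy data
    "tenure": [2, 7, 3, 6],
}
legal = drop_protected_and_correlates(cols, protected={"gender"})
print(sorted(legal))  # ['tenure']
```

Whether such exclusion is legally sufficient (or even required) depends on the jurisdiction and the doctrine — disparate impact analysis, for instance, looks at outcomes rather than inputs.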

While the obligations above may be more relevant to models than to datasets, laws can target the latter directly. Most prominently, the proposed Art 10 AIA stipulates that “[t]raining, validation and testing data sets” to be used in what the Act identifies as “high-risk AI systems” shall be “relevant, representative, free of errors and complete”. This extends to having “appropriate statistical properties” regarding the system’s target persons, and considering characteristics “particular to the … setting within which the high-risk AI system is intended to be used”. The draft AIA is in early stages and may take years and numerous amendments to come into force (if it does). Should it become law as is, it may effectively render data-centricity legally mandatory for “high-risk” AI. Examples of high-risk AI enumerated in Annex III AIA non-exhaustively include systems for biometric identification, educational assessments, recruitment, credit scoring, law enforcement, and judicial decisions.

2.4 Sharing and disclosure

As data sharing or disclosure also constitutes processing, unauthorized disclosure is also a breach (GDPR, Art 4(2); PIPL, Art 25). Thus, datasets with potentially personal information cannot be open-sourced without proper anonymization, even for research purposes. Another concern particular to large neural networks is the possibility that the network may memorize and leak personal information in the training data Nasr et al. (2019); Chen et al. (2020). Personal information in datasets used for training large neural networks may therefore need to be removed before training.
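At a minimum, known identifier patterns can be redacted before training, and model outputs can later be checked for verbatim reproduction of known personal values. The sketch below is a crude illustration of both steps; serious leakage audits use stronger tools such as canary exposure or membership-inference attacks Nasr et al. (2019), and the pattern here covers only one identifier type.

```python
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def redact(text):
    """Replace e-mail addresses with a placeholder before training."""
    return EMAIL.sub("[EMAIL]", text)

def leaked(generated_texts, known_pii):
    """Crude post-hoc check: which known personal values does the model
    reproduce verbatim in its outputs?"""
    return [v for v in known_pii
            if any(v in g for g in generated_texts)]

corpus = ["contact me: jane@example.com", "nothing personal here"]
clean = [redact(t) for t in corpus]
print(clean[0])                              # contact me: [EMAIL]
print(leaked(clean, ["jane@example.com"]))   # []
```

Redaction before training also simplifies the disclosure analysis: what the model never saw, it cannot memorize.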

2.5 Research exemptions

The above obligations may be subject to limited research exemptions whose scope differs across jurisdictions Mabel and Tara (2019). For instance, both the GDPR and the CCPA regard subsequent scientific or statistical research as compatible with the initial purposes of the data collection for which consent was presumably obtained. This allows research to proceed without needing to ask for additional consent, subject to appropriate safeguards (GDPR, Arts 5(1)(b) & 89(1); CCPA, §1798.140(s)). This may have been sufficient to cover research applications of the ImageNet data (discussed above) without requiring anonymization. China’s PIPL, however, does not have such an exemption.

3 Implications on ML pipelines

There is, in short, an expanding range of legal constraints on when and how data may be used. This has obvious implications for the ML community. Since legal data is necessarily a subset of all data, prioritizing legality seems to require sacrificing model performance. But less data is not always worse, especially if it also means less noise. More formally, if we think of ML broadly as seeking $\max_w P(h_w(X), y)$, where the hypothesis $h_w$ takes weights $w$ learned from features $X$ in dataset $D$, and $P$ is a performance metric measured against (holdout) truth labels $y$, then legality constraints might be understood as follows:

$$\max_w P(h_w(X), y) \quad \text{subject to} \quad X \subseteq X_L,\; D \in \mathcal{D}_L,$$

with $X_L$ and $\mathcal{D}_L$ respectively denoting the legally-permissible sets of features and datasets.

Formally framing the problem as such identifies three situations where legal constraints may not necessarily limit model performance. For brevity, we illustrate this with $\mathcal{D}_L$, though similar logic applies with $X_L$. First, if $\mathcal{D}_L = \mathcal{D}$ (the set of all data relevant to a task). That is, data laws have no practical effect on datasets in that area. For example, the task involves only cat detection and never implicates personal data. Second, if $D^*$, the theoretically optimal dataset, happens to be perfectly legal so that $D^* \in \mathcal{D}_L$. Third, and least obviously, if the legally-constrained optimization problem produces better performance than its unconstrained variant. This counter-intuitive result may occur in the real world because the law may force one to exclude noisy data (or features) that would otherwise have been included. In this sense, data laws, like other optimization constraints, may turn out to have a useful regularizing effect.
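The constrained optimization just described can be illustrated as brute-force subset selection: enumerate candidate feature subsets, discard those outside the legally-permissible feature set, and keep the best-scoring remainder. The scoring numbers and feature names below are made up purely for illustration; in the toy, the excluded protected feature would have raised the raw score, while the law-irrelevant noisy feature is dropped by the optimization itself.

```python
from itertools import combinations

def best_legal_features(all_features, legal_features, score):
    """Enumerate non-empty feature subsets, keep only the fully-legal
    ones, and return the subset maximizing the performance metric."""
    candidates = [
        subset
        for r in range(1, len(all_features) + 1)
        for subset in combinations(sorted(all_features), r)
        if set(subset) <= legal_features
    ]
    return max(candidates, key=score)

# Toy additive score: "race" would help raw accuracy but is illegal to
# use; "noise" is legal but hurts, so the optimizer excludes it anyway.
gains = {"income": 0.30, "tenure": 0.20, "race": 0.25, "noise": -0.10}
score = lambda subset: sum(gains[f] for f in subset)

chosen = best_legal_features(
    all_features={"income", "tenure", "race", "noise"},
    legal_features={"income", "tenure", "noise"},
    score=score,
)
print(chosen)  # ('income', 'tenure')
```

Real feature selection would score subsets via held-out validation rather than a fixed table, but the structure of the problem — maximize performance over a legally-filtered search space — is the same.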

Moreover, in practice $P$ is not the only metric to be optimized. Even assuming we care solely about economic value, profits, while correlated with (F1 score) performance, turn also on variables like user adoption and trust Toreini et al. (2020); Gillath et al. (2021); Kerasidou et al. (2021). A perfectly accurate classifier that is never used generates no revenue. Fines also reduce profit. Ignoring legality provides another source of “hidden debt” in ML pipelines Sculley et al. (2015). Thus, early investments in processes and practices for making legal datasets could yield better real-world performance, particularly in the long run (where legal enforcement becomes more feasible). Apart from any obligation to follow the law just because it is law, there are practical reasons why the ML community should do so.

4 Building legal datasets

Complying with the intricate and growing web of data laws is non-trivial. The challenge is how we might turn motherhood calls for “multi-disciplinary collaboration” into actionable steps for ML researchers. The rise of ethics guidelines and responsible AI checklists Zook et al. (2017); Rogers et al. (2021) offers one solution. In a sense, this involves ethicists, sociologists, etc. pre-computing complex, open-ended obligations into simpler, close-ended compliance heuristics for computer scientists. Following this trend, Figure 1 offers a framework for thinking about dataset legality. This builds on existing work that already incorporates some legal principles (e.g. Rogers et al. (2021)) but differs in two ways. First, the framework focuses more on legality and thus complements responsible/ethical AI work. Second, while checklists and impact statements are generally backward-looking, encouraging researchers to justify choices already made, these considerations are forward-looking, encouraging researchers to think about legality at each stage of the ML process. This is crucial because legal errors, especially the need to obtain informed consent for processing, are expensive to rectify post facto if not avoided ex ante.

Before Collection — Which Law(s) Apply?

1. Will directly or indirectly personally-identifying data (“personal data”) be collected? Note the wide scope of “personal data” (see Part 2.1).
2. If personal data is collected, which countries are we (the data processors) in? What data laws do these countries have?
3. What countries are the data subjects in? Do these countries have extra-territorial data laws?
4. What is the data for? Are these purposes (e.g. research) subject to any exemptions?

During Collection — Obtaining Consent

1. Are subjects adequately informed of the purposes of, and do they consent to, data collection? Adequate “consent” varies by country, but refer generally to GDPR, Arts 6(1)(a) & 7.
2. Ensure that communicated purposes truly cover all intended data uses. Purposes may be stated more broadly to leave room for future uses, but should not be framed too widely either.

After Collection — Data Use and Model Training

1. Ensure that subjects who withdraw consent are removed from the dataset. Data pipelines should be robust to consent withdrawals. Already-generated artefacts should be re-generated.
2. Check that protected attributes (and strong correlates) are excluded from the feature set.

After Training — Deployment

1. If the system is entirely end-to-end with no humans-in-the-loop, subjects must be informed.
2. If the communicated purposes of data collection have been spent, the data may need to be anonymized or deleted.
3. Does the trained model leak personal data? If so, either tweak the model and its outputs, or obtain consent for data disclosure.

Figure 1: Suggested framework for dataset legality

5 Limitations

All heuristics are wrong, but some are useful Box (1976). This paper does not cover all the legal obligations, duties, and exemptions affecting ML datasets. Nor can following the proposed framework completely guarantee legality (let alone fairness or morality). Indeed, the corpus of legislation affecting ML datasets is set to grow, amid concerns that current laws offer insufficient safeguards Wachter (2019); United Nations High Commissioner for Human Rights (2021). The draft AIA, if and when passed, would significantly alter the AI and data governance landscape; other jurisdictions may follow suit with their own Acts. The minutiae of dataset legality should be fleshed out in future, lengthier work. The primary aim here is to spark discussion on when and why legal data is better data. Data-centric AI presents an opportunity for the ML community to build better datasets — in all the technical, statistical, ethical, and legal senses of the word.

The authors disclose no funding sources nor competing interests.