Approaching Ethical Guidelines for Data Scientists

by   Ursula Garzcarek, et al.

The goal of this article is to inspire data scientists to participate in the debate on the impact that their professional work has on society, and to become active in public debates on the digital world as data science professionals. How do ethical principles (e.g., fairness, justice, beneficence, and non-maleficence) relate to our professional lives? What falls within our responsibility as professionals because of our expertise in the field? More specifically, this article appeals to statisticians to join that debate and to be part of the community that establishes data science as a proper profession in the sense of Airaksinen, a philosopher working on professional ethics. As we will argue, data science has one of its roots in statistics and extends beyond it. To shape the future of statistics, and to take responsibility for the statistical contributions to data science, statisticians should actively engage in the discussions. First the term data science is defined, and the technical changes that have led to a strong influence of data science on society are outlined. Next the systematic approach from CNIL is introduced, and prominent examples are given of ethical issues arising from the work of data scientists. Further, we provide reasons why data scientists should engage in shaping morality around data science and in formulating codes of conduct and codes of practice for it. Next we present established ethical guidelines for the related fields of statistics and computing machinery. Thereafter the steps necessary in the community to develop professional ethics for data science are described. Finally we give our starting statement for the debate: Data science is at the focal point of current societal development. Without becoming a profession with professional ethics, data science will fail to build trust in its interaction with, and its much needed contributions to, society!



1 Definition of data science, the roots and the changing role

We start with the definition of data science given by Donoho, which we find very useful. We then describe how data science relates to statistics and machine learning, and why the role of data scientists in society is becoming increasingly important.

1.1 Definition of data science

There is currently no generally agreed definition of data science. Here we use the definition of Donoho [1] of greater data science:
Data science is the science of learning from data; it studies the methods involved in the analysis and processing of data and proposes technology to improve methods in an evidence-based manner. The scope and impact of this science will expand enormously in coming decades as scientific data and data about science itself become ubiquitously available.

Donoho also provides a classification of the related activities into six divisions:

  1. Data gathering, preparation, and exploration,

  2. data representation and transformation,

  3. computing with data,

  4. data modeling,

  5. data visualization and presentation,

  6. science about data science.

Items 1 to 5 describe the work of a data scientist; item 6 differentiates what he calls greater data science from data science.

1.2 Relation of data science, statistics and artificial intelligence

The lack of an agreed definition of data science is a symptom of a larger problem: it is not (yet) a profession of its own. Some see it as a subdivision of machine learning, and thus of artificial intelligence; others see it as a subdivision of statistics, namely exploratory statistics; and many see it as a collection of methods from both statistics and machine learning, used by people of different professional backgrounds, or by people with no actual professional background who are trained only in the application of those methods, without the necessary formal scientific education. By starting with the definition of Donoho (Section 1.1) we already make two statements:

  1. Data science should become a profession in the sense of Airaksinen [28], with a definition, a grounding in science, and a task and responsibility in society, and

  2. exploratory statistics is a historical predecessor of data science.

With respect to the second point, we do not claim that exploratory statistics is the only predecessor of data science. With equal justification, people from the artificial intelligence community can see machine learning as a historical predecessor of data science. Therefore, we want the machine learning and artificial intelligence communities to work together with the statistics community on the first point.

1.3 The changes of the societal impact of data science

The biggest relatively recent changes in practical data science are the availability of vast amounts of data and the increase in computational power. Technically speaking, this enables fast, low-cost processing of ever-changing large databases by algorithms to derive continuously updated, highly condensed and aggregated data, i.e. results. These results can be fed into human decision making that is based on the interpretation and understanding of the results, or they can be used in rules for automatic decision making. Whether or not the decisions are made, at least at an intermediate stage, with human understanding of the results and of how they were generated distinguishes black-box algorithms from other algorithms.

The focus of this article is the consequences of processing and analysing vast amounts of data about humans and human behaviour. Today's possibilities in these respects change human interaction, and thus society, directly and fundamentally. Examples for this broad claim are given in subsequent sections.

As data science is the focal point of these developments, the role of data scientists in society becomes more influential and important. With increased influence and importance comes increased responsibility.

2 Ethical issues in data science

Awareness that data science and its algorithms have an increasing and fundamental impact on society is high around the world. Discussions are ongoing or starting in many countries and organisations, in legal and political contexts; there are too many to cite. Instead, we refer the reader to any search of news portals, social media and the internet using terms such as algorithm, impact, society.

Such considerations are not really new. To our knowledge, the first data science applications recognised to have a large impact on societal processes were election forecasts and polls on voting behaviour. Many countries therefore have regulations on what may be published, and when, in the context of an upcoming or ongoing election. An overview of such regulations is given in [18].

A systematic approach to identifying, describing and categorising those ethical issues was undertaken by CNIL (Commission nationale de l’informatique et des libertés) in 2017 [38, 39]. The report is the result of a public debate organized by the French data protection authority. We follow its structure and give examples for each of its categories of ethical issues to make them tangible, and we identify the main points relevant for consideration by data scientists.

2.1 Six main ethical issues according to CNIL

In the debate six main issues were identified. A citation from [38] is given at the start of each of the following sections. These citations are set in italics to be easily identifiable.

2.1.1 Autonomous machines: a threat to free will and responsibility?

Delegation of complex and critical decisions and tasks to machines increases the human capacity to act and poses a threat to human autonomy and free will and may water down responsibilities.

The most widely discussed applications of this type are autonomous vehicles. Autonomous vehicles have the potential to increase traffic safety, but who is responsible for the remaining accidents? Will it be possible to overrule a machine’s decision on the lowest or allowable risk, e.g. in case of an emergency?

On a more abstract level any sufficiently complex system may be called an autonomous machine.

Already today many Kafkaesque situations arise from complex semi-automatic regulations, e.g. the story of a man who was dismissed from his job by an algorithm due to an error, where no human was able to stop the procedure [11] after the lay-off was triggered.

It must be noted that in these settings the data scientist is not involved directly. Maybe she or he built some model in preparation for steering the machine, but the implementation generally was not her or his task.

2.1.2 Bias, discrimination and exclusion

Algorithms and artificial intelligence can create biases, discrimination or even exclusion towards individuals and groups of people

General remarks

This issue is one where data science expertise is very important for understanding the extent of the problem. We start by stressing one point that is often overlooked when algorithmic bias is discussed. The very nature of the most commonly applied algorithms (called pattern recognition, or classification and clustering), if applied to humans, is applying prejudice. In statistical language they form a prior belief about an individual generated by experience with other individuals assigned to the same group. The goal of these algorithms is to assign a new object, in this case a person, to some group according to some measured characteristics of this person. Judgements and predictions on, e.g., future behaviour or reactions to a medical treatment for the individual are then made according to previously observed behaviours or reactions of the others in the group. Obviously, if this leads to improved medical decision making, this is to the benefit of the individual and of society at large.

In many examples, though, there is a possible benefit to some and a negative impact on others. In those cases, questions of fairness and justice are touched by the use of these algorithms for judgement, prediction and decision making in general. Any such use constitutes bias if the measured characteristics that lead to the assignment into the group are only correlated with, but not causally related to, the features that are judged. Formally the reason is that what is predicted or judged for the individual and the measured characteristic of the individual are conditionally independent given the individual. Note that this bias is created regardless of whether or not the underlying database is representative of the larger population with respect to the measured characteristics. The bias is created by applying an approach (= data + method) that is suitable only for correlational analyses to judgements that require causal reasoning on the individual level.

Practically this is no different from humans basing their judgement of a person on experiences (= data) they have had with other people who are alike according to some arbitrary (that is, bearing no causal relationship) assessment of similarity. If this is implemented by an algorithm the impact can be more severe, as the identical bias is applied to more people and forms a more systematic bias towards certain groups. Combined with monopolies on data ownership, as currently for social media or search data, and with the scalability of computing power, such a systematic bias can easily become a universal norm. Where the algorithm uses characteristics that include or are related to characteristics protected by anti-discrimination laws (mostly race, sexual orientation, religion or belief, age and disability), any judgement and any decision based on the algorithm constitutes an instance of discrimination when it results in one person being treated less favourably than another in a comparable situation.

This does not happen only in badly designed or malfunctioning systems. It is at the core of all classification applied to people.

Another, practically incurable, drawback of those algorithms is that they infer from data of the past, on the members of the group and/or the individual to be judged, whereas human behaviour on an individual level and its patterns do change over time.


Probably the most famous example is COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), a software used in the US judicial system to classify the probability of defendants’ recidivism. A good discussion of the approach can be found in [27]. A detailed analysis [5, 6] showed that the privately owned algorithm gave far better prognoses for white than for black people; thus it discriminated implicitly based on color. The machine-generated prognosis was intended just to help the judges, but interviews showed that it played a crucial role in the judgements. In particular, decisions by the judges on whether defendants could get out on parole or had to go to jail were strongly influenced by the algorithm’s output and discriminated against black people.
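One way to make a disparity of this kind measurable is to compare error rates per group. The following is a minimal sketch only: it is not the analysis reported in [5, 6], and the function name, the group labels and all records are invented for illustration.

```python
from collections import defaultdict


def error_rates_by_group(records):
    """Return {group: (false_positive_rate, false_negative_rate)}.

    Each record is a tuple (group, predicted_high_risk, reoffended),
    where the last two entries are booleans.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, predicted, actual in records:
        if predicted and actual:
            counts[group]["tp"] += 1
        elif predicted and not actual:
            counts[group]["fp"] += 1
        elif not predicted and actual:
            counts[group]["fn"] += 1
        else:
            counts[group]["tn"] += 1

    rates = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]  # people who did not reoffend
        positives = c["fn"] + c["tp"]  # people who did reoffend
        fpr = c["fp"] / negatives if negatives else float("nan")
        fnr = c["fn"] / positives if positives else float("nan")
        rates[group] = (fpr, fnr)
    return rates


# Invented toy data: (group, predicted high risk, actually reoffended).
toy_records = [
    ("A", True, False), ("A", True, True), ("A", False, False), ("A", True, False),
    ("B", False, False), ("B", False, True), ("B", True, True), ("B", False, False),
]
rates = error_rates_by_group(toy_records)
```

In this toy data, group A is wrongly flagged as high risk more often than group B, while group B is wrongly flagged as low risk more often. A classifier can therefore look acceptable on average and still distribute its errors unevenly across groups.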

It must be stressed that this bias in application was not intentional as far as is known. The bias was most probably introduced through the available data on prisoners in conjunction with the fundamental misunderstanding described above, namely that observed correlations would be good enough to make decisions that require causal reasoning.

Examples of the application of such algorithms are not restricted to the US. In Europe, for example, there is a recent initiative in Austria to classify unemployed people into one of three groups: bad (<= 25%), mediocre, or good (>= 66%) chances of being employed for at least 6 months within the next 24 months [19]. The idea is to target spending on bringing people back into the workforce more precisely. The stated goal of spending less money on those in the lowest group is controversial. It is reported that age and nationality increase one’s probability of being put in the lowest group. Both points seem to be openly discriminating. The official stance is that the algorithm does not decide but only helps a human to decide, and that therefore no discrimination would happen. This ignores the large influence that such supportive systems have when there is a shortage of money: decision makers typically need to justify themselves if they deviate from the algorithmic choices, but not if they follow the machine’s decision. The default mode of operation may change through the use of such a simple helper algorithm.
A very similar system is already in use in Poland [20].
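To make the grouping rule concrete, the following minimal sketch assigns a person to one of the three groups from an estimated employment probability. It assumes only the two thresholds quoted above (at most 25% for bad, at least 66% for good); the function name is hypothetical, and nothing here reflects how the actual Austrian or Polish systems estimate the probability.

```python
def assign_group(p_employment):
    """Map an estimated employment probability in [0, 1] to a group label.

    Thresholds follow the description in the text; the estimation of
    p_employment itself is outside the scope of this sketch.
    """
    if not 0.0 <= p_employment <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    if p_employment <= 0.25:
        return "bad"
    if p_employment >= 0.66:
        return "good"
    return "mediocre"


# Two nearly identical estimates end up in different groups.
just_below = assign_group(0.25)
just_above = assign_group(0.26)
```

The hard cut-offs mean that two people with almost the same estimated probability can be treated differently, which is one reason why the uncertainty of the estimate matters for anyone relying on the group label.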

In the examples given, in addition to generating bias, the automatic classifiers act like self-fulfilling prophecies. The automatic, even secret, classification of an individual will influence his or her future life in the direction the chosen algorithm determines. At the same time it becomes impossible to assess the algorithm’s performance in the future, as the future of the individual’s life is changed based on the algorithm’s outcome and there is no control group.

The algorithms also act much like ancient oracles. For an outsider it is impossible to find out exactly which characteristics of a person have led to the given classification. They are black-box algorithms, a feature shared by many of the algorithms from the artificial intelligence community. There is only the saying of the oracle, no reasoning, and no possible recourse. Black-box algorithms will therefore always be problematic for use in any judicial system or for any scoring that implies a value judgement of an individual, e.g. credit scoring.

In these applications some people benefit and others suffer negative consequences from the application of the algorithm. It is accepted that the application may not be in the interest of the individual who is judged.

Of course this drawback is not inherent in algorithmic decision making. It is possible to set up procedures with no intention to inflict negative consequences on some to the benefit of others, if care is given to transparency and possible discriminating behaviours. For example, in Germany there exists a program RADAR-iTE (Regelbasierte Analyse potentiell destruktiver Täter zur Einschätzung des akuten Risikos - islamistischer Terrorismus) [2], in which an algorithm is used to try to identify the more dangerous people in a group of people already under investigation by law enforcement.

Decisions are based on a set of 72 questions which are transparent to anybody involved. Because those assessed by RADAR-iTE are already under investigation, the most important aspect of its application is resource allocation by law enforcement. There is no additional negative effect on individuals judged to be high risk beyond being under investigation already. Published numbers [3] indicate that around half (96 of 205) of the suspects were considered low risk after classification by RADAR-iTE, and only around 40% (82 of 205) were considered high risk. Transparency seems guaranteed throughout all steps and decisions involving the algorithmic classifications.

In this case those applying the algorithms and those being judged share in some sense the goal of reducing the number of individuals that are observed. The application of the algorithm has the potential to help an individual by removing him or her from the group of high risk people.

Applying a similar algorithm to screen the overall population would lead to a completely different assessment. Technically, there is no barrier to such a use. It can only be prevented by morality and law.

2.1.3 Algorithmic profiling

Personalizing versus collective benefits: Individuals have gained a great deal from profiling and ever finer segmentation. This mindset of personalising can affect the key collective principles like democratic and cultural pluralism and risk-sharing in the realm of insurance.

The most discussed form of personalizing in the age of the internet is the so-called filter bubble [36]. The scandal around Cambridge Analytica using Facebook data for micro-targeting a very specific subset of the public with the aim to influence the US elections in 2016 made the dangers of highly personal news and marketing feeds obvious [7, 8].

As a reaction, legislators started to formulate laws to reduce the risks of such personalized targeting with fabricated news, e.g. in Germany the “Netzwerkdurchsetzungsgesetz” [9]. In the aftermath of that scandal, Facebook restricted third parties’ access to personal data [10].

A data scientist’s role, when implementing schemes for targeting specific sub-populations identified by profiling with the help of the vast amount of information available on each active person on the internet, should at least be to warn of possible misuse. She or he should understand the dangers for society and only help to implement lawful or ethical algorithms.

A good example for the second point, risk-sharing, is telemetry data collected by so-called smart devices and transmitted to insurance companies. Since the beginning of 2018 each new automobile in the EU has to record telemetry data in a system called eCall [21]. While that system will only transfer data in case of an emergency, there are systems that collect lots of information about all aspects of car usage, down to location and the music the driver listens to [4]. First, there are obvious problems with privacy if information can be shared unlawfully. The second problem is insurance companies that try to offer personalized policy premiums based on the level of data sharing a car owner accepts. Probably even more problematic are health data that can be accessed by insurance companies [15].

While at first nothing seems at stake if an unhealthy lifestyle is punished with higher policy costs, a second look reveals that the fundamental principle of insurance, namely risk sharing among a large group, is eroded. In addition there is a direct conflict between personalized insurance policies and personal freedom. Strong monetary pressure on customers to live a good life in the sense of the insurance companies must be expected.

2.1.4 Preventing massive files while enhancing AI: seeking a new balance

Artificial intelligence by being based on advanced techniques of machine learning requires a significant amount of data. Still, data protection laws are rooted in the belief that individuals’ rights regarding their personal data must be protected and thus prevent the creation of massive files. AI brings up many hopes: to what extent the balance chosen by the lawmaker and applied until now should be renegotiated?

A field of research that is already very experienced and advanced in using large databases on humans, and in trying to find ways to strike that balance, is the medical field. The following two examples illustrate the benefits of the availability of collected personal data and how the risks for individuals regarding their privacy, or for society regarding fair access to information, were mitigated.

In July 2018 some valsartan products were discovered to be contaminated with N-nitrosodimethylamine (NDMA). In September 2018 an expedited assessment of the cancer risk associated with exposure to NDMA through contaminated valsartan products was published [30], providing reassuring interim evidence that the short-term overall risk of cancer in users of valsartan contaminated with NDMA was not markedly increased. This fast assessment in a relatively large cohort (5150 Danish patients) was possible by linking data from four official Danish registries on the individual level, thus collecting information on prescriptions, cancer diagnoses, hospital admissions, mortality and migration. Privacy was protected by a process in which officials from the registries perform the linking, derive the important information, and then de-identify the data before it is sent to the scientists.
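The order of operations in that process (link first, de-identify afterwards, and only then hand over the data) can be sketched as follows. This is an illustration under strong simplifications, not the procedure used by the Danish registries: the two toy registries, the field names, the identifiers and the function name are all invented.

```python
import secrets


def link_and_deidentify(prescriptions, cancer_diagnoses):
    """Link two registries on a personal identifier, then drop the identifier.

    Both arguments are dicts keyed by a personal identifier. The returned
    records carry a random study id instead, so the recipient cannot map a
    record back to a person by means of the identifier.
    """
    linked = []
    for person_id, prescription in prescriptions.items():
        linked.append({
            "study_id": secrets.token_hex(8),  # random, unrelated to person_id
            "prescription": prescription,
            "cancer_diagnosis": cancer_diagnoses.get(person_id),
        })
    return linked


# Invented toy registries keyed by an invented identifier.
prescriptions = {"id-001": "valsartan", "id-002": "valsartan"}
cancer_diagnoses = {"id-002": "diagnosis on record"}
deidentified = link_and_deidentify(prescriptions, cancer_diagnoses)
```

Removing the direct identifier is only one part of de-identification; the remaining fields can still single out a person when they are combined, which is why the step is carried out by the registry officials who see the full data.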

In 2018 the German health insurance company DAK Gesundheit, in cooperation with scientists from the University of Bielefeld, published a report on the health status and the health costs of children and adolescents based on the claims database of people insured with DAK Gesundheit [33]. Besides a general overview of health status, a key topic was the investigation of the influence of the parents’ socioeconomic status and education on the health and induced health costs of the children. The main conclusion is that education is a stronger influencing factor than socioeconomic status and that important preventive measures consist of giving children good health education. In the same report, guest authors [34] also discuss the results of the KiGGS study [35]. That study puts more emphasis on the principle of equal opportunity and on the influence of socioeconomic status on general health and specifically mental health. Publishing these together shows the sensitivity of the topic in the political debate and the role that an open scientific environment has to play.

Both the valsartan case and the DAK study show that there are true benefits for public health that can be generated from using large medical databases. When balancing these benefits against the risk of privacy violations for the people whose data is used, the valsartan case highlights the high trust that citizens place in officials: if data on any medical problem one encounters in life can be linked to one’s home address, citizens need to trust the government that this data is not accessible, or made accessible, to anyone who uses this information with other than the best intentions. The DAK study highlights another important aspect of balancing benefit and risk: the ownership of data and fair access to data. Data is the new oil, and evidence generation shapes how benefit is defined and how it is implemented. Thus, if risk is shared by people of all political opinions, then fairness requires that evidence generation is possible for people of different political opinions.

In general, an important measure for respecting privacy is to de-identify data in the databases, making them non-identifiable. Guidelines exist for de-identification processes (e.g. the Safe Harbor method [32]), yet with databases growing through social media use and genetic and biomarker research, non-identifiability is a moving target. A good counter-measure is implemented in the process for requesting access to the so-called MIMIC-III database [31] on critical care unit patients. In addition to required training on data privacy and strict de-identification of the data, all scientists accessing the data have to submit a data use agreement with 10 points, one of which requires the scientists to take immediate action should they realize that there is a way to re-identify data. This acknowledges that de-identification is no guarantee of non-identifiability at all times. It installs a process for monitoring identifiability by those who have the expertise and knowledge, namely the data scientists, holding them responsible for it and giving them, as a community, a general credit of trust.
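Why non-identifiability is a moving target can be illustrated with a simple group-size check in the spirit of k-anonymity, a technique that is neither mentioned in this article nor, as far as we know, part of the MIMIC-III agreement. The sketch below counts how many records share the same combination of attributes; the records, attribute names and function name are invented.

```python
from collections import Counter


def smallest_group_size(records, attributes):
    """Return the size of the smallest group of records that share the
    same values for the given attributes.

    A result of 1 means at least one record is unique on these attributes
    and could be singled out by anyone who knows them.
    """
    if not records:
        raise ValueError("records must not be empty")
    keys = [tuple(record[a] for a in attributes) for record in records]
    return min(Counter(keys).values())


# Invented, already de-identified toy records.
records = [
    {"age_band": "60-69", "region": "north", "rare_marker": False},
    {"age_band": "60-69", "region": "north", "rare_marker": True},
    {"age_band": "70-79", "region": "south", "rare_marker": False},
    {"age_band": "70-79", "region": "south", "rare_marker": False},
]

before = smallest_group_size(records, ["age_band", "region"])
after = smallest_group_size(records, ["age_band", "region", "rare_marker"])
```

With age band and region alone, every record has at least one look-alike; once an additional attribute such as a biomarker becomes available, some records become unique. The data did not change, only what can be combined with it, which is why ongoing monitoring by those working with the data is needed.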

2.1.5 Quality, quantity, relevance: the challenges of data curated for AI

The acceptance of the existence of potential bias in datasets curated to train algorithms is of paramount importance.

Even if implemented with the best of intentions, there may be unexpected bias in the training data going beyond what has already been said about bias in Section 2.1.2. There are many examples; we give two.

One famous example of algorithmic training going wrong was Microsoft’s Twitter bot Tay [13]. Tay was implemented to act on Twitter as a regular user. The bot was supposed to learn from the comments of others how to conduct common Twitter conversations. In less than a day, humans had learned how to manipulate the learning algorithm in such a way that Tay started to utter fascist and racist slogans. Microsoft decided to take Tay offline less than a day after it started learning.

A recent example of a similar event is an AI system at Amazon. That system was supposed to help find the most qualified applicants in the company’s huge stream of applications. The experiment had to be stopped when it was noted that the algorithm systematically downgraded applications from women. Some probable causes for that behaviour are given in [12]. The training data contained mostly applications from men, so most of the successful applicants were men. Not many details are available, but as a consequence any appearance of the word woman reduced the chances of that application.

Finally the whole project was stopped, even after the development team tried to correct for known shortcomings, because there was no guarantee that the machine would not devise other ways to discriminate [12].

The important observation in both cases is that these black-box algorithms could not be improved. They had to be taken offline and completely replaced. As an obvious consequence, such algorithms should not be used where such a replacement is complicated or dangerous.

2.1.6 Human identity before the challenge of artificial intelligence

Hybridisation between humans and machines challenges the notion of our human uniqueness. How should we view the new class of objects, humanoid robots, which are likely to arouse emotional responses and attachment in humans?

This point from the debate in France run by CNIL is given only for the sake of completeness. At the moment, we do not believe that this is an ethical issue where data scientists have a special responsibility due to their expertise.

2.2 Conclusion from CNIL’s report

The given examples show the multitude of complex ethical issues that arise from a data scientist’s work. In the next section we argue that ethical guidelines for data scientists are one means of helping them to take on their responsibility.

3 Guidance for data science

The call for more guidance for digital technologies in media in general is loud and comes from all across the globe, leading to various initiatives and groups engaging in discussions around ethical rules for developing and implementing those technologies. For an overview of initiatives and ethical values in the tech field, visit the website of the think tank doteveryone [16] or the blog of Erickson [17]. There is a long history of computer scientists discussing the ethics of algorithms. A good starting point is the fatml website. Here fatml is an acronym for Fairness, Accountability, and Transparency in Machine Learning and stands for a series of conferences. For the German-speaking communities, we recommend the slides of the one-day workshop Ethische Leitlinien wissenschaftlicher Fachgesellschaften of the Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS) [14], or the Algorithmic Accountability Lab (AAL) at the University of Kaiserslautern. AAL provides a good source for current discussions that are not specific to data scientists but concern the use of algorithms in general, with some hints toward data science.

This article is, in that sense, one contribution among many. Its main purpose is to broaden the audience and increase the number of participants in the discussions, and to foster the development of morality, a set of deeply held, widely shared, and relatively stable values [37], on data science within and around the data science community. Any ethical guidance, be it in the form of codes, oaths, or even law, only has the intended impact if people are willing to follow it, and the chance of that is high if the underlying norms and values are in accordance with, in this case, the data science community’s own morality.

3.1 Do we really need more ethical guidelines?

Not everyone would agree that data scientists need more guidance on how to make moral decisions in their professional life: many work in companies with codes of conduct, work for institutions that require some oath, are members of scientific societies that give ethical guidelines to their members, or have religious beliefs that give guidance on right and wrong in their life. There is also the fundamentally skeptical view that paper does not blush. And we are all obliged to obey the law. So what does a special set of ethical rules for the profession of data science add?

Four rationales:

  1. For the individual data scientist, the translation from very general ethical principles from common morality, law or religion, to an ethical issue at work can be quite difficult. Especially since most issues are not about intentions, but about the consequences of one’s work. Those consequences are often not very easy to judge upon. Having some reference to well-thought through and well-reasoned guidelines in that sense is not more nor less than having publications on specific methods: it helps to avoid re-inventing the wheel ever so often. In addition, it can be very helpful to have such a reference along with the reasoning for justification, if the consequences of an ethical decision increase the workload for a colleague or costs for an employer or client.

  2. For data scientists as a community, having formulated codes of conduct or some service ideal makes the difference between acting as professionals and merely having a job that involves data crunching. In sociology, a profession is defined by means of professionalism. This implies that a profession has a certain degree of autonomy in society, that its members’ expertise is based on science, and that the professional work exemplifies a service ideal [28]. In other words: without a service ideal, there is no professionalism, and without professionalism, there is no profession.

  3. For data scientists as members of society, and for their clients, employers, and colleagues, written rules of conduct for data science services can help to establish a relationship of trust. If they are written clearly, they give lay people some means of knowing what to expect from a data scientist, of comparing what they are getting against that standard, and finally of gaining trust if the expectations are met. Being trusted as a professional increases social status, reputation, and possibly the money that is paid for the service.

    A code of conduct or ethical guidelines may even be the start of a well-defined job description for data scientists!

  4. In cases of conflicts of interest, an ethical guideline maintained by a professional society may offer an arbitration process between the different interests.

4 Existing guidelines and codes

In the previous section, we provided references to ongoing efforts to develop ethical guidelines for data science itself and for connected scientific or technical fields. Here, we want to give more details on the three main guidelines from the fields of statistics and computer science, issued by some of the largest and oldest established associations for those communities. If one could establish additional sub-guidelines that filled the gaps with respect to data science aspects, the audience would immediately be very large, and there would be no need to establish a new association. Both ACM and ASA acknowledge data science as an important field in their domains.

4.1 ASA: Ethical guidelines for statistical practice

The American Statistical Association (ASA) was founded in Boston in 1839 and has more than 19,000 members worldwide. The current Ethical Guidelines [23] were updated and approved by the ASA Board in April 2018. The guidelines have eight sections, six of which describe the responsibilities towards individuals and groups of people to whom the statistical work may matter:

  • Professional integrity and accountability,

  • integrity of data and methods,

  • responsibilities to science/public/funder/client,

  • responsibilities to research subjects,

  • responsibilities to research team colleagues,

  • responsibilities to other statisticians or statistics practitioners,

  • responsibilities regarding allegations of misconduct,

  • responsibilities of employers, including organizations, individuals, attorneys, or other clients employing statistical practitioners.

Checking which of the ethical issues discussed in Section 2.1 are covered, one recognises that the guidelines are, implicitly, a clear call for human responsibility, which addresses the issue raised on autonomous machines (Section 2.1.1). They touch only very briefly on the risk that information presented as aggregates on groups may lead to bias, discrimination, and exclusion (Section 2.1.2). They set high standards for privacy and for respecting data confidentiality (Section 2.1.4). With the section on integrity of data and methods, and throughout almost every other point, they give clear guidance on the quality, quantity, and relevance of data, and on a general notion of scientific honesty. They also address ethical issues specific to human studies, which are not covered in Section 2.1 but are very relevant to all scientists working in that field. The guidelines have gaps concerning those ethical issues that result from the implementation of statistical procedures into daily practice. Missing are discussions of the ethical issues that can arise from implementing algorithmic results in automatic decision making without further human interaction.

4.2 ACM: Code of ethics and professional conduct

The Association for Computing Machinery (ACM) was founded in 1947 and has more than 100,000 members worldwide. The ACM has had ethical guidelines for a long time. The Code [24], as it is named, was updated and adopted by the ACM in June 2018. It has a preamble and four sections:

  1. General ethical principles,

  2. professional responsibilities,

  3. professional leadership responsibilities and

  4. compliance with the code.

On a general level, the Code addresses all ethical issues that we present in Section 2.1. Yet the Code is not a code for data science, and it does not provide the constructive guidance that the ASA gives on the integrity of data and methods related to scientific honesty and on responsibilities to research subjects.

4.3 Ethical guidelines of the German Informatics Society

The German Informatics Society (GI) has maintained ethical guidelines for a long time [25]. The latest update was in June 2018. These guidelines are concise and consist of a preamble and 12 very short sections.

  • Sections 1 to 4 concentrate on aspects of the professional competence of computer scientists,

  • sections 5 and 6 are about individual working conditions,

  • sections 7 and 8 are about teaching and researching in the field of computer science.

  • Very interesting are sections 9, 10, and 11, which clearly state the societal responsibilities of computer scientists. We see some overlap with the work of data scientists there.

  • Finally, section 12 defines a mediating role of the German Informatics Society in case of conflicts stemming from these guidelines.

There are no data science specific sections in these guidelines; nevertheless, many important aspects are touched on. We think the structure of the ethical guidelines of the GI can be a good skeleton for developing ethical guidelines for data science.

4.4 Conclusion from examining existing guidelines

The ethical guidelines of the ASA are constructive and detailed on the ethical issues facing statisticians and data scientists in the sense of Donoho (Section 1) who work in research, and on the special responsibilities towards participants in human studies. The Code of the ACM covers the use of data from and about humans outside of human studies, as well as issues that arise from implementing data science algorithms for repeated use with impact on individuals and communities. What we have in mind is a combination of those aspects, maybe structured as in the guidelines of the GI, as data scientists work on data from all sources and across all those areas.

5 Development of ethical guidelines for data science

There are hurdles to overcome before a meaningful guideline can be established. In our view the main ones are the lack of a sense of community and a lack of communication on ethics.

5.1 Data scientists have to perceive themselves as a community

At the moment, the term data scientist is not a protected professional title. Data scientists may have academic training in statistics or computer science as their main field, but also in engineering, psychology, or business management; they may be trained programmers, or may only have followed a three-month course on data science, learning Python, Julia, or R. In that sense, data science today is not a profession but only an occupation [28]. On the ground, there is not much tension between data scientists from statistics and those from computer science, but there are many turf battles at the academic level. So the first step would be to realize that ethical guidelines are a shared interest, and then to start discussing the content within data science related societies, at conferences, in university courses, and at work with colleagues.

Being a community does not mean that there is a need for a new association. A good option would be to add data science specific guidelines to those of the ACM, the ASA, and the GI. Such an approach would have the big advantage that it would not require first establishing a new data science association. Of course, the authors would like to see the European statistical societies embrace ethical issues in their agendas.

5.2 Data scientists have to overcome shyness or ignorance in discussing ethics and their own moral views related to data science

In the perception of the authors, it is very uncommon for data scientists to express any moral view on the work they do or on the impact their work may have on other people and on society at large. That might be because society and data scientists themselves have only recently realized how much impact data science services have on individuals and communities. Maybe it is because this impact is, by its very nature, de-personalized, so that it is easy to overlook one’s own responsibility. Maybe it is because most people in data science come from a mathematical, technical, or computer science background and are in general less vocal on anything outside hard science. The places to change such a culture fundamentally should be the universities and colleges where data science is taught. Ethics and professional ethics should be part of the curriculum, just like inspiring critical thinking and expressing one’s views. In the meantime, every data scientist can work towards that goal within her or his environment. It is crucial to take part in discussions at work on critical projects, or within any community where there are discussions on, e.g., the so-called digital revolution, the influence of social media, or algorithms in health care or the criminal justice system.

Talking about ethical questions must become natural for any data scientist.

6 Conclusion

We wrote this article, for the most part, without assuming that our views are generally shared or that anyone has to agree that any given specific application is good or bad. Underlying this is an understanding that the morality of the data science community is evolving and that developing it is a shared task, which in turn needs open discussion. Yet there is at least one fundamental moral conviction of the authors which we have taken as a generally agreed moral principle: as a human being, one has to think about the possible consequences of one’s actions. That responsibility for the consequences grows with the knowledge and the capacity one has to think about them.

Finally, we want to start the debate with a first statement:

Data science is in the focal point of current societal development. To build trust in data science and its interaction with society, and to empower data science to take responsibility for its contributions to society, data science must develop professional ethics and become a clearly defined profession!