HARK Side of Deep Learning -- From Grad Student Descent to Automated Machine Learning

Recent advancements in machine learning research, particularly deep learning, have introduced methods that outperform conventional algorithms as well as humans in several complex tasks, ranging from detection of objects in images and speech recognition to playing difficult strategic games. However, the current methodology of machine learning research, and consequently the implementation of real-world applications of such algorithms, seems to have a recurring HARKing (Hypothesizing After the Results are Known) issue. In this work, we elaborate on the algorithmic, economic and social reasons for and consequences of this phenomenon. We present examples from current common practices of conducting machine learning research (e.g. avoidance of reporting negative results) and of the failure of the proposed algorithms and datasets to generalize in actual real-life usage. Furthermore, a potential future trajectory of machine learning research and development from the perspective of accountable, unbiased, ethical and privacy-aware algorithmic decision making is discussed. We would like to emphasize that with this discussion we neither claim to provide an exhaustive argumentation nor blame any specific institution or individual for the raised issues. This is simply a discussion put forth by us, insiders of the machine learning field, reflecting on ourselves.




1 Introduction

Hypothesizing after the results are known (HARKing) [1] occurs when researchers masquerade one or more post hoc hypotheses as a priori hypotheses. This means that instead of following a traditional hypothetico-deductive model [2], in which previous knowledge or conjecture is used to formulate hypotheses that are then tested, the researcher instead looks at the results first and then forms a post hoc hypothesis. HARKing can occur in different forms, such as constructing, retrieving or suppressing hypotheses after the results are known [1]. A number of studies in recent years have examined and discussed the incidences, causes and implications of such practices within various fields such as management, psychology as well as natural sciences [3, 4, 5].

In recent years, deep learning (DL) methods have dramatically improved the state-of-the-art (SotA) within the fields of speech recognition, visual object recognition, machine translation and several other domains such as drug discovery and genomics [6]. However, there are certain troubling trends in current machine learning (ML) research, outlined in [7] as failure to distinguish between explanation and speculation, use of mathematics that obfuscates rather than clarifies, and misuse of language. Unfortunately, HARKing has also been one of those recurrent trends in machine learning and especially in deep learning research. Since much of this research is being eagerly applied to real-world applications in both industry and society, such issues are of utmost importance due to the wide impact of machine learning products and services across all walks of life. Transparent and reliable practices are critical when trying to combat suspicions towards new technologies, and trust needs to be built over a long period of time, as recently acknowledged even at the European Commission level [8].

Our hypothesis is that the recent explosion of advances within the fields of machine learning and in particular deep learning, as well as the hyper-competitive nature of these fields, may be a dangerous breeding ground for various HARKing behaviors, the implications of which are not yet fully explored. At the very least, concerns regarding such behaviors deserve to be critically discussed from different angles so as to encourage best practices when building ML systems and algorithms. These issues are not new in themselves. In fact, for as long as data-driven approaches and learning systems have been around, it has been critical, and sometimes difficult, to remain fully objective in analyzing results. Related issues were reported earlier, such as self-deception practiced by scientists and finding patterns that are not there [9, 10].

In this paper we discuss HARKing behavior from different angles:

  • Section 2 - Competitiveness in DL research leading to questionable improvements of state-of-the-art and claims of novelty.

  • Section 3 - Pressure to create reports that are favorable for publication and aversion towards negative results.

  • Section 4 - The belief that current training datasets are representative of real-world samples.

  • Section 5 - Automated machine learning

  • Section 6 - Explainability, ethics, reproducibility and more for AI systems.

2 Grad Student Descent and SotA-hacking

In a typical deep neural network model there are numerous design choices, i.e., tunable parts such as model architecture and hyper-parameters, that affect the predictive performance. Proposing a decent set of these design choices that will result in high generalization ability (relative to the other sets of choices) is difficult mainly for two reasons. Firstly, due to the inherent non-deterministic and highly non-linear nature of neural networks, it is not trivial to deduce explicit relationships either between the hyper-parameters and the model performance, or between the hyper-parameters themselves. For instance, a large batch size is key to speeding up neural network training in large distributed computation infrastructures; however, significant degradation in model performance has been observed in practice when large batch sizes are employed [11]. To overcome this issue, typically, hyper-parameters belonging to the optimizer need to be tuned. Secondly, as the parameter search space increases exponentially, it is not feasible to apply exhaustive or brute-force search methods. Therefore, a significant portion of deep learning research has been focusing on engineering efficient model architectures and hyper-parameters for specific tasks.

Even though this manual discovery process has been successful for several applications (often empirically), there has been significant divergence from the traditional hypothesis-driven scientific approach in the methodology of such studies. Instead of hypothesis-forming based on theory, extensive research on previous studies and/or reflection against the existing domain knowledge, grad student descent (a cheesy pun referring to the well-known gradient descent algorithm) is applied.

Grad student descent is a type of optimization scheme in which the task of model architecture or hyper-parameter search is assigned to several graduate students, usually to be performed by trying what works and what does not. This is an iterative approach, where one starts with a baseline architecture or possibly with an earlier SotA, measures its performance and applies various modifications by trial-and-error, without a sound hypothesis. Once marginal improvements are observed, iterations of modifications continue further in that direction until a local optimum (often a publishable result) is reached and an explanation is forged. In essence, this whole process is driven by HARKing. Furthermore, this process is performed with a limited set of data that is used and re-used again and again to find the "optimal" solution (further discussed in Section 4). Oftentimes, final testing on a completely independent test set that has not been touched or observed at any moment is not performed, and cross-validation is either not used or used under problematic assumptions and/or executions, such as performing model tuning and estimation of model error at the same time [12, 13].
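The risk of tuning a model and estimating its error on the same data can be illustrated with a small simulation. The sketch below is not taken from [12, 13]; it assumes purely random features and labels (so the true accuracy of any model is 50%) and a hypothetical one-parameter "model" that predicts the label from a single chosen feature. All function names are illustrative.

```python
import random

def make_noise_data(n_samples, n_features, rng):
    """Random binary features and labels: no feature carries real signal."""
    X = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [rng.randint(0, 1) for _ in range(n_samples)]
    return X, y

def accuracy(X, y, feature, rows):
    """Accuracy of the rule 'predict label = value of `feature`' on `rows`."""
    return sum(X[i][feature] == y[i] for i in rows) / len(rows)

def select_feature(X, y, rows):
    """'Tuning': pick the feature that scores best on the given rows."""
    return max(range(len(X[0])), key=lambda f: accuracy(X, y, f, rows))

def flawed_estimate(X, y):
    """Tune and estimate on the same samples (optimistically biased)."""
    rows = list(range(len(y)))
    return accuracy(X, y, select_feature(X, y, rows), rows)

def held_out_estimate(X, y, n_folds=5):
    """Tune inside each training fold, estimate only on the held-out fold."""
    rows = list(range(len(y)))
    correct = 0
    for k in range(n_folds):
        test = rows[k::n_folds]
        test_set = set(test)
        train = [i for i in rows if i not in test_set]
        best = select_feature(X, y, train)
        correct += sum(X[i][best] == y[i] for i in test)
    return correct / len(rows)

def compare_estimates(n_repeats=20, n_samples=40, n_features=300, seed=0):
    """Average both estimates over several independent noise datasets."""
    rng = random.Random(seed)
    flawed, held_out = [], []
    for _ in range(n_repeats):
        X, y = make_noise_data(n_samples, n_features, rng)
        flawed.append(flawed_estimate(X, y))
        held_out.append(held_out_estimate(X, y))
    return sum(flawed) / n_repeats, sum(held_out) / n_repeats

if __name__ == "__main__":
    flawed, held_out = compare_estimates()
    print(f"tuned and scored on the same data: {flawed:.2f}")
    print(f"tuned per fold, scored on held-out data: {held_out:.2f}")
```

Because the data contain no signal, any estimate clearly above 0.5 is an artifact of selecting and scoring on the same samples; keeping the selection step inside each training fold removes that artifact.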

The abovementioned HARKing pattern, consequently, results in increased difficulty in distinguishing and identifying why a proposed method works or not. Lack of thorough hypothesis forming prior to experimentation often leads to negligence of comprehensive discussions on the results as well, especially when accompanied with comparison of a single score or metric. For instance, a recent work by Reimers and Gurevych shows that reporting a single performance score is insufficient to compare non-deterministic approaches such as neural networks [14]. Their study demonstrates that the seed value for the random number generator can result in statistically significant differences in performances of state-of-the-art methods [14].
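One way to act on this observation is to report the distribution of scores over several random seeds rather than a single number. The sketch below is not the procedure of [14]; `train_and_eval` is a hypothetical stand-in for a full training run whose score depends on the seed, and Welch's t statistic is used here only as one simple, assumed choice of comparison.

```python
import math
import random
import statistics

def train_and_eval(true_quality, seed, noise=0.01):
    """Hypothetical stand-in for a training run: score varies with the seed."""
    rng = random.Random(seed)
    return true_quality + rng.gauss(0.0, noise)

def summarize(scores):
    """Report the spread of scores, not only the best one."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }

def welch_t(a, b):
    """Welch's t statistic for two independent samples of scores."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    std_err = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / std_err

if __name__ == "__main__":
    seeds = range(10)
    # Two "methods" with identical true quality: any gap is seed noise.
    baseline = [train_and_eval(0.90, s) for s in seeds]
    proposed = [train_and_eval(0.90, s + 1000) for s in seeds]
    print("baseline:", summarize(baseline))
    print("proposed:", summarize(proposed))
    print("best-seed gap:", max(proposed) - max(baseline))
    print("Welch t:", welch_t(baseline, proposed))
```

In this toy setting both methods have the same underlying quality, so a gap between their best seeds reflects noise only; comparing the full score distributions makes that visible.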

The negative effects of HARKing are not specific to deep learning research alone, and they can be observed in research dealing with traditional machine learning methods as well. However, as the concept of state-of-the-art (a method or a set of methods that outperforms all the previously proposed methods for a given machine learning task in a certain metric such as test accuracy, inference speed, training speed etc.) has been disproportionately promoted in DL, both in academia and industry, the presence of HARKing is becoming more likely to be overlooked, especially if there are claims of advancing the SotA. This phenomenon has been promoting the concept of SotA-hacking and the publishing of marginally SotA results without in-depth analysis or discussion, similar to p-hacking, data dredging and the prevalence of marginally significant results in several other fields [15, 16, 17]. Typical examples of misleading comparisons leading to unfair or inadequate SotA claims include usage of additional training data (the common concept of transfer learning in DL), usage of data augmentation, comparison to poorly implemented baselines or ensembling of several models. Similar unjustified claims can be observed in the "novelty" of proposed methods as well.

3 Chronic Allergy to Negative Results

Publication bias, the phenomenon occurring when the probability of a scientific study being published is not independent of its results [18], leads to a systematic difference between the findings of published tests of a claim and the findings of all tests of the same claim [19]. Often recurring as a positive outcome bias, this phenomenon has been observed in several research fields for a long time [20, 21, 22]. For example, in clinical research, studies finding no difference between the study groups were less likely to be published than those with statistically significant results [20]. In fact, there is evidence of negative results being less likely to be published even if they provide corrections of errors in previous studies [23]. A similar troubling trend has been prevalent in ML/DL research and arguably HARKing exacerbates this further.

Publishing a null or negative result in the current ML researchosphere is considerably difficult due to the widespread assumption that "every positive result is scientifically more valuable, or interesting, than any negative one". This is likely even more the case in DL research because of the ever-increasing competition. For instance, the percentage of accepted papers related to deep neural networks in the Conference on Computer Vision and Pattern Recognition (CVPR), one of the most prestigious in its field, has been 1%, 14% and 25% for the years 2013, 2015 and 2017, respectively [24]. Note that the number of submissions to conferences and journals is increasing every year as well, e.g., the number of submissions to the Annual Conference on Neural Information Processing Systems (NeurIPS) doubled from 2016 to 2018 [25]. Similar trends can be expected in research funding or scholarship applications. A research proposal is more likely to get a positive review if it builds further on "encouraging results" from previous work. There have been initiatives to discuss the importance of negative results and share them in ML research [26], such as the First Workshop on Negative Results in Computer Vision in 2017, and we hope more actions in this direction will be realized in the future.

Current outcome reporting bias in ML/DL research is generated both from the authors’ side, as a reluctance to report negative results, and from the journals’ side, in selecting the results worth publishing, and it is not trivial to separate the extent of the two. Even in the presence of a positive result, authors may not report the negative ones, thinking such reporting will devalue their work. As stated by Nissen et al., even if authors’ behavior is the main contributor to publication bias (there is evidence supporting this in other fields [27, 28]), they may simply be responding to the editorial preferences for positive results [29]. The lack of traditional hypothesis construction before conducting the experiments, and the lack of any expectation to do so, reinforces the incentive to avoid reporting negative results in the ML/DL field.

There are several consequences of such allergy against negative results in deep learning research. First, it eventually creates a bias against disruptive innovative ideas and favors incremental tweaks on well-established methods. Secondly, when negative results are not reported or published, it is essentially more difficult to construct causality and elaborate on the phenomena behind the positive results. As in other aspects of life, after all, we learn from negative results as well as positive ones. Furthermore, it increases the waste of resources and efforts due to unnecessary (re-)implementation of methods that have been shown to be inferior but never reported. Finally, a negative result may be caused simply by a poor implementation, in which case the work still has the potential to be influential once implemented properly.

The trend of starting from a solution (often somebody else’s) instead of from the problem itself and HARKing after minor modifications can be changed by changing our paradigm of publication process. Hereby, we propose a results-blind review process for ML/DL research:

  • A paper is submitted with a clear hypothesis accompanied with the design of experiments. The hypothesis can be based on extensive analysis of previous studies, mathematical theory with unambiguous assumptions and/or domain knowledge of the specific field.

  • The paper gets peer reviewed, preferably double-blind, and the reviewers suggest modifications and improvements on the experimental methods.

  • Once accepted, the experiments are run.

  • The paper gets published regardless of the results with a comprehensive discussion section.

This approach would increase the likelihood of the study to be informative and influential regardless of the outcome, not only in the case of positive results. Essentially, the review process will give more attention to the experimental design and the hypothesis behind the proposed methods, decreasing the incentive for HARKing significantly. Naturally, this will also encourage researchers to navigate outside the "marginal improvements over the previous SotA" thinking. Similar ideas have been discussed especially in the field of psychology [30, 31, 32]. Note that we do not claim that the abovementioned proposal is applicable for every machine learning research publication process, mostly due to the scarcity of high quality reviewers. Nevertheless, we believe such discussions are beneficial and may eventually lead to improvements that will decrease the prevalence of HARKing in ML/DL research.

4 "In The Wild" Illusion

Numerous studies in the field of deep learning utilize publicly available annotated datasets for computer vision, natural language processing, audio analysis and various other tasks. Several of these datasets even include the phrase "In the Wild" in their name, an expression meant to convey that the dataset holds no constraints and is representative of real-world circumstances. Even though it is not stated explicitly, the main assumption behind using these datasets is that their observations are drawn from the same statistical distribution as all possible observations naturally occurring in the real world.

In 2011, Torralba and Efros proposed to examine dataset bias in twelve popular image datasets by observing whether it is possible to train a machine learning model to identify the dataset a given image is selected from [33]. With twelve datasets, the random guess accuracy is only 1/12 (about 8%), yet the authors found that both humans and a simple support vector machine classifier performed well above this chance level. The authors furthermore demonstrated the inability to perform cross-dataset generalization, thereby highlighting how models trained on typical datasets actually overfit and thus fail to generalize to other datasets, let alone to real-world settings.
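The "name that dataset" experiment can be sketched on synthetic data. The code below is not the setup of [33]; it assumes each hypothetical dataset adds its own small offset (a stand-in for capture or selection bias) to otherwise identically distributed feature vectors, and uses a nearest-centroid classifier in place of the support vector machine.

```python
import random

def sample_dataset(bias, n, dim, rng):
    """Feature vectors sharing one distribution plus a dataset-specific bias."""
    return [[rng.gauss(0.0, 1.0) + bias[d] for d in range(dim)] for _ in range(n)]

def centroid(rows):
    """Per-dimension mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def nearest(centroids, x):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    def dist(c):
        return sum((ci - xi) ** 2 for ci, xi in zip(c, x))
    return min(range(len(centroids)), key=lambda k: dist(centroids[k]))

def name_that_dataset(n_datasets=12, dim=20, n_train=100, n_test=100,
                      bias_scale=0.5, seed=0):
    """Train on samples from each dataset, then guess the source of new ones."""
    rng = random.Random(seed)
    biases = [[rng.gauss(0.0, bias_scale) for _ in range(dim)]
              for _ in range(n_datasets)]
    centroids = [centroid(sample_dataset(b, n_train, dim, rng)) for b in biases]
    correct = total = 0
    for k, b in enumerate(biases):
        for x in sample_dataset(b, n_test, dim, rng):
            correct += nearest(centroids, x) == k
            total += 1
    return correct / total, 1.0 / n_datasets  # (accuracy, chance level)

if __name__ == "__main__":
    acc, chance = name_that_dataset()
    print(f"dataset identification accuracy: {acc:.2f} (chance: {chance:.2f})")
```

If the datasets were unbiased samples of the same world (`bias_scale=0.0` in this sketch), the classifier should not be able to do better than chance; accuracy above chance is therefore a direct signal of dataset-specific bias.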

A similar problem of overfitting stems from the hyper-competitive nature of machine learning, where there is little incentive to try to publish methods that have inferior performance compared to SotA on test datasets (see Section 3). Therefore, we can reasonably expect that effectively most research uses the test set as a validation set, rather than following the standard practice of defining a separate validation set from the training data. Recht et al. show this by creating a new test set for CIFAR10, a widely used image dataset, where they found that there was a significant drop in accuracy (4-15%) from the old test set to the new test set when tested with several DL architectures [34]. In a more recent work, a similar phenomenon is also shown for the well-known ImageNet dataset, suggesting that the accuracy drops are caused by the models’ inability to generalize to slightly "harder" images than those found in the original test sets.


From the HARKing perspective, formulating hypotheses that are specifically designed to account for the observed results for a specific sample of observations goes hand in hand with overfitting and failure of generalization. Furthermore, the datasets selected to run the proposed experiments on have to be in line with the hypothesis. For instance, the well-known Labeled Faces in the Wild dataset [36] contains images of famous people only, but has been used extensively to test hypotheses of face recognition or person identification in unconstrained settings. And from the implementation perspective, by splitting a dataset into training, validation, and testing sets, we invariably risk giving the false impression that because our model may perform well on the test dataset, it will also generalize to images found in real-world applications. In both cases mentioned above (using biased datasets and/or overfitting to specific test sets), it can be argued that hypothesis testing is conditional on the dataset in question, and therefore to convince a reader that HARKing has not occurred, an author should always take great care to demonstrate the generalizability of new methods. Obviously, overfitting is a problem encountered in ML in general and is not specific to neural networks. However, considering:

  1. feed-forward neural networks, as well as convolutional networks, are universal function approximators (by the Universal Approximation Theorem), i.e., a single hidden layer network containing a finite number of neurons can approximate continuous functions with arbitrary precision [37, 38]

  2. the complexity of the function computed by a neural network grows exponentially with its depth, i.e., for every additional hidden layer, one needs exponentially more parameters to express the same function with a shallower network [39, 40]

deep neural architectures are very likely to suffer from overfitting due to their expressive power.

5 Automated Machine Learning

The traditional data science approach relies on many sequential tasks, i.e., data preprocessing and cleaning, feature engineering and selection, model selection and parameter tuning, postprocessing, and finally critical analysis of results. Often in practice, the human decision making processes in these tasks are inefficient (see Section 2) or based on heuristics. Furthermore, the combined complexity of these tasks often presents an insurmountable barrier for non-experts, and thus automated machine learning (AutoML) is a topic that has become increasingly popular in recent years, promising to automate (at least parts of) this pipeline in order to improve the efficiency of machine learning and accelerate research.

Recently, the most popular AutoML task has been neural architecture search (NAS) [41, 42, 43, 44, 45, 46, 47], i.e., automating the design of neural network architectures in search of architectures that are superior to hand-crafted ones. Other AutoML tasks include automated hyper-parameter optimization [48], activation function search [49], optimizer search [50], data augmentation policy search [51] or even search for better hardware utilization in heterogeneously distributed (mixture of CPUs and GPUs) computing environments [52]. The methods behind such meta-learning approaches are mainly based on Bayesian optimization [48], evolutionary algorithms [43, 46] or, more recently, on reinforcement learning [41, 49, 52]. Some of these methods are available to both academia and industry as open source software or in the form of software-as-a-service.

These advancements not only help us discover DL models and solutions that are better in terms of quantitative metrics than hand-engineered ones, but also carry the possibility to transform the everyday working practices of machine learning researchers and practitioners. With AutoML, data scientists are expected to offload a significant portion of their routine work and focus on tasks that require higher-level thinking and creativity. However, certain issues have been raised related to AutoML approaches lately. For instance, Sciuto et al. demonstrate that the search policies of state-of-the-art NAS techniques are no better than random policies [53]. Similarly, Li and Talwalkar show that random search with early-stopping is a competitive NAS baseline on two benchmark tasks, one from computer vision and one from natural language processing [54]. In addition, they discuss the reproducibility issues of published NAS results by elaborating on the necessity of having a tremendous amount of computation resources, the lack of available source material/code and the questionable robustness of published results [54].
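Results such as [53, 54] suggest that any search method should be compared against random search under the same evaluation budget. A minimal sketch of such a baseline follows; it is not the procedure of either paper (in particular, it omits early stopping), and the search space and the `toy_objective` function are hypothetical placeholders for a real training-and-validation run.

```python
import math
import random

# Hypothetical search space: each entry maps a name to a sampling function.
SEARCH_SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-5, -1),
    "num_layers": lambda rng: rng.randint(1, 8),
    "dropout": lambda rng: rng.uniform(0.0, 0.5),
}

def toy_objective(config):
    """Placeholder for validation accuracy; peaks near lr=1e-3 and 4 layers."""
    lr_term = 0.1 * (math.log10(config["learning_rate"]) + 3) ** 2
    depth_term = 0.05 * (config["num_layers"] - 4) ** 2
    return 1.0 - lr_term - depth_term - 0.2 * config["dropout"]

def random_search(objective, space, budget, seed=0):
    """Evaluate `budget` independently sampled configurations; keep the best."""
    rng = random.Random(seed)
    history = []
    for _ in range(budget):
        config = {name: sample(rng) for name, sample in space.items()}
        history.append((objective(config), config))
    best_score, best_config = max(history, key=lambda item: item[0])
    return best_score, best_config, history

if __name__ == "__main__":
    score, config, history = random_search(toy_objective, SEARCH_SPACE, budget=50)
    print(f"best of {len(history)} random configurations: {score:.3f}")
    print(config)
```

Fixing the budget (number of evaluated configurations) and the seed makes the baseline cheap to rerun, so a claimed improvement of a more elaborate search policy can be checked against it directly.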

Interestingly, the pursuit of simplifying machine learning development resulted in a significant increase in the algorithmic complexity of AutoML methods, including complicated training routines and architecture transformations [54]. This complexity makes it more difficult to pinpoint which components of the found solution are crucial for high performance. In addition, considering the lack of ablation studies (the analysis of systematic removal of components or features of a model in order to identify which of them are the most relevant) in many works, the AutoML field creates a dangerous ground for HARKing.
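An ablation study of the kind mentioned above can be organized as a simple loop: evaluate the full system, then re-evaluate with one component removed at a time. In the sketch below the component names and the `toy_evaluate` function are hypothetical; in practice each evaluation would be a full training run, ideally repeated over several seeds.

```python
def ablation_study(evaluate, components):
    """Score drop caused by removing each component from the full system."""
    full_score = evaluate(frozenset(components))
    drops = {}
    for name in components:
        reduced = frozenset(c for c in components if c != name)
        drops[name] = full_score - evaluate(reduced)
    return full_score, drops

def toy_evaluate(active):
    """Hypothetical scoring function standing in for train-and-validate."""
    score = 0.70                      # bare baseline
    if "augmentation" in active:
        score += 0.05
    if "skip_connections" in active:
        score += 0.10
    if "fancy_scheduler" in active:
        score += 0.00                 # contributes nothing in this toy example
    return score

if __name__ == "__main__":
    parts = ["augmentation", "skip_connections", "fancy_scheduler"]
    full, drops = ablation_study(toy_evaluate, parts)
    print(f"full system: {full:.2f}")
    for name, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
        print(f"  without {name}: score drops by {drop:.2f}")
```

A component whose removal leaves the score unchanged (the hypothetical `fancy_scheduler` here) is a candidate for an explanation that was attached to the result after the fact rather than tested.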

6 The Insert_Adjective_Here AI Wave

6.1 Ethical AI

Ethical issues regarding current developments in machine learning are perhaps much more critical than they are currently perceived to be, as we already encounter ethically questionable decisions made by algorithms, sometimes unbeknownst to us. Examples include replacing faces and voices in videos [55], detecting people using WiFi signals [56], deciding whose life to risk in an imminent accident [57] and generating fake news [58]. In various scenarios, ML also impacts decisions with legal and ethical dimensions, such as insurance, hiring and lending. Therefore, it is crucial to develop models that are fair and unbiased regardless of the biases in the data [59, 60]. This issue has been recently emphasized even by the European Commission in their ethics guidelines report for AI by underlining the importance of paying attention to situations involving more vulnerable groups such as children, persons with disabilities or minorities, or to situations with asymmetries of power or information (e.g. employee-employer or business-consumer) [61].

With established industries (e.g. firearms), it is common for the researchers and developers to leave the responsibility of ethics to entities that follow them (e.g. arms sellers and legislators). However, most AI-based systems have been much faster to deploy than conventional technology. Therefore, it is highly desirable for researchers to discuss ethical implications of their work and create a dialogue about them at the earliest possible stage. While selecting research topics that raise ethical issues itself serves this purpose, the desire to present good results might deter the discussion.

Another important ethical issue revolves around covert AI systems. A human should always know whether she/he is interacting with a human being or a machine, and it is our responsibility to ensure that this is reliably achieved. As AI practitioners, we should ensure that humans are made aware of the fact that they interact with an AI identity, or are able to request and validate it [61]. Thus, the hypothesis forming process should be clear and unambiguous, and should consider the possible use cases or implications as well. And in this pursuit, HARKing won’t do.

6.2 Human-centric AI

At the current stage, ML/DL algorithms are often designed as tools for defined domain experts, and thus they need to address human needs and psychology in a realistic manner. To decrease the amount of HARKing, high-level domain experts should be incorporated into the study teams from the beginning, as the collective intelligence of domain experts has considerable benefits and should be utilized whenever possible [62]. This will lead to more successful forming of a priori hypotheses and in the end should put pressure on scrutinizing results that do not support these hypotheses. Previously, worrying examples of failure in this respect have surfaced, where there has been only limited input from the domain experts [63]. High-level expertise is especially relevant for creating scientific hypotheses and should be differentiated from defining practical use-cases and training of AI, where a diverse spectrum of possible users should be involved in the project.

HARKing is potentially a serious threat especially in AI-driven change in medical practice. This applies mostly to the effect of failing to report a priori hypotheses that are unsupported by the current results [5]. The algorithms that will be used in medicine typically need to be clinically validated in laborious and high-cost trials [64]. Suppressing hypotheses after the results are known can lead to wrongly planned clinical trials, as the background scientific literature (meta-analyses) is biased and this can lead to losing credibility in the eyes of physicians and decision makers, together with spending a huge amount of limited human and financial resources available to run these trials.

6.3 Explainable, transparent and interpretable AI

Explainable artificial intelligence (XAI) is not only interesting as an academic curiosity; it is a necessity for the future. Developing explainable and transparent systems, as well as tools to measure transparency, is crucial for ethical AI development (see Section 6.1). The main concept of XAI is centered around causal attribution, as it is in human nature to understand causality naturally. Having such causal explanations will provide a substantial leap in reaching human-like perception of AI systems and anthropomorphism [65]. Explainable AI and model interpretability may be used in a synonymous manner. However, we think that explainability may fall under the causality domain and interpretability may belong to the mechanistic explanation of the algorithmic and model internals [66].

Recent deep learning algorithms provide high predictive performance but limited ways to reason about how an algorithm produces such a high level of performance, which can exceed human abilities [67]. Even though there have been studies addressing this problem and proposing solutions [68, 69, 70], a common consensus on performing interpretation of ML and especially DL models has not been reached. In fact, even the definition of interpretability itself is not established, neither mathematically nor axiomatically, in the literature [66]. Furthermore, recent studies question the robustness and security of these interpretation methods (e.g. to adversarial attacks) [71].

From the HARKing perspective, one can relatively easily reverse engineer results to fit a desired interpretation [69, 71, 72]. To avoid such practices, interpretable algorithms should not be reversible, nor should they only provide interpretation depending upon algorithmic priors. In this regard, approaches aiming at more theoretical explanations of why deep learning works, from learning theory to statistical physics [73, 74, 75], may be classified as true XAI research. These approaches, rather than focusing only on interpretation of the mechanistic approaches after the results are known, aim at finding an ab initio technique, i.e., one derived from first principles, to design a deep learning system without HARKing. Similarly, the use of causal inference has recently been shown to be promising in understanding the underlying mechanisms of deep learning systems [76] and, if descriptive, causal models can answer prediction, intervention and counterfactual questions [77].

In terms of transparency, an interesting question is whether we, as humans, are required to know all the details about the AI capabilities of the equipment and sensors that surround us. This can be argued both ways; for example, we know virtually nothing about the abilities of human drivers that use the same highway as we do. But similar to what happened with established technology in the automotive industry (like ABS and automatic transmission), we should be able to know the workings, accuracy statistics, advantages and disadvantages of emerging AI technologies. This concept overlaps with the abovementioned mechanistic interpretability issue and the perception of human-like attributions.

6.4 Reproducible AI

AI research is known, and appreciated, for its significant contributions to open science (e.g. preprint archives), open source (e.g. code repositories, sharing of trained models etc.), open data and reproducible research paradigms. Yet, as a sub-field of computer science, it still shares a similar reproducibility crisis [78, 79, 80, 81, 82]. As Donoho et al. suggested, a computational research paper is merely an advertisement unless it is presented with the underlying code and data [78]. We believe one of the reasons for this reproducibility crisis is HARKing.

One essential contribution to this crisis in ML and especially in DL research is the lack of understanding of the distinction between repeatability and reproducibility [83]. We consider repeatability as the ability to recreate the results of a study/paper and reproducibility as the ability to reach the same conclusions despite variations in the irrelevant components of the experiments [84]. Obviously, the role of hypothesizing driven by sound scientific methodology is essential in differentiating the two. As discussed in Section 2, the competitive nature of the field and the elevated pressure to achieve research and business outputs quickly lead to hurried claims of reproducibility (often confused with repeatability), just like the hurried claims of SotA. Once this is coupled with the avoidance of reporting negative results or similar selective reporting (see Section 3), a reproducibility crisis becomes inevitable.
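Under the definitions adopted above, the two notions can be checked separately. The following sketch assumes a hypothetical `run_experiment(method, seed)` function standing in for a full experiment: repeatability asks whether the same seed gives the same number, while reproducibility asks whether the conclusion (here, "method A beats method B") survives a change in an irrelevant factor such as the seed.

```python
import random

# Hypothetical true qualities of two methods; unknown in a real study.
TRUE_QUALITY = {"A": 0.90, "B": 0.80}

def run_experiment(method, seed, noise=0.01):
    """Stand-in for a full experiment whose outcome depends on the seed."""
    rng = random.Random(f"{method}-{seed}")
    return TRUE_QUALITY[method] + rng.gauss(0.0, noise)

def is_repeatable(method, seed):
    """Same code, same seed: do we get exactly the same number back?"""
    return run_experiment(method, seed) == run_experiment(method, seed)

def conclusion_holds(seed):
    """The claim under test: method A outperforms method B."""
    return run_experiment("A", seed) > run_experiment("B", seed)

def reproducibility_rate(seeds):
    """Fraction of seeds for which the same conclusion is reached again."""
    seeds = list(seeds)
    return sum(conclusion_holds(s) for s in seeds) / len(seeds)

if __name__ == "__main__":
    print("repeatable with a fixed seed:", is_repeatable("A", seed=0))
    print("conclusion holds across seeds:", reproducibility_rate(range(20)))
```

A result can pass the first check and fail the second: a fixed seed always returns the same number, yet the claim it supports may not survive a different seed when the true gap between methods is small relative to the noise.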

It is important to acknowledge the initiatives for encouraging and increasing reproducibility in ML/DL research. For instance, at NeurIPS 2019, a reproducibility checklist and a code submission policy were introduced, under which code is expected to accompany the accepted papers. At the AAAI Conference on Artificial Intelligence in 2019, a workshop on reproducible AI was held. Similarly, a workshop on reproducibility in ML was held at the International Conference on Learning Representations (ICLR) in 2019. Nevertheless, open questions remain, such as "How can we measure reproducibility?", "What does it mean for a paper to have successful or unsuccessful replications?" or "What can the ML community learn from other fields?".

6.5 Accountable AI

Accountability of algorithmic decision-making systems (e.g. credit scoring) has been under discussion, as well as under implementation, for decades, especially from the regulatory and legal perspective. However, the rapid pace of AI developments and their real-world applications has introduced circumstances in which high-stakes decisions with significant consequences for people and broader society are made by ML algorithms. One such potential impact is an accident, which can be defined in this context as an unintended and harmful behavior that emerges from the poor design of real-world AI systems. Amodei et al. provide several concrete examples of such possible problems in AI safety, including negative side effects (e.g. due to poorly designed objective functions), sensitivity to distributional shifts (the environment shifting away from the training environment) and reward hacking (the system gaming its objective function) [85].

Naturally, AI accountability is intertwined with the explainability, reproducibility, fairness and human-centrism of the design of these systems. Depending on their implementation and execution, policies that demand explanations of algorithmic decisions may help prevent negative consequences, or may unintentionally hinder innovation while providing little meaningful protection. For instance, the European Union General Data Protection Regulation (GDPR) [86] has provided a potential accountability mechanism through a right to explanation since May 2018, but its concrete consequences are yet to be observed. Regarding the role of reproducibility in the accountability of AI systems, the fatal accident recently caused by an autonomous car (belonging to Uber) is a suitable example. The preliminary report released by the United States National Transportation Safety Board stated that the self-driving system software misclassified the pedestrian and that the system was not designed to alert the human operator under such emergency conditions [87]. For the fair design of AI systems from the accountability perspective, the Gender Shades study [88] serves as an interesting example. The study presented the biases found in commercial automated facial analysis algorithms [88], and a subsequent study elaborated on the concept of actionable auditing by investigating the impact of publicly naming biased performance results of commercial AI products [89]. Recent studies have also proposed hybrid models in which humans and machines interact (for explaining failures [90] or intervening in operations [91]) as opportunities for better AI accountability.

From the industry perspective, considering that large companies and corporations have entered an "AI race" in order to be the first to successfully employ AI in their domains, it is not surprising that accountability takes lower priority than invention and market leadership. But from the scientific methodology perspective, taking the accountability of ML/DL models into account in the early stages of the research process, such as hypothesis forming, is imperative.

6.6 Privacy-aware AI

Current implementations of ML algorithms require access to data, which opens up potential security and privacy risks. Therefore, the notion of privacy-aware or privacy-preserving AI has emerged, and several studies have been conducted along this paradigm, leading to influential concepts including federated learning and differential privacy [92, 93]. With the use of homomorphic encryption, deep learning model inference on encrypted data was also shown to be possible with a small trade-off in accuracy [94, 95]. In addition, Shokri et al. introduced and elaborated on the concept of the membership inference attack, i.e., given a black-box machine learning model and a data record, determining whether this record was used as part of the model’s training dataset or not [96]. All these advancements make it clear that several metrics are needed to assess and compare ML models, and that privacy-preserving capability is one of them. For good scientific conduct, our hypotheses on both the methods and the impacts of our research should consider these concepts.
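To make the membership inference setting concrete, the following is a deliberately simplified sketch and not the attack procedure of [96]. It assumes a hypothetical toy model, `MemorizingModel`, that exposes only a confidence score and is most confident on records it has memorized; the attacker, `infer_membership`, simply thresholds that score. The class, the function, the threshold value and the synthetic one-dimensional records are all assumptions made for illustration.

```python
import random

class MemorizingModel:
    """Hypothetical black-box model that overfits: its confidence is highest
    (exactly 1.0) on records identical to a memorized training record."""

    def __init__(self, train_records):
        self._train = list(train_records)

    def confidence(self, record: float) -> float:
        # Distance to the nearest memorized record drives the confidence.
        nearest = min(abs(record - t) for t in self._train)
        return 1.0 / (1.0 + nearest)

def infer_membership(model, record: float, threshold: float = 0.999) -> bool:
    """Attacker's guess: very high confidence suggests a training record.
    Uses only black-box access (the returned confidence)."""
    return model.confidence(record) >= threshold

rng = random.Random(0)
members = [rng.uniform(0, 100) for _ in range(50)]      # used for training
non_members = [rng.uniform(0, 100) for _ in range(50)]  # never seen by the model
model = MemorizingModel(members)

member_hits = sum(infer_membership(model, r) for r in members)
false_alarms = sum(infer_membership(model, r) for r in non_members)
print(f"members flagged: {member_hits}/{len(members)}")
print(f"non-members flagged: {false_alarms}/{len(non_members)}")
```

The gap between the two counts is one possible, very coarse indicator of how much a model leaks about its training set; the toy model here is constructed to leak as much as possible, so the sketch shows the mechanism rather than a realistic leakage level.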

7 Conclusion

Hypothesizing after the results are known has been observed in several fields of research throughout history, and deep learning research has recently exhibited several instances of it as well. In this work, we gave examples of HARKing in machine learning, and especially in deep learning research. We elaborated on the reasons and consequences of this troubling trend by discussing the overemphasis on single-metric model comparisons and benchmarks (Section 2), the tendency to refrain from reporting negative results (Section 3), the failure of generalization (Section 4) and automatic machine learning (Section 5). Finally, HARKing and the importance of formulating an a priori hypothesis were reviewed from the perspective of ethical, human-centric, explainable, reproducible, accountable and privacy-preserving AI notions (Section 6).

We would like to emphasize the importance of discussion for achieving concrete reforms on the issues raised here. Cultural change and legitimate interventions (such as the proposal in Section 3) in deep learning research should be encouraged by addressing these issues as constructively as we can. As the intended progress is a collaborative effort, researchers, practitioners, reviewers, editors, policy-makers, decision-makers, funding agencies, corporations and governmental entities need to act collectively. We believe that preventing HARKing will help in engineering ethical, accountable, transparent, unbiased and scientifically superior deep learning solutions for the common good of the society we will eventually be living in. We also hope and believe that this work will stir discussions and debates, and will contribute towards that goal.