In the context of science, reproducibility refers to the ability to reproduce the scientific results of other researchers. Science is a highly collaborative process, with most research built on the work of others. Generating reproducible research is important, as it helps others to verify findings and build further experiments using shared code as a starting point. In recent years, experimental reproducibility, or the lack of it, has been a point of concern in the scientific community. This problem, also known as the reproducibility crisis or the replication crisis, has been extensively studied in the fields of psychology [john2012measuring], economics [camerer2016evaluating] and medicine [begley2013reproducibility]. More recently, [hutson2018artificial] and [olorisade2017reproducibility] have shown that reproducibility is a problem in the field of machine learning as well.
Machine learning research typically involves several rounds of experimentation, which come with challenges such as inherent randomness due to seeds, a large number of hyper-parameters, incremental changes to the code, and experiments that are run in various environments with differing software and hardware. On occasion, details that are necessary to reproduce experiments are omitted from papers due to confidentiality requirements or space constraints. In addition, due to the publish-or-perish pressure of academia, researchers are incentivized to spend their time trying new ideas or different experiments instead of building robust infrastructure to support rigorous experimentation. As a result, the code written for such experiments is often of poor quality, and most researchers are reluctant to share it. When such code is shared, its poor quality makes it difficult to read or use, and therefore difficult to reproduce results with.
|Paper has experiments||100%|
|Paper uses neural networks|||
|All hyperparams for proposed algorithm are provided||90%|
|All hyperparams for baselines are provided||60%|
|Code is linked||55%|
|Method for choosing hyperparams is specified||20%|
|Evaluations on some variation of a hold-out test set||10%|
|Significance testing applied||5%|
Table I: Analysis of 50 Reinforcement Learning publications at major Machine Learning conferences. These results suggest that reproducibility is still difficult, owing to a combination of unavailable source code, incompletely reported hyper-parameters, and unreported hyper-parameter selection methods. (Source: video recording of the talk, see http://bit.ly/neurips_reproducibility)
||Two Dimensional||Image Auto-Encoding|
||Video Action Recognition||Image Similarity|
||Semantic Segmentation||Image Classification|
||Sentence Classification||Word Spotting|
Reproducible research has been a topic of discussion for several years now. In 2015, Nature surveyed 1500 scientists [Baker2016a] across different domains to gather their opinions on reproducibility within their respective fields. A large majority of 90% answered that there is a significant (52%) or a slight (38%) crisis. Unfortunately, this study does not directly cover Computer Science or Machine Learning.
At NeurIPS 2018, Joelle Pineau gave an invited talk on the topic of "Reproducible, Reusable, and Robust Reinforcement Learning" (see: http://bit.ly/neurips_reproducibility). In this talk she presented the results shown in Table I. The underlying study analyzed 50 Reinforcement Learning papers presented at large Machine Learning conferences (NeurIPS, ICML, ICLR) and showed that only a small majority of them link to source code. Combined with the facts that not all hyper-parameters are provided and that the method for choosing them is rarely specified, this leads to the conclusion that reproducing the results of these papers would be a very difficult task.
The abovementioned examples show that there is a need for tools that support researchers in performing reproducible research. Such a tool should provide support for the following when running experiments:
Fixing and storing random seeds
Ensuring that an exact version of the code is available
Providing boilerplate code for performing experiments
Different tools have been proposed previously, and we provide an overview of them in [albertipondenkandath2018deepdiva]. These solutions, however, often target only one or a few aspects of a machine learning pipeline, whereas we aim at providing support for the entire work-flow.
Moreover, DeepDIVA provides basic implementations for common tasks, supports various visualizations and addresses the situation where an algorithm has non-deterministic behavior, e.g., because of random initialization. Finally, unlike other attempts that aim to be platform or language independent, DeepDIVA relies on a working Python environment and specific settings, allowing it to be lightweight and to use GPU hardware in a straightforward way.
We aim to contribute towards the growing need for reproducibility and openness in the machine learning community by providing DeepDIVA, a framework that tries to close the gap between good engineering practices and fast moving research work-flows. While an early version of DeepDIVA built for the purposes of handwriting recognition was presented in [albertipondenkandath2018deepdiva], this paper presents a much more extensive framework with a wide range of functionality and templates for machine learning tasks.
DeepDIVA allows for quick experimentation in a variety of common scenarios, such as image, video and language classification, similarity matching, image segmentation and image auto-encoding, among other things. The framework has been built in a modular and easily extensible manner such that additional tasks and capabilities can be added without extensive effort. DeepDIVA integrates popular tools such as TensorBoard (https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard) for aggregating all the visualizations and results produced, and SigOpt [sigopt] for hyper-parameter optimization. Finally, the documentation and tutorials provided help smooth the learning curve for new users or contributors.
II Reproducing Experiments
Often, reproducing the work of others is quite difficult because not all papers come with the sources necessary to make the experiments work. Even when the code is available, there are often issues with getting it to run. We try to alleviate this problem by proposing to conduct all research inside the easy-to-configure DeepDIVA environment.
II-A How It Is Done
The DeepDIVA environment can be set up using a one-click bash script (see Section III). With a functioning DeepDIVA environment, one can reproduce any experiment using: a link to the appropriate fork of DeepDIVA, the specific commit identifier of the particular experiment, and either the list of parameters that were used to run the experiment or a bash file that contains the exact commands to re-run it.
II-A1 Log Everything
The framework saves logs that detail several facets of the training procedure, such as: experimental setup parameters, information about the training data, evaluation metrics, model parameters, and all visualizations generated during the training. In addition to all of this, the framework also makes a snapshot of the code-base (as seen at execution time) and stores it along with the logs. With all of this information, it’s possible to analyze the training procedure after the process or use intermediate model representations for other purposes. This can be quite helpful for experiments that take longer amounts of time to run.
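As a rough illustration of the kind of record described above, the following minimal sketch writes the experimental setup parameters to a timestamped log folder. The function name `write_run_log`, the folder layout and the file name `config.json` are assumptions made for this example and are not taken from DeepDIVA.

```python
import json
import time
from pathlib import Path


def write_run_log(log_root, experiment_name, params):
    """Create a timestamped folder and store the run parameters as JSON.

    Hypothetical helper for illustration; not part of DeepDIVA's API.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    run_dir = Path(log_root) / experiment_name / stamp
    run_dir.mkdir(parents=True, exist_ok=True)
    # Sorted keys keep the file stable, which makes two runs easy to diff.
    with open(run_dir / "config.json", "w") as handle:
        json.dump(params, handle, indent=2, sort_keys=True)
    return run_dir
```

Storing the parameters as plain JSON next to the other artifacts of a run keeps the record readable without the framework that produced it.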
II-A2 Seed All Randomness
Controlling for code and parameters is not enough to ensure reproducibility in machine learning. Many machine learning methods are randomly initialized, and the results of such experiments depend strongly on that initialization. To make a perfect reproduction, or indeed to compare the effect of methods or parameters, it is necessary to be able to remove all sources of randomness from the experiment. DeepDIVA allows the user to specify a seed, with which all sources of randomness in the system are controlled, allowing for perfect reproducibility.
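The idea of seeding can be sketched in a few lines. The helper `seed_everything` below is a hypothetical name, not DeepDIVA's implementation. It seeds Python's own generator and, if it happens to be installed, NumPy. A deep-learning framework would additionally need its own seeding call, and GPU execution may require further framework-specific settings.

```python
import os
import random


def seed_everything(seed):
    """Seed the random number generators this sketch knows about.

    Hypothetical helper for illustration; not DeepDIVA's actual code.
    """
    # Only affects hash randomization in Python processes started afterwards.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np  # optional dependency, seeded only if available
        np.random.seed(seed)
    except ImportError:
        pass
    # A deep-learning framework would need its own seeding call here.


seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
assert first == second  # same seed, same sequence of draws
```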
II-A3 Enforcing Version Control
A scenario that most researchers are likely familiar with is the sudden inability to reproduce results that were obtained a few code changes earlier. This can be due to changing hard-coded parameters, or (un)commenting lines of code to change the execution flow. We aim to tackle this scenario by requiring the user to commit their code before running any experiments. Before running an experiment, the framework checks whether the user has committed their code. However, committing every small change before running an experiment can be annoying and tedious during development. In this case, the backup solution of the framework activates and stores a copy of all the source code in the repository alongside the log files of the experiment.
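A minimal sketch of such a check is given below, assuming that `git` is available on the path. The function names are hypothetical, and the way DeepDIVA performs the check internally may differ. `git status --porcelain` prints one line per modified or untracked file, so empty output means a clean working tree.

```python
import shutil
import subprocess
from pathlib import Path


def has_uncommitted_changes(porcelain_output):
    """Interpret `git status --porcelain` output: any line means a change."""
    return any(line.strip() for line in porcelain_output.splitlines())


def git_status(repo_dir):
    """Return the machine-readable status of the repository at repo_dir."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout


def snapshot_sources(repo_dir, log_dir):
    """Fallback: copy the source tree (without .git) next to the logs.

    Assumes log_dir lies outside repo_dir, so the copy cannot recurse
    into itself.
    """
    target = Path(log_dir) / "code_snapshot"
    shutil.copytree(repo_dir, target, ignore=shutil.ignore_patterns(".git"))
    return target
```

A caller could refuse to start when `has_uncommitted_changes(git_status(repo))` is true, or call `snapshot_sources` instead when strict enforcement is switched off.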
II-A4 Data Integrity Management
In order to ensure full reproducibility, one requires access to the same data. Since the collection, storage and dissemination of datasets is beyond the scope of the framework, it becomes necessary to have a way to ensure that one is in possession of the exact same data as the experiment to be reproduced. This feature has been highly requested by the community since the framework's initial release.
In DeepDIVA we implemented the verification through the use of a footprint file. The creation of this footprint is automated and happens at the start of an experiment (if the file has not been generated before). The content of the file is a JSON tree which stores the entire dataset structure in detail, i.e., all file names and their SHA-1 hashes. Moreover, there is a global “last modified” tag which contains the most recent modification time of the entire folder (spanning every file contained in it recursively). This tag is used at run-time to verify whether the dataset has been modified since the footprint was generated. This check is very quick and hence does not affect the regular flow or run-time of an experiment. However, this type of verification, albeit quick, is not secure against malicious manipulation of the data, since a skilled attacker might modify it in subtle ways [alberti2018tampering] and tamper with the timestamp on the file system too. To combat this (very remote) threat, it is possible to activate a deep inspection of the dataset integrity using the stored SHA-1 hashes. If this deep verification succeeds, one can be sure that there are no differences between the data on the file system and the dataset described by the footprint.
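The two levels of verification can be illustrated with the following sketch. It is a simplification under stated assumptions: the footprint is a flat mapping from relative paths to SHA-1 hashes rather than a tree, it is kept outside the dataset folder, and all function and key names are hypothetical rather than DeepDIVA's own.

```python
import hashlib
import json
import os


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_footprint(root):
    """Record every file's SHA-1 and the newest modification time found."""
    files = {}
    last_modified = os.stat(root).st_mtime_ns
    for folder, _, names in os.walk(root):
        last_modified = max(last_modified, os.stat(folder).st_mtime_ns)
        for name in sorted(names):
            path = os.path.join(folder, name)
            files[os.path.relpath(path, root)] = sha1_of_file(path)
            last_modified = max(last_modified, os.stat(path).st_mtime_ns)
    return {"last_modified": last_modified, "files": files}


def latest_mtime(root):
    """Newest modification time (in ns) of the folder and all its content."""
    newest = os.stat(root).st_mtime_ns
    for folder, _, names in os.walk(root):
        paths = [folder] + [os.path.join(folder, name) for name in names]
        newest = max([newest] + [os.stat(p).st_mtime_ns for p in paths])
    return newest


def quick_check(root, footprint):
    """Fast check: compare only the global modification time."""
    return latest_mtime(root) == footprint["last_modified"]


def deep_check(root, footprint):
    """Slow check: re-hash every file and compare with the stored hashes."""
    return build_footprint(root)["files"] == footprint["files"]
```

The footprint is a plain dictionary, so it can be stored with `json.dump`. The quick check reads only file metadata, which is why it can be fooled by restoring a timestamp after editing a file, whereas the deep check reads every byte and detects the change.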
III Productivity Out-of-the-Box
Most researchers have established workflows which they are comfortable with. However, these workflows may not comply with best practices, and it can be quite difficult to change the way one works in order to keep up with the ever-increasing best-practice ideals of the field. Therefore, we try to make the experience of using DeepDIVA for the first time quick, simple and painless, and allow researchers to be operational and productive as soon as possible.
DeepDIVA is very easy to set up on macOS and Ubuntu. Once the repository has been cloned, setting up a fully functional environment is a single bash script away. Once set up, the framework has boilerplate code for several different scenarios (as seen in Table II), notably: word-spotting, similarity matching for images, and classification of images, natural language, video and bi-dimensional data.
These templates cover many common scenarios that researchers often encounter, and as the scenarios are written in a modular fashion, they can be easily adapted or extended to other tasks in a quick and painless manner. This helps a researcher be quickly productive as implementing the boilerplate constitutes a significant part of the development process. Starting from an already implemented task and adapting it for a specific purpose allows researchers to avoid developing redundant code. The following sections recap several features that support a typical research workflow in DeepDIVA (introduced in [albertipondenkandath2018deepdiva]).
III-A Prepare Your Data
The first step in any machine learning task is acquiring and preparing the data. DeepDIVA comes equipped with some tools to support this task.
Download a dataset with a click: DeepDIVA supports downloading and preparing several datasets (CIFAR-10 [cifar], MNIST [mnist], SVHN [svhn], STL-10 [stl]) with a single command.
Split your dataset: The framework contains a script to split an arbitrary dataset (stored in a standard format) into classic machine learning splits.
Once your data has been downloaded or prepared in the correct format, DeepDIVA loads and pre-processes the datasets such that they can be used in the appropriate tasks.
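To make the splitting step concrete, the following minimal sketch divides a list of samples into training, validation and test subsets in a reproducible way. The function name, the split fractions and the default seed are assumptions chosen for this example and do not describe DeepDIVA's script.

```python
import random


def split_dataset(items, train=0.7, val=0.15, seed=42):
    """Shuffle items reproducibly and cut them into train/val/test lists.

    Hypothetical helper; the fractions are illustrative defaults.
    """
    if not 0 < train + val < 1:
        raise ValueError("train + val must leave room for a test split")
    shuffled = sorted(items)  # fixed starting order, independent of input order
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )
```

Sorting before shuffling means that the same seed yields the same split even if the files were listed in a different order, which matters when the listing comes from the file system.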
III-B See What Your Network Thinks With 2D Data
During the research process for a new idea, you might want to try out your idea on a simple toy dataset before progressing to more complex datasets. To enable you to do so, DeepDIVA offers a workflow to test out your idea on bi-dimensional data, which allows you to visualize exactly what your network thinks of the output space. DeepDIVA contains all the necessary code to perform such a visual analysis of the network, and all a user needs to do is to modify the task and implement their research idea.
III-C Real-Time Data Visualization
We use the TensorBoard application developed by TensorFlow [tensorflow] to aggregate all the visualizations produced by the framework. Normal training and validation curves are plotted directly, and all other visualizations produced by the framework are added to the images section of the corresponding experiment. DeepDIVA dynamically generates plots for executions (see Fig. (a)) and makes them available in TensorBoard, so that experiments with differing configurations can be compared, as well as the performance of two or more methods. The multi-run flag automatically reruns an experiment a given number of times and aggregates the results into a plot (see Fig. (b)). DeepDIVA also generates a confusion matrix during evaluation (see Fig. (c)).
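One plausible way to aggregate repeated runs is to compute the per-epoch mean and standard deviation of a metric. The sketch below shows this under the assumption that every run reports one value per epoch. The exact aggregation used for the multi-run plot is not specified here, so this is an illustration and not the framework's method.

```python
from statistics import mean, stdev


def aggregate_runs(curves):
    """Per-epoch mean and standard deviation over several runs.

    curves: list of equally long lists, one metric value per epoch.
    """
    if len({len(curve) for curve in curves}) != 1:
        raise ValueError("all runs must cover the same number of epochs")
    per_epoch = list(zip(*curves))  # regroup values by epoch
    means = [mean(values) for values in per_epoch]
    # The sample standard deviation needs at least two runs.
    spreads = [stdev(values) if len(values) > 1 else 0.0 for values in per_epoch]
    return means, spreads
```

Plotting the mean with a band of one standard deviation around it gives a quick visual impression of how sensitive a method is to its random initialization.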
III-D Automatic Hyper-parameter Optimization
Instead of having to perform the tedious and time-consuming procedure of optimizing hyper-parameters by hand, the researcher can simply use a single command line parameter and let the framework deal with it thanks to SigOpt [sigopt] integration.
IV Be a Part Of It
Many of the available tools are extremely good at what they are designed for, but they often have steep learning curves. Even during the setup phase, several tools expect a user to have the skill, time and patience to set up an environment manually. Indeed, the authors of this paper have even encountered GitHub repositories where the setup instructions are to simply install packages as you encounter errors. This often discourages the average user and significantly increases the time required to get to a productive stage.
Additionally, the quality of the documentation or tutorials (or lack thereof) determines the impact of a tool, no matter how effective it may be. When this is combined with stringent contribution guidelines, or the lack of an open-source nature, it can render a tool community-unfriendly. This is a major issue for the field, as the quality of a framework is measured not only by the quality of the results delivered by it, but also by its maintenance, its learning curve and its adoption overhead.
To foster a friendly and productive community of researchers, we try to make DeepDIVA accessible by tackling the aforementioned problems as follows:
Documentation: The framework is documented (see the documentation at link_redacted_for_blind_submission) such that it can be used in an educational environment for didactic purposes.
Tutorials: There is a friendly “Getting started” guide followed by a plethora of tutorials (see tutorials at link_redacted_for_blind_submission) which help a new user learn and use the available features efficiently. For example, there are tutorials on how to prepare the data, load it and run the implemented tasks (see Table II), as well as how to visualize the results. More experienced users can also find tutorials on how to extend the framework and perform advanced operations with it. These tutorials are not intended to teach someone machine learning, but rather how to use DeepDIVA to better realize their ideas.
Fork It (see the repository at link_redacted_for_blind_submission): DeepDIVA is built with the goal of being extensible and modular. It is open-source and comes with verbose documentation such that the core code is accessible to everyone. It has been designed in a modular way which favors and encourages growth and modifications, in contrast with other solutions which optimize performance at the expense of maintenance. Moreover, being a collaborative project, additions suggested by users can be integrated, benefiting the community as a whole. This is not always possible, or can be difficult, with closed-source software, commercial solutions or impenetrable core code.
V Conclusion and Future Work
We contribute towards meeting the demands for reproducibility and openness in machine learning by providing DeepDIVA: an open-source Python deep-learning framework designed to enable quick and intuitive setup of reproducible experiments with a large range of useful analysis functionality. We show how researchers can quickly include it in their workflow (thanks to detailed documentation and easy tutorials), thus saving time while making their research easy to reproduce. In the near future, DeepDIVA will include more visualization tools, provided by the small (but thriving) community of developers that is forming around it.
The work presented in this paper has been partially supported by the HisDoc III project funded by the Swiss National Science Foundation with the grant number _.