
Improving Reproducible Deep Learning Workflows with DeepDIVA

The field of deep learning is experiencing a trend towards producing reproducible research. Nevertheless, it is still often a frustrating experience to reproduce scientific results. This is especially true in the machine learning community, where it is considered acceptable to have black boxes in your experiments. We present DeepDIVA, a framework designed to facilitate easy experimentation and the reproduction of experiments. This framework allows researchers to share their experiments with others, while providing functionality that supports easy experimentation, such as boilerplate code, experiment management, hyper-parameter optimization, verification of data integrity, and visualization of data and results. Additionally, the code of DeepDIVA is well-documented and supported by several tutorials that allow a new user to quickly familiarize themselves with the framework.





I Introduction

In the context of science, reproducibility refers to the ability to reproduce the scientific results of other researchers. Science is a highly collaborative process, with most research built on the work of others. Generating reproducible research is important, as it helps others to verify findings and build further experiments using shared code as a starting point. In recent years, experimental reproducibility, and its lack, has been a point of concern in the scientific community. This problem, also known as the reproducibility crisis or the replication crisis, has been extensively studied in the fields of psychology [john2012measuring], economics [camerer2016evaluating] and medicine [begley2013reproducibility]. Recently, [hutson2018artificial] and [olorisade2017reproducibility] have shown that reproducibility is a problem in the field of machine learning as well.

Machine learning research typically involves several rounds of experimentation, which come with challenges such as inherent randomness due to seeds, a large number of hyper-parameters, incremental changes to the code, and experiments that are often run in different environments, with variations in software and hardware between runs. On occasion, details that are necessary to reproduce experiments are not included due to confidentiality requirements and space constraints in the papers. In addition, due to the pressure of the publish-or-perish nature of academia, researchers are incentivized to spend their time trying new ideas or different experiments instead of building robust infrastructure to support rigorous experimentation. Therefore, the code written for such experiments is often of poor quality, and most researchers are reluctant to share it. When such code is shared, its poor quality makes it difficult to read or use, and therefore difficult to reproduce results with.

Paper has experiments                                     100%
Paper uses neural networks
All hyperparams for proposed algorithm are provided        90%
All hyperparams for baselines are provided                 60%
Code is linked                                             55%
Method for choosing hyperparams is specified               20%
Evaluations on some variation of a hold-out test set       10%
Significance testing applied                                5%

TABLE I: Analysis of 50 Reinforcement Learning publications at major Machine Learning conferences. These results lead to the conclusion that reproducibility is still difficult, through a combination of unavailable source code, incomplete reporting of hyper-parameters, and missing descriptions of the hyper-parameter selection method. (Source: video recording of the talk)

Task                        Task
Two Dimensional             Image Auto-Encoding
Video Action Recognition    Image Similarity
Semantic Segmentation       Image Classification
Sentence Classification     Word Spotting

TABLE II: Examples of tasks for which the entire pipeline is already implemented in DeepDIVA. These tasks are inherently very different, especially in their source domain (e.g., images, videos, text and bi-dimensional data). Yet their workflow and the code infrastructure necessary for handling them are mostly similar. DeepDIVA leverages these similarities by implementing a modular structure which, combined with the extensive documentation and the multiple tutorials available, makes extending or modifying an existing task an easy and swift procedure.

Reproducibility Crisis

Reproducible research has been a topic of discussion within the scientific community for several years now. In 2015, Nature surveyed 1500 scientists [Baker2016a] across different domains to gather their opinions on reproducibility within their respective fields. A large majority of 90% answered that there is a significant (52%) or a slight (38%) crisis regarding this topic. Unfortunately, this study does not directly account for Computer Science or Machine Learning.

At NeurIPS 2018, Joelle Pineau gave an invited talk on the topic of "Reproducible, Reusable, and Robust Reinforcement Learning". In this talk she presented the results shown in Table I. The underlying study analyzed 50 Reinforcement Learning papers presented at large Machine Learning conferences (NeurIPS, ICML, ICLR), showing that only a small majority of them provide source code with their publications. Combined with the facts that not all hyper-parameters are provided and that the method for choosing them is rarely specified, this leads to the conclusion that reproducing the results of these papers would be a very difficult task.

The above-mentioned examples show that there is a need for tools that support researchers in performing reproducible research. Such a tool should provide support for the following when running experiments:

  • Fixing and storing random seeds

  • Ensuring that an exact version of the code is available

  • Providing boilerplate code for performing experiments

Different tools have been proposed previously and we provide an overview of them in [albertipondenkandath2018deepdiva]. These solutions, however, often target only one or a few aspects of a machine learning pipeline, whereas we aim at providing support for the entire workflow.

Moreover, DeepDIVA provides basic implementations for common tasks, supports various visualizations and addresses the situation where an algorithm has non-deterministic behavior, e.g., because of random initialization. Finally, unlike other attempts which aim to be platform or language independent, DeepDIVA relies on a working Python environment and specific settings, allowing it to be lightweight and to use GPU hardware in a straightforward way.

Main Contribution

We aim to contribute towards the growing need for reproducibility and openness in the machine learning community by providing DeepDIVA, a framework that tries to close the gap between good engineering practices and fast-moving research workflows. While an early version of DeepDIVA built for the purposes of handwriting recognition was presented in [albertipondenkandath2018deepdiva], this paper presents a much more extensive framework with a wide range of functionality and templates for machine learning tasks.

DeepDIVA allows for quick experimentation in a variety of common scenarios, such as image, video and language classification, similarity matching, image segmentation and image auto-encoding, among others. The framework has been built in a modular and easily extensible manner such that additional tasks and capabilities can be added without extensive effort. DeepDIVA integrates popular tools such as TensorBoard (for aggregating all the visualizations and results produced) and SigOpt [sigopt] (for hyper-parameter optimization). Finally, the documentation and tutorials provided help smooth the learning curve for new users or contributors.

II Reproducing Experiments

Often, reproducing the work of others is quite difficult because not all papers come with the sources necessary to make the experiments work. Even when the code is available, there are often issues with getting it to run. We try to alleviate this problem by proposing to conduct all research inside the easy-to-configure DeepDIVA environment.

II-A How It Is Done

The DeepDIVA environment can be set up using a one-click bash script (see Section III). With a functioning DeepDIVA environment, one can reproduce any experiment using: a link to the appropriate fork of DeepDIVA, the specific commit identifier of the particular experiment, and either the list of parameters that were used to run the experiment or a bash file that contains the exact commands to re-run it.
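The three ingredients listed above can be captured in a small record. The following is a minimal sketch and not DeepDIVA's own tooling: the class `ExperimentRecord`, its field names, and the example URL, commit identifier and command line are all hypothetical placeholders. Only `git clone` and `git checkout` are real commands.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentRecord:
    """Hypothetical record of what is needed to re-run an experiment."""
    fork_url: str   # link to the fork of the framework
    commit: str     # commit identifier of the experiment
    command: str    # exact command line that was used for the run

    def to_script(self, target_dir: str = "repro") -> str:
        """Render a bash script that checks out the code and re-runs the command."""
        lines = [
            "#!/bin/bash",
            "set -e",
            f"git clone {shlex.quote(self.fork_url)} {shlex.quote(target_dir)}",
            f"cd {shlex.quote(target_dir)}",
            f"git checkout {shlex.quote(self.commit)}",
            self.command,
        ]
        return "\n".join(lines) + "\n"


record = ExperimentRecord(
    fork_url="https://example.org/some-user/framework-fork.git",  # placeholder URL
    commit="0123abc",                                             # placeholder commit
    command="python run_experiment.py --seed 42",                 # placeholder command
)
print(record.to_script())
```

Sharing such a record alongside a paper would give a reader everything named in the paragraph above in one place.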

II-A1 Log Everything

The framework saves logs that detail several facets of the training procedure, such as: experimental setup parameters, information about the training data, evaluation metrics, model parameters, and all visualizations generated during the training. In addition to all of this, the framework also makes a snapshot of the code-base (as seen at execution time) and stores it along with the logs. With all of this information, it’s possible to analyze the training procedure after the process or use intermediate model representations for other purposes. This can be quite helpful for experiments that take longer amounts of time to run.

II-A2 Seed All Randomness

Controlling for code and parameters is not enough to ensure reproducibility in machine learning. Many machine learning methods are randomly initialized, and the results of such experiments depend heavily on the initialization. To make a perfect reproduction, or indeed to compare the effect of methods or parameters, it is necessary to be able to remove all sources of randomness from the experiment. DeepDIVA allows the user to specify a seed, upon which all sources of randomness in the system are controlled, allowing for perfect reproducibility.
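As an illustration of what controlling all sources of randomness can involve, the sketch below seeds the generators of a typical Python stack. It is an assumption about a common setup, not a description of DeepDIVA's internals: the function name `seed_everything` is hypothetical, and NumPy and PyTorch are seeded only if they happen to be installed.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed every random number generator available in this process."""
    # Only affects Python processes started later; the hash seed of the
    # current interpreter is fixed at start-up.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional in this sketch
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
        # Trade speed for determinism in cuDNN convolution algorithms.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch is optional in this sketch


seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)  # True: the same seed yields the same sequence
```

Some GPU operations remain non-deterministic even with fixed seeds, so a fixed seed is a necessary but not always sufficient condition for bit-identical results.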

II-A3 Enforcing Version Control

A scenario that most researchers are likely familiar with is the sudden inability to obtain results that you had a few code changes earlier. This can be due to changing hard-coded parameters, or (un)commenting lines of code to change the execution flow. We aim to tackle this scenario by requiring the user to commit their code before running any experiment: the framework checks, before running an experiment, whether the user has committed their code. However, committing every small change before running experiments can be annoying and tedious during development. In this case, the backup solution of the framework activates and stores a copy of all the source code in the repository along with the log files of the experiment.
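The two mechanisms described above, a commit check and a source snapshot as fallback, can be sketched as follows. This is an illustrative approximation under stated assumptions and not the framework's implementation: the function names are hypothetical, the check relies on the real `git status --porcelain` command, and the log folder is assumed to lie outside the repository.

```python
import shutil
import subprocess
from pathlib import Path


def has_uncommitted_changes(porcelain_output: str) -> bool:
    """Interpret `git status --porcelain` output: any line means pending changes."""
    return any(line.strip() for line in porcelain_output.splitlines())


def git_porcelain(repo_dir: str) -> str:
    """Return the machine-readable status of the repository at repo_dir."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout


def snapshot_sources(repo_dir: str, log_dir: str) -> Path:
    """Fallback: copy the source tree, without VCS metadata and caches,
    into the log folder (assumed to be outside repo_dir)."""
    target = Path(log_dir) / "code_snapshot"
    shutil.copytree(
        repo_dir, target,
        ignore=shutil.ignore_patterns(".git", "__pycache__", "*.pyc"),
    )
    return target
```

A launcher could call `has_uncommitted_changes(git_porcelain(repo))` and either refuse to start or fall back to `snapshot_sources`, depending on how strict the user wants the check to be.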

II-A4 Data Integrity Management

In order to ensure full reproducibility, one requires access to the same data. Since the collection, storage and dissemination of datasets is beyond the scope of the framework, it becomes necessary to have a way to ensure that one is in possession of the exact same data as the experiment to be reproduced. This feature has been highly requested by the community since the framework's initial release.

In DeepDIVA we implemented the verification through the use of a footprint file. The creation of this footprint is automated and happens immediately at the start of an experiment (if the file has not been generated before). The content of the file is a JSON tree which stores the entire dataset structure in great detail, i.e., it stores all file names and their SHA-1 hashes. Moreover, there is a global “last modified” tag which contains the most recent value for the entire folder (spanning every file contained in it recursively). This tag is used at run-time to verify whether the dataset has been modified since the footprint was generated. This check is very quick and hence does not affect the regular flow or run-time of an experiment. However, this type of verification, albeit quick, is not secure against malicious manipulation of the data, since a skilled attacker might modify the data in subtle ways [alberti2018tampering] and tamper with the time stamp on the file system too. To combat this (very remote) threat, it is possible to activate a deep inspection of the dataset integrity using the stored SHA-1 hashes. If this deep verification succeeds, one can be sure that there are no differences between the data on the file system and the dataset described by the footprint.
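The footprint mechanism can be illustrated with a short sketch. It is a simplified stand-in, not DeepDIVA's implementation: the function names and the dictionary layout are hypothetical, and the file entries are stored as a flat mapping from relative path to hash rather than as a nested tree.

```python
import hashlib
import json
import os


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def latest_mtime(root):
    """Most recent modification time of root and everything below it."""
    stamps = [os.path.getmtime(root)]
    for folder, subdirs, names in os.walk(root):
        stamps.extend(os.path.getmtime(os.path.join(folder, n))
                      for n in subdirs + names)
    return max(stamps)


def build_footprint(root):
    """Record every file's SHA-1 hash plus one global last-modified tag."""
    files = {}
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            files[os.path.relpath(path, root)] = sha1_of_file(path)
    return {"last_modified": latest_mtime(root), "files": files}


def save_footprint(footprint, path):
    """Store the footprint as JSON (outside the dataset folder)."""
    with open(path, "w") as handle:
        json.dump(footprint, handle, indent=2, sort_keys=True)


def quick_check(root, footprint):
    """Fast check: compares time stamps only, no file is read."""
    return latest_mtime(root) == footprint["last_modified"]


def deep_check(root, footprint):
    """Slow check: re-hashes every file and compares against the footprint."""
    return build_footprint(root)["files"] == footprint["files"]
```

The quick check costs one `stat` call per entry, whereas the deep check reads every byte of the dataset, which is why it makes sense to keep the latter optional.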

III Productivity Out-of-the-Box

Most researchers have established workflows which they are comfortable with. However, these workflows may not comply with best practices, and it can be quite difficult to change the way you do things in order to meet the ever-increasing best-practice ideals of the field. Therefore, we try to make the experience of using DeepDIVA for the first time quick, simple and painless, and allow researchers to be operational and productive as soon as possible.

DeepDIVA is very easy to set up on macOS and Ubuntu. Once the repository has been cloned, setting up a fully functional environment is a single bash script away. Once set up, the framework has boilerplate code for several different scenarios (as seen in Table II), notably: word spotting, image similarity matching, and classification of images, natural language, video and bi-dimensional data.

These templates cover many common scenarios that researchers often encounter, and as the scenarios are written in a modular fashion, they can be easily adapted or extended to other tasks in a quick and painless manner. This helps a researcher be quickly productive as implementing the boilerplate constitutes a significant part of the development process. Starting from an already implemented task and adapting it for a specific purpose allows researchers to avoid developing redundant code. The following sections recap several features that support a typical research workflow in DeepDIVA (introduced in [albertipondenkandath2018deepdiva]).


(a) Example of an accuracy comparison for two different training protocols (orange and pink) on a classification task. The visualization tool is not limited to two instances and allows for comparing an arbitrary number of instances. DeepDIVA automatically measures the batch-wise (not shown in the figure) and epoch-wise loss and accuracy and plots them in TensorBoard.


(b) In all experiments that involve randomness, it is useful to evaluate how it affects the results obtained. Since networks in deep learning are very often initialized with random weights, this is a common scenario. This figure shows an evaluation of how randomness affects execution by visualizing the aggregated results of multiple runs. The solid line represents the mean value, and the dotted lines are the highest and the lowest results obtained. Finally, the shaded area indicates the variance over all runs.


(c) Confusion matrices are a well-established tool for visualizing the performance of a system both in a binary and in a multi-class setting. Our framework produces a confusion matrix every time the model is validated and finally when it is tested. This figure shows a confusion matrix for a 4-class task, where a darker color indicates a higher number of samples classified as such. Ideally, the confusion matrix should show full color on the diagonal and white everywhere else.

(d) Visualizing features is a common step in a deep learning research pipeline, as it often provides insight into the model and/or the data one is working with. We integrated the native feature visualization of TensorBoard into the framework. Specifically, one can choose to use T-Distributed Stochastic Neighbor Embedding (T-SNE) [maaten2008visualizing] or Principal Component Analysis (PCA) to project the high-dimensional feature embedding into a 2D or 3D space. This figure shows an example of a T-SNE feature projection of CIFAR-10.

Fig. 1: Examples of different visualizations produced automatically by DeepDIVA. This is not an exhaustive list; many other visualizations are task-specific and might require a significant amount of context to be understood. All visualizations are available in real time as the training progresses. We believe this to be critically important, as it allows important decisions to be taken before the end of the experiments, thus saving a considerable amount of time. Credit for the figures to [albertipondenkandath2018deepdiva].

III-A Prepare Your Data

The first step in any machine learning task is acquiring and preparing the data. DeepDIVA comes equipped with some tools to support this task.

  • Download a dataset with a click: DeepDIVA supports downloading and preparing several datasets (CIFAR-10 [cifar], MNIST [mnist], SVHN [svhn], STL-10 [stl]) with a single command.

  • Split your dataset: The framework contains a script to split an arbitrary dataset (stored in a standard format) into classic machine learning splits.

  • Analyze the data: Computing the mean, standard deviation and class distribution for pre-processing the data is a standard operation. DeepDIVA provides scripts to compute these for large datasets in an online manner.

  • Ensure data integrity: DeepDIVA keeps track of your datasets to ensure that they are not modified. More details are given in Section II-A4.
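To illustrate what computing statistics "in an online manner" means for the analysis step in the list above, the following sketch uses Welford's algorithm, which updates the mean and standard deviation one value at a time without holding the dataset in memory. It is a generic textbook technique shown under assumptions; the class name `RunningStats` is hypothetical and the text does not state which algorithm DeepDIVA's scripts actually use.

```python
import math


class RunningStats:
    """Welford's online algorithm for mean and population standard deviation."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new value into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def std(self):
        return math.sqrt(self._m2 / self.count) if self.count else 0.0


stats = RunningStats()
# Stand-in for batches streamed from disk; the values are made up.
for batch in ([2.0, 4.0], [4.0, 4.0], [5.0, 5.0], [7.0, 9.0]):
    for value in batch:
        stats.update(value)
print(stats.mean, stats.std)  # approximately 5.0 and 2.0
```

For image data one would typically keep one such accumulator per color channel.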

Once your data has been downloaded or prepared in the correct format, DeepDIVA loads and pre-processes the datasets such that they can be used in the appropriate tasks.

III-B See What Your Network Thinks With 2D Data

During the research process for a new idea, you might want to try it out on a simple toy dataset before progressing to more complex datasets. To enable you to do so, DeepDIVA offers a workflow to test your idea on bi-dimensional data, which allows you to visualize exactly what your network thinks of the output space. DeepDIVA contains all the necessary code to perform such a visual analysis of the network, and all a user needs to do is to modify the task and implement their research idea.

III-C Real-Time Data Visualization

We use the TensorBoard application developed by TensorFlow [tensorflow] to aggregate all the visualizations produced by the framework. Normal training and validation curves are plotted directly, and all other visualizations produced by the framework are added to the images section of the corresponding experiment. DeepDIVA dynamically generates plots for executions (see Fig. 1(a)) and makes them available in TensorBoard, so that experiments with differing configurations can be compared, as well as the performance of two or more methods. The multi-run flag automatically reruns an experiment a given number of times and aggregates the results into a plot (see Fig. 1(b)). DeepDIVA also generates a confusion matrix during evaluation (see Fig. 1(c)).
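The aggregation behind a multi-run plot can be sketched in a few lines. This is an illustration under assumptions rather than the framework's code: the function `aggregate_runs` is hypothetical and the accuracy values are made up. It computes, per epoch, the mean, the extremes and the variance that such a plot displays.

```python
from statistics import mean, pvariance


def aggregate_runs(runs):
    """Aggregate per-epoch values over several runs of the same experiment.

    runs: list of equally long lists, one list of per-epoch accuracies per run.
    """
    summary = []
    for epoch_values in zip(*runs):
        summary.append({
            "mean": mean(epoch_values),
            "min": min(epoch_values),
            "max": max(epoch_values),
            "variance": pvariance(epoch_values),
        })
    return summary


runs = [
    [0.50, 0.70, 0.80],  # made-up accuracies, first seed
    [0.40, 0.60, 0.90],  # second seed
    [0.60, 0.80, 0.70],  # third seed
]
for epoch, stats in enumerate(aggregate_runs(runs)):
    print(epoch, stats)
```

Reporting the spread over seeds alongside the mean makes it easier to judge whether a difference between two methods exceeds the run-to-run noise.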

III-D Automatic Hyper-parameter Optimization

Instead of having to perform the tedious and time-consuming procedure of optimizing hyper-parameters by hand, the researcher can simply use a single command line parameter and let the framework deal with it thanks to SigOpt [sigopt] integration.
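To make the idea of automated hyper-parameter optimization concrete, the sketch below uses plain random search as a stand-in. It does not use or imitate the SigOpt API, which typically relies on more sample-efficient strategies. The function names, the search space and the toy objective are all hypothetical.

```python
import random


def random_search(objective, space, n_trials=50, seed=0):
    """Sample configurations uniformly and keep the best one (higher is better)."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {name: rng.uniform(low, high)
                  for name, (low, high) in space.items()}
        score = objective(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_objective(config):
    # Stand-in for "train a model and return the validation accuracy";
    # by construction it peaks at lr=0.1, momentum=0.9.
    return 1.0 - (config["lr"] - 0.1) ** 2 - (config["momentum"] - 0.9) ** 2


space = {"lr": (0.0, 1.0), "momentum": (0.0, 1.0)}
best_config, best_score = random_search(toy_objective, space, n_trials=200, seed=0)
print(best_config, best_score)
```

Because the search itself is seeded, the optimization run is reproducible in the same way as a single experiment.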

IV Be a Part Of It

Many of the available tools are extremely good at what they are designed for; however, they often have steep learning curves. Even during the setup phase, several tools expect a user to have the skill, time and patience to set up an environment manually. Indeed, the authors of this paper have even encountered GitHub repositories where the setup instructions are simply to install packages as you encounter errors. This often discourages the average user and significantly increases the time required to get to a productive stage.

Additionally, the quality of the documentation or tutorials (or lack thereof) determines the impact of a tool, no matter how effective it may be. When this is combined with stringent contribution guidelines, or the lack of an open-source nature, it can render a tool community-unfriendly. This is a major issue for the field, as the quality of a framework is measured not only by the quality of the results it delivers, but also by its maintenance, its learning curve and its adoption overhead.

To foster a friendly and productive community of researchers, we try to make DeepDIVA accessible by tackling the aforementioned problems as follows:

No Setup Time: DeepDIVA can be set up with a single bash script on both Ubuntu and macOS (see Section II-A).

Documentation: The framework is documented (see the documentation at link_redacted_for_blind_submission) such that it can be used in an educational environment for didactic purposes.

Tutorials: There is a friendly “Getting started” guide followed by a plethora of tutorials (see the tutorials at link_redacted_for_blind_submission) which will help a new user learn and use the available features efficiently. For example, there are tutorials on how to prepare the data, load it and run the implemented tasks (see Table II), as well as how to visualize the results. More experienced users can also find tutorials on how to extend the framework and perform advanced operations with it. These tutorials are not intended to teach someone machine learning, but rather how to use DeepDIVA to realize their ideas more effectively.

Fork It: DeepDIVA is built with the goal of being extensible and modular (see the repository at link_redacted_for_blind_submission). It is open-source and comes with verbose documentation such that the core code is accessible to everyone. It has been designed in a modular way which favors and encourages growth and modifications, in contrast with other solutions which optimize performance at the expense of maintenance. Moreover, since this is a collaborative project, additions suggested by users can be integrated, benefiting the community as a whole. This is not always possible, or can be difficult, with closed-source software, commercial solutions or impenetrable core code.

V Conclusion and Future Work

We contribute towards meeting the demands for reproducibility and openness in machine learning by providing DeepDIVA: an open-source Python deep-learning framework designed to enable quick and intuitive setup of reproducible experiments with a large range of useful analysis functionality. We show how researchers can quickly include it in their workflow (thanks to detailed documentation and easy tutorials), thus saving time while making their research reproducible in a quick and intuitive fashion. In the near future DeepDIVA will include more visualization tools, provided by the small (but thriving) community of developers which is forming around it.


The work presented in this paper has been partially supported by the HisDoc III project funded by the Swiss National Science Foundation with the grant number _.