Computational reproducibility of research is widely accepted as necessary for result validation, sharing and reuse [chen2018open, stodden2010scientific]. Dozens of standalone tools have been created to preserve research materials and thus enable reproducibility. Some of these tools help with documentation [kluyver2016jupyter], others track system dependencies and so allow cross-platform compatibility [boettiger2015introduction], and some even capture the whole analysis in an automated workflow [goecks2010galaxy]. A number of reproducibility tools are web-based with a user-friendly graphical interface [staubitz2016codeocean]. When employed in the research process, they are indeed able to solve the problems of reproducibility and allow reuse.
However, reproducibility-focused research is still not common in scientific communities, and the few researchers who use these tools remain an exception. There are several reasons why these tools, although useful, are not ubiquitously used. First, researchers typically work under constant pressure to present results and publish papers [baker2016there], which does not leave them sufficient time to employ reproducibility practices in their everyday work. In research groups, many computationally intensive tasks are often done by students or young scientists, who typically hold fixed-term positions and need to complete their studies and secure future jobs [trisovic2018data]. Second, there is often not enough training on best practices in data analysis in the social and (some) natural sciences, even though these disciplines also face an expansion of data and the need for computationally intensive analyses. Researchers in those fields are hence hesitant to use more software than they need to, including reproducibility tools. As a result, the tools are not normally used in everyday scientific work, which hinders research reproducibility and reuse.
In this paper, we argue that the best way to foster reproducibility is to integrate it within existing scientific software that is already in use. This way, using reproducibility tools becomes simple and fits seamlessly into researchers’ daily work. We demonstrate this through our work at LHCb, the high-energy physics experiment at CERN. According to a survey conducted at the experiment [trisovic2018data], most LHCb analysts use the official software to perform their physics analysis, fully or in part. This served as a starting point for addressing the reproducibility challenge. Our solution is designed and developed within the official experimental software to capture data provenance, which is then saved inside the output data file on the disk. The stored provenance allows understanding how a file was produced [pasquier2017if] and provides sufficient information to entirely reproduce the dataset, eliminating the need for the original input code or even documentation.
2 The LHCb software
The LHCb software is the essential component in all data-related activities at LHCb. It is used in a range of data processing environments, from real-time data collection in the experimental setup, data reconstruction and simulation to advanced physics analyses and collision visualization [cattaneo2001gaudi]. The LHCb software is based on the C++ object-oriented framework called Gaudi [barrand2001gaudi], which provides a common infrastructure and environment for the software applications of the experiment [corti2006software]. Gaudi was used during both the first and second run of the LHC and proved to be a reliable and flexible platform for LHCb data processing on the CERN Computing Grid [shiers2007worldwide]. The project was created in 1998 by the LHCb collaboration, but it has since been released as open source [gaudi-gitlab] and adopted by a number of other high-energy physics experiments, such as ATLAS [1742-6596-219-4-042006].
The software is organized in a modular architecture of smaller and more manageable packages. Each module has a defined functionality and an interface through which it interacts with the other components. Such architecture provides layers of abstraction for developers, meaning that one does not need to understand the whole framework to contribute to one of the components.
The basic modules of the Gaudi framework at run-time and the connections between them are shown in Figure 1. The Application manager (in code denoted as ApplicationMgr) controls the execution of the jobs within the framework. It creates and initializes the required modules in the system, and retrieves input data. The input data is a collection of highly structured information that describes particle collisions (also called events) recorded inside the detectors or created in simulations. The Application manager loops over the input data events and executes the algorithms. If there are any errors in the execution, the Application manager handles them, and at the end it terminates the job. The module Algorithms (Algorithms) has a central place in the job execution, as it performs data processing on all input data. At the end, the output is produced in the form of another data file or another type of output.
The Gaudi services, which are also initialized by the Application manager at the beginning of a job, provide various utilities for the Algorithms in the system. Normally, only one instance of a service is required in the job. There are a number of different services within the framework that can be used by the Algorithms, but some of the main ones are:
Event Data service (EventDataSvc) and Histogram service (HistogramDataSvc) that read and process individual collision events,
Detector Data service (DetDataSvc) for capturing detector data,
Message service (MessageSvc) logs progress or errors in the Algorithms and
Tool Service (ToolSvc) manages Algorithm tools, which are required during the Algorithm execution.
The Persistency services allow writing the output data on the disk as presented in the figure. Finally, there are many other services in the framework that provide specialized functions that can be enabled and disabled by the users. Each of the service classes is used by the Algorithms via an interface, which is a helper class that defines the functionality of a service through a number of public methods. These methods also allow the communication of the service with other components of the framework.
When using the Gaudi framework, the physics analysts specify job configurations in one or more python application configuration files. These configurations are passed by the Job Options Service (JobOptionsSvc) and applied at run-time. Every Algorithm used in the framework is configured using these options.
3 Implementation of the
The provenance tracking service, called Metadata service (MetaDataSvc) [url3, mdatasvc], is implemented as a Gaudi service inside the module Gaudi services (GaudiSvc). Its functionality is simple: collect information about a job and capture it in an object, which is then stored as metadata in the output data file. There are a number of LHCb data formats in which the output of a Gaudi job can be stored. The data formats most commonly used in high-energy physics analysis are based on ROOT [brun1997root]. The ROOT file format is very flexible and acts like a UNIX file directory, meaning that it can store directories and data objects organized in an arbitrary order [brun2007root]. Furthermore, it can store any C++ object, such as histograms, plots and tables. The metadata object, named info, is implemented as a dictionary (a std::map in C++), where the keys capture the names of application configurations, and the values capture their information.
The service is implemented in a C++ class with the following main methods that execute its workflow during the run-time:
isEnabled captures information whether the service is enabled in the job or not.
start initiates the service and calls collectData.
collectData executes the main functionality of the service. It traverses and queries the Gaudi tools (ToolSvc), services (Services), algorithms (Algorithms) and Job Options (JobOptionsSvc) to capture their configuration.
getMetaData returns the object that stores a dictionary of the job configurations.
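The workflow of these four methods can be illustrated with a minimal Python sketch. This is not the C++ implementation of MetaDataSvc: the class name MetaDataSvcSketch, the layout of the components mapping, the "Component.Property" key format and the example values are all assumptions made for illustration only.

```python
class MetaDataSvcSketch:
    """Illustrative stand-in for the Metadata service (not the Gaudi class)."""

    def __init__(self, components):
        # Hypothetical input: component name -> {property name: value}.
        self._components = components
        self._enabled = False
        self._info = {}

    def is_enabled(self):
        # Mirrors isEnabled: reports whether the service is active in the job.
        return self._enabled

    def start(self):
        # Mirrors start: activates the service and triggers the collection.
        self._enabled = True
        self.collect_data()

    def collect_data(self):
        # Mirrors collectData: traverse every component and record its
        # configuration as string key-value pairs (assumed key format).
        for name, properties in self._components.items():
            for prop, value in properties.items():
                self._info[f"{name}.{prop}"] = str(value)

    def get_metadata(self):
        # Mirrors getMetaData: return a copy of the collected dictionary.
        return dict(self._info)


# Made-up job configuration used only to exercise the sketch.
job = {
    "ApplicationMgr": {"EvtMax": 1000},
    "ToolSvc": {"OutputLevel": 3},
}
svc = MetaDataSvcSketch(job)
svc.start()
metadata = svc.get_metadata()
```

In this sketch all values are stored as strings, which keeps the dictionary uniform regardless of the property type; whether the real service does the same is not stated here.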
Within the Gaudi framework, there are three audit methods that follow a job execution. They are automatically invoked by the Application manager at the start of every job. Those methods are:
initialize that initializes algorithms and services, and applies job options,
execute that executes the main function of the job,
finalize that is called at the end of the job.
The Metadata service is called and initialized from the finalize method. This is because the initialize and execute methods run at a time when not all internal job configurations have been applied, hence calling the service from these methods would cause a loss of information. Therefore, the metadata is only captured once the components and configurations are assigned to the job, at the moment when the output ROOT file is written to the disk. The work of the service is complete once the metadata dictionary has also been saved into the file.
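The reasoning behind capturing the metadata in finalize can be illustrated with a toy model of the three phases. The sketch below is an assumption-laden simplification, not Gaudi code: the class JobSketch, the option names ("Alg.Cut", "Writer.Output") and the point at which each option is applied are hypothetical and chosen only to show that a snapshot taken early misses configuration applied later.

```python
class JobSketch:
    """Toy job whose configuration is only complete by the finalize phase."""

    def __init__(self):
        self.options = {}    # configuration applied so far
        self.snapshots = {}  # what a provenance capture would see per phase

    def _snapshot(self, phase):
        # Record a copy of the configuration visible at the start of a phase.
        self.snapshots[phase] = dict(self.options)

    def initialize(self):
        self._snapshot("initialize")
        # Hypothetical option applied while the job is being initialized.
        self.options["Alg.Cut"] = "pt > 500"

    def execute(self):
        self._snapshot("execute")
        # Hypothetical option that a component sets only during execution.
        self.options["Writer.Output"] = "tuple.root"

    def finalize(self):
        # By now every option has been applied, so the snapshot is complete.
        self._snapshot("finalize")

    def run(self):
        self.initialize()
        self.execute()
        self.finalize()


job = JobSketch()
job.run()
# Keys that a capture at initialize would have lost.
missing_at_initialize = set(job.snapshots["finalize"]) - set(job.snapshots["initialize"])
```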
4 Using the Provenance
Even though LHCb data analyses can be done using a wide range of tools and programming languages, retrieving the data from the Grid needs to be done using the LHCb software. Typically an LHCb application called DaVinci [lhcb2017davinci] is used in this step. DaVinci is a physics analysis application that is primarily used as a part of data processing to calculate a variety of kinematic quantities for recorded particles, but it is also used in fine data selections to extract particle decays of interest.
Capturing data provenance within the Gaudi framework is the simplest way for the analysts to conduct reproducible analysis. The provenance captured with the Metadata service is useful in a number of different scenarios:
The first scenario is directly linked to reproducing a dataset using the original application version. A common practice in using the LHCb software is to use the latest version available at the time. This is recommended because the latest version includes recent developments with new features or solutions to known bugs. However, by doing this the analysts do not necessarily record which application version they used, which may hinder reproducibility at a later stage, when the “latest” version changes. Even though the applications run on the same framework, two different versions can produce different datasets. For example, the applications DaVinci v39r0 and DaVinci v42r3 may produce slightly different outputs even when they use the same application configuration file.
Another common practice in conducting physics analyses is to use one application configuration file to produce a number of similar datasets with slight configuration changes. This is typical when, for example, an analyst wants to test different selections on the data. This practice produces a number of data files without clear information on how they were produced, and if an analyst mislabels them by mistake (for example, by marking the origin year of the data as 2012 instead of 2011), errors in the physics analysis may follow.
Furthermore, working with a number of collaborators on an analysis typically means sharing disk work space on a CERN or University server. Each of the collaborators creates temporary or derived data files, and often after some time they forget how and why these files were created. This hinders reproducibility as it conceals potential steps that were previously taken in the analysis workflow.
Finally, if a bug in one application version is identified, the analysts need to know whether they had created datasets using this version and whether it could negatively affect their analysis. If provenance of these datasets is available, they could instantly evaluate whether this is the case and recreate the datasets with another application version.
The Metadata dictionary provides clear information on which application version was used and how it was used, thus avoiding the ambiguity illustrated in these scenarios. Our solution can therefore be useful in the everyday development and validation of research.
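Two of the scenarios above, finding datasets produced with a given application version and spotting configuration differences between similar datasets, can be sketched with plain Python dictionaries. The key names ("Application", "Version", "DataType"), the file names and the helper functions are hypothetical; the actual keys stored in the info object may differ.

```python
def made_with(metadata, application, versions):
    """Return True if the dataset was produced by one of the given versions."""
    return (
        metadata.get("Application") == application
        and metadata.get("Version") in versions
    )


def diff_metadata(first, second):
    """Map every key whose value differs to its (first, second) value pair."""
    keys = sorted(set(first) | set(second))
    return {
        key: (first.get(key), second.get(key))
        for key in keys
        if first.get(key) != second.get(key)
    }


# Made-up provenance records for two output files.
datasets = {
    "tuple_a.root": {"Application": "DaVinci", "Version": "v39r0", "DataType": "2011"},
    "tuple_b.root": {"Application": "DaVinci", "Version": "v42r3", "DataType": "2012"},
}

# Suppose, hypothetically, that a problem were reported in DaVinci v39r0.
affected = [
    name for name, info in datasets.items()
    if made_with(info, "DaVinci", {"v39r0"})
]

# Which settings distinguish the two files?
changes = diff_metadata(datasets["tuple_a.root"], datasets["tuple_b.root"])
```

Here affected lists the files that would need to be recreated, and changes exposes a mislabelled data year directly from the stored provenance rather than from the file name.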
Enabling the service in python application configuration files is very simple, as it requires adding one line of code in the existing configuration:
ApplicationMgr().ExtSvc += [ 'Gaudi::MetaDataSvc' ]
This line of code assigns an external Metadata service to the Application manager. Once a python application configuration file is created, it is passed to Gaudi to run the jobs. A ROOT output dataset that captures data of a particle decay is shown in Figure 2. The dataset was created using the application DaVinci v42r3. This information and other configurations are captured in the additional info object, which is also saved and visible in the figure.
The info file can be read in two different ways. The first way is through the command line, by simply reading and printing the dictionary. The second way is to view the info file in a stand-alone provenance viewer, as shown in Figure 3. It is implemented as a pop-up window based on C++ and ROOT, which means that it requires the ROOT framework (and an input ROOT file) to run. The viewer uses a table to present the key-value pairs of metadata captured in the job. The viewer offers a user-friendly way to see the provenance, while the command-line approach is better when there is a need to reproduce the dataset.
When it is necessary to reproduce a ROOT dataset, the original job can be recreated from the information within the dataset. This is done by extracting the metadata from the dataset and saving it in a file as a “flat” list of options. This is essentially a sequence of options given line by line as they are read from the info file in a command line. The list of options can be saved as a python file, python pickle file or Linux configuration options file. These file formats are typically used for serializing python objects. Gaudi understands and processes the flat list in the same way as the original application configuration file. Furthermore, the original application configuration file is no longer needed. Even though DaVinci is most commonly used for physics analyses at LHCb, the Metadata service can be used for other applications within the framework that can process and produce ROOT datasets. The service was first released in Gaudi v27r1, meaning that it is available for use in physics analyses with DaVinci v40r0 onward.
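The extraction step can be sketched as follows, assuming the metadata has already been read from the info object into a Python dictionary. The "key = value" line syntax, the file names and the helper functions are assumptions for illustration; the exact option syntax that Gaudi expects is not specified here.

```python
import os
import pickle
import tempfile


def to_flat_options(metadata):
    """Return one 'key = value' line per metadata entry, sorted by key."""
    return [f"{key} = {value}" for key, value in sorted(metadata.items())]


def save_options(metadata, directory):
    """Write the options as a flat text file and as a pickle file."""
    text_path = os.path.join(directory, "options.txt")
    pickle_path = os.path.join(directory, "options.pkl")
    with open(text_path, "w") as handle:
        handle.write("\n".join(to_flat_options(metadata)) + "\n")
    with open(pickle_path, "wb") as handle:
        pickle.dump(metadata, handle)
    return text_path, pickle_path


# Made-up metadata standing in for the content of the info object.
metadata = {"DaVinci.DataType": "2012", "ApplicationMgr.EvtMax": "1000"}

with tempfile.TemporaryDirectory() as tmp:
    text_path, pickle_path = save_options(metadata, tmp)
    # Read both files back to confirm that nothing was lost.
    with open(text_path) as handle:
        flat_lines = handle.read().splitlines()
    with open(pickle_path, "rb") as handle:
        restored = pickle.load(handle)
```

Sorting the keys makes the flat list deterministic, so two extractions from the same dataset can be compared line by line.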
We introduce a new development in the LHCb software framework Gaudi that captures the provenance of a job and stores it directly within the output dataset. It facilitates reproducibility of research in a simple way, as it does not require installing third-party software but is integrated into the existing framework that is actively used by LHCb researchers. We demonstrate a number of scenarios in which the provenance service would be useful and present our implementation in the hope that it inspires other reproducibility-focused contributions that mesh seamlessly with researchers’ work in different disciplines.
Ana Trisovic acknowledges funding from the CERN Doctoral Student program. She is currently funded in part by the Sloan Foundation.