pylustrator: Code generation for reproducible figures for publication

10/01/2019 ∙ by Richard Gerum, et al. ∙ FAU

One major challenge in science is to make all results reproducible. Thus, along with the raw data, every step, from basic processing of the data and its evaluation to the generation of the figures, has to be documented as clearly as possible. While many programming libraries cover the basic processing and plotting steps (e.g. Matplotlib in Python), no library yet addresses the reproducible composition of single plots into meaningful figures for publication. Publishable figures are therefore still commonly assembled with image-processing or vector-drawing software, which reduces figure quality in the best case and leads to unwanted alterations of the presented data in the worst case. Pylustrator, an open-source library based on Matplotlib, aims to fill this gap by providing a tool that generates the code necessary to compose publication figures from single plots. It provides a graphical user interface in which the user can interactively compose the figures. All changes are tracked and converted to code that is automatically integrated into the calling script file. This software thus provides the missing link from raw data to the complete plot published in scientific journals and contributes to the transparency of the complete evaluation procedure.

1 Introduction

In recent years, more and more researchers have called attention to a growing "reproducibility crisis" [8]. An important factor that contributes to problems in reproducing results from published studies is the unavailability of the raw data from the original experiment, as well as the unavailability of the methods or the code used for the evaluation of the raw data [1]. One major step towards overcoming these shortcomings is the publication of all raw data together with a documented version of the code used for the evaluation [2]. Ideally, anyone interested could download the raw data and exactly reproduce the figures of the publication.

To address the issue of data availability, researchers are encouraged to provide their data in online repositories, e.g. Dryad [4]. However, this data is useless unless other scientists can follow the complete evaluation procedure, including all evaluation and visualisation steps. The best way to achieve this is to provide complete, well-documented evaluation code that covers all important steps, from basic artifact corrections to the final plot to be published. Open-source scripting languages such as Python [11] or R [7] are ideal for such code, as open-source languages are accessible to everyone. Furthermore, interpreted languages do not need to be compiled and therefore pose fewer obstacles for users who want to run the code. The last part of the evaluation of the data is the visualisation, which is crucial for communicating the results [10]. This paper deals with the visualisation step, which consists of two parts: the generation of simple plots from data and the composition of meaningful figures from these plots.

The first part, generating the plots that form the building blocks of figures, is already covered by various toolkits, e.g. Matplotlib [6], Bokeh [3] or Seaborn [12]. For the second part, generating reproducible figures from simple plots, no convenient toolkit is yet available. Matplotlib already offers figures composed of multiple subplots, but generating a complete figure ready for publication requires a lot of code for all formatting, annotation and styling commands. This approach is therefore often not followed, as it is impractical for real applications. Users often prefer graphical tools such as image manipulation or vector drawing software, e.g. GIMP [5] or Inkscape [9]. These offer great flexibility, but cannot provide a reproducible way of generating figures and bear the danger of accidentally changing data points. It is also important to note that with such software, any small change to the evaluation requires re-editing the figure by hand, a process that slows down the creation of figures and is prone to errors.

2 Algorithm and Exemplary Results

Pylustrator was developed to address this issue. It fills the gap between single plots and complete figures with a code generation algorithm that turns user input into Python code for reproducible figure assembly (Fig. 1). Small changes to the evaluation or new data only require running the code again to update the figure.

Figure 1: Example of how code for composing a figure can be generated with pylustrator.

Using pylustrator in any Python file that uses Matplotlib to plot data requires the addition of only two lines of code:

    import pylustrator
    pylustrator.start()

The Matplotlib figure is then displayed in an interactive window (Fig. 2) when the plt.show() command is called. In this interactive window, pylustrator enables the user to:

  • resize and position plots by mouse-dragging

  • adjust the position of plot legends

  • align elements easily by automatic "snapping"

  • resize the complete figure in cm/inches

  • add text and annotations, and change their style and color

  • adjust plot ticks and tick labels

Figure 2: The interface of pylustrator. The user can view the elements of the plot, edit their properties, edit them in the plot preview and experiment with different color schemes.

Pylustrator tracks all these changes in order to translate them into Python code. Every change is split into four parts: the command object, the command text, the target object and the target command. The command object is the object instance (e.g. the Axes object) whose method is called to apply the change, and the command text is the method's name together with its parameters (e.g. ".annotate('New Annotation')"). The target object is the object that is affected by the command. In most cases this is the same as the command object, but when the command creates a new child object, the target object is that child object. The target command is the method's name without the parameters.
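As an illustration, one tracked change could be represented as a small record with these four parts. This is a minimal sketch and not pylustrator's internal data structure: the class name Change, its field names, and the choice to store objects as Python expression strings are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One tracked edit, split into the four parts named in the text.

    Hypothetical structure for illustration; objects are stored as
    Python expression strings, which is an assumption of this sketch.
    """
    command_object: str  # expression for the object whose method is called
    command_text: str    # method name together with its parameters
    target_object: str   # expression for the object affected by the call
    target_command: str  # method name without the parameters

    def to_code(self) -> str:
        # The emitted line is the command object followed by the command text.
        return self.command_object + self.command_text


# A plain property change: command object and target object coincide.
move = Change(
    command_object="plt.figure(1).axes[0]",
    command_text=".set_position([0.1, 0.1, 0.8, 0.8])",
    target_object="plt.figure(1).axes[0]",
    target_command=".set_position",
)

# Creating a child: the method is called on the axes, but the affected
# object is the newly created text element.
annotate = Change(
    command_object="plt.figure(1).axes[0]",
    command_text=".annotate('New Annotation', (0.5, 0.5))",
    target_object="plt.figure(1).axes[0].texts[0]",
    target_command=".annotate",
)

print(move.to_code())
print(annotate.to_code())
```

The second record shows the case in which the two objects differ: the code line is built from the command object, while the change is attributed to the child.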

Command objects are "serialized" by iteratively going up the parent-child tree, e.g. from a text to the axes to the figure, and generating a Python expression from this dependency (e.g. "plt.figure(1).axes[0].texts[0]", the first text of the first axes of figure 1). When saving, pylustrator introspects its current stack to find the line of code from which it was called and inserts the automatically generated code directly before the command calling pylustrator.
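Both steps can be sketched with the standard library alone. The classes Node and Figure are hypothetical stand-ins for Matplotlib artists, and serialize, find_call_site, fake_show and insert_before are illustrative names, not pylustrator functions. The sketch only shows the idea of walking up the parent chain and of locating the calling line through the stack.

```python
import inspect


class Node:
    """Hypothetical stand-in for a Matplotlib artist in a parent-child tree."""

    def __init__(self, parent=None, container=None):
        self.parent = parent
        self.children = {}  # container name (e.g. "axes") -> list of children
        if parent is not None:
            parent.children.setdefault(container, []).append(self)


class Figure(Node):
    """Hypothetical root of the tree, identified by its figure number."""

    def __init__(self, number):
        super().__init__()
        self.number = number


def serialize(node):
    """Walk up from node to its figure and build an access expression."""
    parts = []
    while node.parent is not None:
        # Find the container of the parent that holds this node, and its index.
        for name, items in node.parent.children.items():
            matches = [i for i, item in enumerate(items) if item is node]
            if matches:
                parts.append(f".{name}[{matches[0]}]")
                break
        node = node.parent
    # node is now the root figure; the parts were collected child-first.
    return f"plt.figure({node.number})" + "".join(reversed(parts))


def find_call_site():
    """Return (filename, line number) of the code that called our caller."""
    caller = inspect.stack()[2]
    return caller.filename, caller.lineno


def fake_show():
    """Stand-in for the call that triggers saving; reports who called it."""
    return find_call_site()


def insert_before(lines, lineno, generated):
    """Insert the generated lines directly before the 1-based line lineno."""
    index = lineno - 1
    return lines[:index] + generated + lines[index:]


figure = Figure(1)
first_axes = Node(figure, "axes")
second_axes = Node(figure, "axes")
label = Node(second_axes, "texts")
print(serialize(label))

script = ["import pylustrator", "pylustrator.start()", "plt.plot([1, 2])", "plt.show()"]
generated = [serialize(first_axes) + ".set_xlim(0, 1)"]
print("\n".join(insert_before(script, 4, generated)))
```

In this toy tree the text element belongs to the second axes, so its expression is plt.figure(1).axes[1].texts[0]. In a real script, the line number passed to insert_before would come from a lookup such as find_call_site rather than being hard-coded.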

When loading a file with automatically generated code, pylustrator splits each automatically generated line into a reference object and a reference command. If both the reference object and the reference command of a new change match those of a previous change, the previous change is overwritten. This ensures that previously generated code is loaded correctly and that saving the same figure multiple times generates the code only once.
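This overwrite rule can be sketched as a dictionary keyed by the pair of reference object and reference command. The sketch assumes that every generated line consists of an object expression followed by one final method call, and it takes the expression before that call as the reference object. The helpers split_line and merge are hypothetical names, and parentheses inside string literals are not handled.

```python
def split_line(line):
    """Split 'obj.method(args)' into ('obj', '.method')."""
    line = line.strip()
    if not line.endswith(")"):
        raise ValueError(f"not a method call: {line!r}")
    depth = 0
    # Scan backwards to find the parenthesis that opens the final call.
    for pos in range(len(line) - 1, -1, -1):
        if line[pos] == ")":
            depth += 1
        elif line[pos] == "(":
            depth -= 1
            if depth == 0:
                break
    else:
        raise ValueError(f"unbalanced parentheses: {line!r}")
    dot = line.rfind(".", 0, pos)
    if dot < 0:
        raise ValueError(f"no object before the call: {line!r}")
    return line[:dot], line[dot:pos]


def merge(previous_lines, new_lines):
    """Merge changes; a new line replaces an old one with the same key."""
    changes = {}
    for line in previous_lines + new_lines:
        # Later entries overwrite earlier ones but keep their position.
        changes[split_line(line)] = line
    return list(changes.values())


previous = [
    "plt.figure(1).axes[0].set_xlim(0, 1)",
    "plt.figure(1).axes[0].set_ylim(0, 2)",
]
new = ["plt.figure(1).axes[0].set_xlim(0, 5)"]
for line in merge(previous, new):
    print(line)
```

Because the dictionary keeps one entry per key, merging a list of changes with itself returns the same list, which mirrors the statement that saving the same figure repeatedly generates the code only once.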

It is important to note that the automatically generated code relies only on Matplotlib and no longer needs the pylustrator package. Thus, the pylustrator import can later be removed so that the code can be shared without introducing an additional dependency.

The documentation of pylustrator can be found at https://pylustrator.readthedocs.org.

3 Conclusion

This study introduces a novel method for creating publishable figures from single plots, based on an open-source Python library called pylustrator. The figures can be arranged by drag and drop, and the pylustrator library produces the corresponding code. This library thus provides a valuable contribution towards tackling the reproducibility crisis.

4 Acknowledgements

We acknowledge testing, support and feedback from Christoph Mark, Sebastian Richter, and Achim Schilling, and thank Ronny Reimann for the design of the pylustrator logo.

References

  • [1] M. Baker and D. Penny (2016-05) Is there a reproducibility crisis? Vol. 533. ISSN 14764687.
  • [2] M. Baker (2016) Why scientists must share their research code. Nature News.
  • [3] Bokeh Development Team (2019) Bokeh: Python library for interactive visualization.
  • [4] Dryad. https://datadryad.org. Accessed: 2019-09-19.
  • [5] GIMP Development Team (2019) GIMP: GNU Image Manipulation Program.
  • [6] J. D. Hunter (2007) Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 9 (3), pp. 90–95.
  • [7] R Core Team (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
  • [8] F. Sayre and A. Riegelman (2018) The reproducibility crisis and academic libraries. College & Research Libraries 79 (1), pp. 2. ISSN 2150-6701.
  • [9] The Inkscape Team (2019) Inkscape.
  • [10] E. Tufte (1983) The visual display of quantitative information. Graphics Press, Cheshire, Connecticut.
  • [11] G. Van Rossum and F. L. Drake Jr (1995) Python tutorial. Centrum voor Wiskunde en Informatica Amsterdam, The Netherlands.
  • [12] M. Waskom, O. Botvinnik, D. O’Kane, P. Hobson, S. Lukauskas, D. C. Gemperline, T. Augspurger, Y. Halchenko, J. B. Cole, J. Warmenhoven, J. de Ruiter, C. Pye, S. Hoyer, J. Vanderplas, S. Villalba, G. Kunter, E. Quintero, P. Bachant, M. Martin, K. Meyer, A. Miles, Y. Ram, T. Yarkoni, M. L. Williams, C. Evans, C. Fitzgerald, Brian, C. Fonnesbeck, A. Lee, and A. Qalieh (2017-09) mwaskom/seaborn: v0.8.1 (September 2017).