DLPaper2Code: Auto-generation of Code from Deep Learning Research Papers

by Akshay Sethi, et al.

With an abundance of research papers in deep learning, reproducibility or adoption of the existing works becomes a challenge. This is due to the lack of open source implementations provided by the authors. Further, re-implementing research papers in a different library is a daunting task. To address these challenges, we propose a novel extensible approach, DLPaper2Code, to extract and understand deep learning design flow diagrams and tables available in a research paper and convert them to an abstract computational graph. The extracted computational graph is then converted into execution ready source code in both Keras and Caffe, in real-time. An arXiv-like website is created where the automatically generated designs are made publicly available for 5,000 research papers. The generated designs can be rated and edited using an intuitive drag-and-drop UI framework in a crowdsourced manner. To evaluate our approach, we create a simulated dataset with over 216,000 valid design visualizations using a manually defined grammar. Experiments on the simulated dataset show that the proposed framework provides more than 93% accuracy in flow diagram content extraction.





The growth of deep learning (DL) in the field of artificial intelligence has been astounding in the last decade, with a large number of research papers being published since 2016 (https://scholar.google.co.in/scholar?as_sdt=1,5&q=%22deep+learning%22&hl=en&as_ylo=2016&as_vis=1). Keeping up with the growing literature has been a real struggle for researchers and practitioners. At NIPS, one of the recent AI conferences, the largest number of submitted papers were on the topic “Deep Learning or Neural Networks”. However, a majority of these research papers are not accompanied by their corresponding implementations: only a minority of the NIPS papers made their source implementation available (https://www.kaggle.com/benhamner/nips-papers). Implementing a research paper takes at least a few days of effort for a software engineer, assuming limited knowledge of DL [Sankaran et al.2011].

Another major challenge is the availability of various libraries in multiple programming languages to implement DL algorithms, such as TensorFlow [Abadi et al.2016], Theano [Bastien et al.2012], Caffe [Jia et al.2014], Torch [et al2011], MXNet [Chen2015], DL4J [Gibson2015], CNTK [Seide and Agarwal2016], and wrappers such as Keras [Chollet and others2015], Lasagne [Dieleman2015], and PyTorch [Chintala2016]. The public implementations of DL research papers are spread across these libraries, which offer very little interoperability or communication among them. Consider the use case of a researcher working on “image captioning”, where three of the most highly cited research papers for the problem (https://competitions.codalab.org/competitions/3221#results) are:

  1. Show and Tell [Vinyals et al.2015]: Original implementation available in Theano; https://github.com/kelvinxu/arctic-captions

  2. NeuralTalk2 [Karpathy and Fei-Fei2015]: Original implementation available in Torch; https://github.com/karpathy/neuraltalk2

  3. LRCN [Donahue et al.2015]: Original implementation available in Caffe; http://jeffdonahue.com/lrcn/

As the implementations are available in different libraries, a researcher cannot directly combine the models. Also, for a practitioner whose remaining code base is in Java (DL4J), directly leveraging any of these public implementations would be daunting. Thus, we highlight two highly overlooked challenges in DL:

  1. Lack of public implementation available for existing research works and thus, the time incurred in reproducing their results

  2. Existing implementations are confined to a single (or few) libraries limiting portability into other popular libraries for DL implementation.

Figure 1: The architecture of the proposed creative system that extracts and understands the flow diagram from a deep learning research paper and generates execution ready deep learning code in two different platforms: Keras and Caffe.

We observed that most research papers explain the DL model design either through a figure or a table. Thus, in this research we propose a novel algorithm that automatically parses a research paper to extract the described DL model design. The design is represented as an abstract computational graph which is independent of the implementation library or language. Finally, the source code is generated in multiple libraries from this abstract computational graph of the DL design. The results are shown by automatically generating the source code of arXiv papers in both Caffe (prototxt) and Keras (Python). However, evaluating the generated source code is difficult due to the lack of ground truth. To overcome this challenge, we simulated a large image dataset of valid DL model designs in both Caffe and Keras. To generate DL visualizations, we manually defined a grammar for DL models. As these visualizations are highly varying, they are comparable to the figures present in research papers. Thus, the major research contributions are:

  1. a technique to automatically understand a DL model design by parsing the figures and tables in a research paper,

  2. generation of source code in both Keras and Caffe from the abstract computational graph of a DL design,

  3. automatic generation of designs for arXiv papers and a UI system for editing them in a crowdsourced way,

  4. an evaluation of the proposed approach on a simulated dataset of DL model visualizations, built using a manually defined grammar, on which it achieves more than 93% accuracy.

The rest of the paper is organized as follows: Section 2 explains the entire proposed approach for auto-generation of DL source code from a research paper, Section 3 describes the simulated dataset and the experimental performance of the individual components of the proposed approach on it, Section 4 discusses the experimental results on arXiv DL papers, and Section 5 concludes this paper with some discussion on our future efforts.

Proposed Approach

Consider a state-of-the-art DL paper [Szegedy et al.2017] published in AAAI 2017, which explains the DL model design through figures, as shown in Figure 3(a). Similarly, in the paper by [Parkhi et al.2015], the DL model design is explained using a table. Thus, given the PDF of a research paper in deep learning, the proposed DLPaper2Code architecture consists of five major steps, as shown in Figure 1: (i) extract all the figures and tables from the research paper; figure and table content are handled independently, although they follow a similar pipeline, (ii) as there could be other descriptive images and results tables in a paper, classify whether each figure or table contains a DL model design, and perform a fine-grained classification of the type of figure or table used to describe the DL model, (iii) extract the flow and the text information from the figures and tables, independently, (iv) construct an abstract computational graph which is independent of the implementation library, and (v) generate source code in both Caffe and Keras from the computational graph.

Figure 2: Characterizing the DL model designs available in research papers and grouping them into five different categories.

Characterizing Research Papers

We observed that in a research paper the DL design is mostly explained using figures or tables. Thus, it is our assertion that by parsing the figure, as an image, and the table content, the respective novel DL design can be obtained. The primary challenge with the figures in research papers is that DL design figures typically do not follow any standard convention and show extreme variations. Similarly, tables can have different structures and can contain different kinds of information.

We manually observed images from research papers and characterized the DL model design images into five broad categories, as shown in Figure 2. The five types are: (i) Neurons plot: the classical representation of a neural network with each layer having circular nodes inside it, (ii) 2D Box: each hidden layer is represented as a 2D rectangular box, (iii) Stacked2D Box: each layer is represented as a stack of 2D rectangular boxes, describing the depth of the layer, (iv) 3D Box: each hidden layer is represented as a 3D cuboid structure, and (v) Pipeline plot: along with the DL model design, the entire pipeline and, typically, some intermediate image or text results are shown as well. Similarly, based on the representation, tables can be classified as (i) row-major table: where the DL model design flows along the rows [Springenberg et al.2014], and (ii) column-major table: where the DL model design flows along the columns [Parkhi et al.2015]. It is essential to account for these variations in the proposed pipeline, as they indicate the DL design flow represented in the paper. Following this assumption, the proposed approach cannot identify a DL design flow that appears neither in a table nor in a figure.

Extracting Figures and Tables

Extracting visual figures from a PDF document, especially a scholarly report, is a well studied problem [Choudhury and Giles2015]. Common challenges include extracting vector images, as they are embedded in the PDF document, and extracting a large figure as a whole instead of as multiple figures of its components. To this end, we have used a publicly available tool called PDFFigures 2.0 (https://github.com/allenai/pdffigures2) [Clark and Divvala2016] for extracting a list of figures from a scholarly paper. However, none of the existing open source tools maintain the table structure that is essential for us. Thus, we built a PDF table extraction tool using PDFMiner (https://euske.github.io/pdfminer/) and Poppler-utils (https://poppler.freedesktop.org/). Poppler-utils provides high level information about the document, such as the text dump, while PDFMiner provides certain low level document details, such as vertical line spacing. The table structure, along with the table caption, is retrieved by building heuristics over the horizontal and vertical line spacing.
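The spacing heuristic can be illustrated with a minimal sketch. It assumes the text fragments of a table have already been extracted as hypothetical `(x, y, text)` tuples with `y` increasing down the page; the function name `group_into_rows` and the tolerance value are illustrative and are not taken from the tool described above.

```python
def group_into_rows(words, y_tolerance=3.0):
    """Group (x, y, text) fragments into table rows using vertical spacing.

    Fragments whose y coordinate lies within `y_tolerance` of the current
    row are treated as cells of that row (an assumed heuristic).
    """
    rows = []
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order the cells from left to right.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


fragments = [(10, 100, "conv1"), (60, 101, "3x3"),
             (10, 120, "pool1"), (60, 119, "2x2")]
table = group_into_rows(fragments)
```

A similar pass over the x coordinates would recover column boundaries; it is omitted here to keep the sketch short.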

Figure and Table Classification

The aim is to classify and retrieve only those figures and tables in a research paper that contain a DL design flow. Further, a fine-grained classifier is required to classify the figure into one of the five identified broad categories and to classify the table as a row-major or column-major flow.

In the case of figures, the classifier is trained to perform the prediction using the architecture shape and the flow. For example, figures showing result graphs or sample images from a dataset have a different shape compared to an architecture flow diagram. All the figures are resized to a fixed size, and features (fc2) are extracted from a fully connected layer of a popular deep learning model, VGG19 [Simonyan and Zisserman2014], pre-trained on the ImageNet dataset. We have two classification levels: (i) Coarse classifier: a binary neural network (NNet) classifier trained on fc2 features to classify whether the figure contains a DL model or not, and (ii) Fine-grained classifier: a five-class neural network classifier trained on fc2 features to identify the type of DL design, applied only to those figures classified positive by the coarse classifier. Having a sequence of two classifiers provided better performance than a single classifier with six classes (the sixth class being no DL design flow).

In the case of tables, a bag-of-words model is built using keywords from the caption text as well as the table text. A cosine distance based classifier is used to identify whether the given table contains a DL design flow, as opposed to tables containing results. Further, based on the number of rows and columns in the table, as extracted in the previous section, the table is classified as a row-major or column-major flow.
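As a rough illustration of the table classifier, the sketch below builds a bag-of-words vector over a small keyword vocabulary and compares it, by cosine similarity, with two prototype vectors. The keyword lists, the prototype construction, and the row/column rule in `table_orientation` are assumptions made for this example; the paper does not specify them.

```python
import math
import re
from collections import Counter

# Hypothetical keyword lists; the actual vocabulary is not given in the paper.
DESIGN_KEYWORDS = ["layer", "conv", "pool", "stride", "filter", "fc", "relu", "dropout"]
RESULT_KEYWORDS = ["accuracy", "error", "method", "results", "precision", "baseline"]
VOCABULARY = DESIGN_KEYWORDS + RESULT_KEYWORDS
DESIGN_PROTOTYPE = [1] * len(DESIGN_KEYWORDS) + [0] * len(RESULT_KEYWORDS)
RESULT_PROTOTYPE = [0] * len(DESIGN_KEYWORDS) + [1] * len(RESULT_KEYWORDS)


def bag_of_words(text):
    """Count vocabulary keywords in the text (digits are dropped, so conv1 -> conv)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in VOCABULARY]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def has_design_flow(caption, body):
    """True if the table is closer to the design prototype than to the results one."""
    vector = bag_of_words(caption + " " + body)
    return cosine_similarity(vector, DESIGN_PROTOTYPE) > cosine_similarity(vector, RESULT_PROTOTYPE)


def table_orientation(n_rows, n_cols):
    """Assumed rule: more rows than columns suggests one layer per row."""
    return "row-major" if n_rows >= n_cols else "column-major"
```

For example, `has_design_flow("Table 1: Network configuration", "layer conv1 filter 64 stride 1")` evaluates to `True`, while a caption and body made of result-style words evaluate to `False`.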

Content Extraction from Figure

Content extraction from figures has two major steps: (i) flow detection to identify the nodes and the edges, and (ii) OCR to extract the flow content. Identifying the flow is the main challenge, as there is a huge variation in the types of DL design flow diagrams. In this section, we explain the details of the approach for the 2D Box type, as shown in Figure 3; a similar approach could be extended to the other types as well. Flow detection involves identifying the nodes first, followed by the edges connecting the nodes. As the images are usually of high resolution and quality, they are directly binarized using adaptive Gaussian thresholding, and a Canny edge detection approach is used to identify all the lines. An iterative region growing algorithm is adopted to identify closed contours in the figure, as they represent the nodes, as shown in Figure 3(b). All the detected nodes are masked out from the figure and the contour detection algorithm is applied again to detect the edges, as shown in Figure 3(d). The direction of the edge flow is obtained by analyzing the pixel distribution within each edge contour. The node and edge contours are then sorted based on location and direction to obtain the flow of the entire DL model design. As shown in Figure 3, the proposed approach can also handle branching and forking in a design flow diagram.
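The final sorting step can be sketched as a topological ordering of the detected nodes. The sketch assumes that node detection has produced hypothetical bounding boxes `(x, y, w, h)` and that edge direction analysis has produced `(source, target)` pairs; ties are broken by position, top to bottom and then left to right. This is an illustration of the idea, not the exact procedure used in the paper.

```python
import heapq


def order_flow(node_boxes, edges):
    """Order nodes along the directed edges, breaking ties by (y, x) position.

    node_boxes: {node_id: (x, y, w, h)}; edges: list of (source_id, target_id).
    """
    successors = {node: [] for node in node_boxes}
    indegree = {node: 0 for node in node_boxes}
    for source, target in edges:
        successors[source].append(target)
        indegree[target] += 1

    def position(node):
        x, y, _, _ = node_boxes[node]
        return (y, x, node)

    ready = [position(node) for node in node_boxes if indegree[node] == 0]
    heapq.heapify(ready)
    ordered = []
    while ready:
        _, _, node = heapq.heappop(ready)
        ordered.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, position(nxt))
    if len(ordered) != len(node_boxes):
        raise ValueError("cycle detected in the extracted flow")
    return ordered


boxes = {"input": (50, 0, 40, 20), "conv_a": (10, 40, 40, 20),
         "conv_b": (90, 40, 40, 20), "concat": (50, 80, 40, 20)}
arrows = [("input", "conv_a"), ("input", "conv_b"),
          ("conv_a", "concat"), ("conv_b", "concat")]
flow = order_flow(boxes, arrows)
```

Because the ordering follows edges and does not assume a single chain, a fork followed by a merge, as in the example, is handled naturally.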

Once the flow is extracted, the text in each node/layer is obtained through OCR using Tesseract (https://github.com/tesseract-ocr/). Based on our manual observation, we assume that a layer description is available within the detected node. A dictionary of possible DL layer names is created to perform spell correction of the extracted OCR text.
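Dictionary-based spell correction of OCR output can be sketched with Python's standard `difflib` module. The layer-name dictionary and the similarity cutoff below are assumptions chosen for illustration; the paper does not list its dictionary.

```python
import difflib

# Hypothetical dictionary of layer names.
LAYER_NAMES = ["Conv2D", "MaxPooling2D", "AveragePooling2D", "Dense", "Flatten",
               "Dropout", "Concatenate", "Embedding", "SimpleRNN", "LSTM"]


def correct_layer_name(ocr_text, cutoff=0.6):
    """Return the closest known layer name, or None if nothing is similar enough."""
    lowered = {name.lower(): name for name in LAYER_NAMES}
    matches = difflib.get_close_matches(
        ocr_text.strip().lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None
```

With this sketch, a misread such as `"Conv20"` maps back to `"Conv2D"`, while text that resembles no known layer name yields `None`, so the node can be flagged rather than silently mislabeled.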

Figure 3: Illustration of the proposed flow detection approach from complex figures [Szegedy et al.2017] (AAAI 2017) involving (b) node/ layer detection, and (d) edge detection.
Figure 4: An example table showing the DL design flow as explained in tabular format in [Parkhi et al.2015].

Content Extraction from Table

In a row major table, every row corresponds to a layer in the DL design flow, as shown in Figure 4. Similarly, in a column major table, every column corresponds to a layer along with other parameters of the layer. The layer name is extracted by matching it with a manually created dictionary. Further, the parameters are extracted by mapping the corresponding row or column header with a pre-defined list of parameter names corresponding to the layer. Thus, sequentially the entire DL design flow is extracted from a table.
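A minimal sketch of the row-major case follows. The layer dictionary, the per-layer header mappings, and the cell format (plain integers, with `-` for empty cells) are all assumptions made for this example.

```python
import re

# Hypothetical dictionary mapping name prefixes found in tables to layer types.
LAYER_DICTIONARY = {"conv": "Conv2D", "pool": "MaxPool", "fc": "Dense", "relu": "Activation"}

# Hypothetical per-layer mapping from column headers to parameter names.
PARAMS_BY_LAYER = {
    "Conv2D": {"filter size": "filter_size", "#filters": "filters", "stride": "stride"},
    "MaxPool": {"filter size": "pool_size", "stride": "stride"},
    "Dense": {"#filters": "units"},
}


def match_layer(cell):
    """Match a cell such as 'conv1' against the dictionary, longest prefix first."""
    key = re.sub(r"[^a-z]", "", cell.lower())
    for prefix in sorted(LAYER_DICTIONARY, key=len, reverse=True):
        if key.startswith(prefix):
            return LAYER_DICTIONARY[prefix]
    return None


def parse_row_major_table(header, rows):
    """Turn each recognized row into a layer with its mapped parameters."""
    layers = []
    for row in rows:
        layer_type = match_layer(row[0])
        if layer_type is None:
            continue  # not a layer row (for example, a sub-heading)
        mapping = PARAMS_BY_LAYER.get(layer_type, {})
        params = {}
        for column_name, cell in zip(header[1:], row[1:]):
            param = mapping.get(column_name.strip().lower())
            if param is not None and cell.strip() not in ("", "-"):
                params[param] = int(cell)
        layers.append({"type": layer_type, "params": params})
    return layers


header = ["layer", "filter size", "#filters", "stride"]
rows = [["conv1", "3", "64", "1"], ["relu1", "-", "-", "-"],
        ["pool1", "2", "-", "2"], ["fc6", "-", "4096", "-"]]
layers = parse_row_major_table(header, rows)
```

A column-major table can be handled by transposing it first, after which the same row-wise logic applies.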

Figure 5: An illustration for a Pooling2D layer showing the rule base of the inference engine, converting the abstract JSON format into Caffe protobuf and Keras python code.

Generating Source Code

Overall, after detecting the DL design flow, an abstract computational graph is represented in JSON format, as shown in Figure 5. Two rule based converters are written to convert the abstract computational graph extracted in the previous step to either (i) Keras code (Python) or (ii) Caffe protobuf format (prototxt). An inference engine acts as the converter that maps the abstract computational graph to the grammar of the target platform. The inference engine consists of a comprehensive list of templates and dictionaries built manually for both Keras and Caffe. Template based structures transfer each component from the abstract representation to a platform specific structure using the dictionary mappings. Another set of templates, consisting of a set of assertions, is designed for translating each layer’s parameters. The inference engine is also highly flexible, allowing easy extension and addition of new layer definitions. An example of the inference engine’s dictionary mapping for a Pooling2D layer is shown in Figure 5.

Thus for a given research paper, by processing both the figure and table content, the DL model design flow is obtained which is converted to execution ready code in both Keras and Caffe.
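The template-and-dictionary idea can be sketched for a single pooling node. The abstract node schema (`name`, `type`, `bottom`, `params`) is hypothetical and only loosely follows the description of Figure 5; the generated strings use the standard Keras `MaxPooling2D`/`AveragePooling2D` constructors and the Caffe `Pooling` layer with `pooling_param`.

```python
# Dictionary mapping the abstract pooling function to a Keras class name.
KERAS_POOL_CLASS = {"MAX": "MaxPooling2D", "AVE": "AveragePooling2D"}

KERAS_TEMPLATE = (
    "{cls}(pool_size=({k}, {k}), strides=({s}, {s}), "
    "padding='{pad}', name='{name}')"
)

CAFFE_TEMPLATE = """layer {{
  name: "{name}"
  type: "Pooling"
  bottom: "{bottom}"
  top: "{name}"
  pooling_param {{
    pool: {pool}
    kernel_size: {k}
    stride: {s}
  }}
}}"""


def check_pooling(node):
    """Assertions guarding the translation of a pooling node's parameters."""
    assert node["type"] == "Pooling2D", "not a pooling node"
    params = node["params"]
    assert params["function"] in KERAS_POOL_CLASS, "unknown pooling function"
    assert params["kernel_size"] > 0 and params["stride"] > 0
    return params


def to_keras(node):
    p = check_pooling(node)
    return KERAS_TEMPLATE.format(cls=KERAS_POOL_CLASS[p["function"]],
                                 k=p["kernel_size"], s=p["stride"],
                                 pad=p["padding"], name=node["name"])


def to_caffe(node):
    p = check_pooling(node)
    return CAFFE_TEMPLATE.format(name=node["name"], bottom=node["bottom"],
                                 pool=p["function"], k=p["kernel_size"], s=p["stride"])


# Hypothetical abstract node, as it might appear in the JSON graph.
node = {"name": "pool1", "type": "Pooling2D", "bottom": "conv1",
        "params": {"function": "MAX", "kernel_size": 2, "stride": 2, "padding": "valid"}}
keras_line = to_keras(node)
caffe_block = to_caffe(node)
```

The sketch leaves out the padding translation on the Caffe side, since Caffe expresses padding as an integer `pad` field rather than a `valid`/`same` mode. Supporting a new layer type would amount to adding one dictionary entry and one template per platform.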

Current Layer Dense Conv2D Flatten Dropout MaxPool AvgPool Concat Embed RNN RNN (seq) LSTM LSTM (seq)
Dropout Same as previous layer
Concat If input is one dimensional, same as Dense layer; else same as previous layer
RNN (seq)
LSTM (seq)
Table 1: The proposed grammar for creating valid deep learning design models defining the list of possible next layers for a given current layer.
Layer        Hyper-parameters
Dense        #nodes - {[5:5:500]}
Dropout      probability - {[0:0.1:1]}
Conv2D       #filters - {[16:16:256]}
             filter size - {[1:2:11]}
MaxPool      stride - {[2:1:5]}
             filter size - {[1:2:11]}
AvgPool      stride - {[2:1:5]}
             filter size - {[1:2:11]}
Embed        embed size - {64, 100, 128, 200}
             vocab - {[10000, 20000, 50000, 75000]}
SimpleRNN    #units - {[3:1:25]}
LSTM         #nodes - {[3:1:25]}
InputData    MNIST - {28, 28, 1}
             CIFAR-10 - {32, 32, 3}
             ImageNet - {224, 224, 3}
Table 2: The set of hyper-parameter options considered for each layer in our simulated dataset generation. The parameter value [a:b:c] means a list of values from a to c in steps of b.
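The range notation used in Table 2 can be expanded with a small helper. The function name `expand_range` is illustrative, and the sketch reads each range as start, step, end with the end value included, which matches the entries in the table.

```python
def expand_range(start, step, stop):
    """Expand a Table 2 entry [start:step:stop] into an explicit, inclusive list."""
    count = int(round((stop - start) / step)) + 1
    # Rounding guards against floating-point drift for fractional steps.
    return [round(start + i * step, 10) for i in range(count)]


dense_nodes = expand_range(5, 5, 500)         # 5, 10, ..., 500
dropout_probs = expand_range(0, 0.1, 1)       # 0.0, 0.1, ..., 1.0
filter_sizes = expand_range(1, 2, 11)         # 1, 3, 5, 7, 9, 11
```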

Evaluation on Simulated Dataset

The aim of this process is to simulate and generate ground truth deep learning designs and their corresponding flow visualization figures, so that the proposed pipeline can be quantitatively evaluated. To this end, we observed that both Keras and Caffe have a built-in visualization routine for a DL design model. Further, both Keras and Caffe have an internal DL model validator, and a visualization can be exported only when the simulated design is deemed valid by the validator.

Grammar for DL Design Generation

To be able to generate meaningful and valid DL design models, we manually defined a grammar for the model flow as well as for the hyper-parameters. We considered the following unique layers for our dataset simulation: {Conv2D, MaxPool2D, AvgPool2D} for building convolutional neural network like architectures, {Embed, SimpleRNN, LSTM} for building recurrent neural network like architectures, and {Dense, Flatten, Dropout, Concat} as the core layers. The use of Concat enables our designed models to be non-sequential and to combine recurrent and convolutional architectures. This allows us to create random, complex, and highly varying DL models. Also, RNN and LSTM layers have an additional binary parameter, return seq, which when set to true returns the output of every hidden cell and otherwise returns the output of only the last hidden cell in the layer. Table 1 explains the proposed grammar for generating DL design models. The grammar defines the set of all possible next layers for a given current layer. This is determined by the shape of the tensor flowing through each layer’s operation. For example, a Dense layer strictly expects its input to be a one-dimensional vector. Thus, a Dense layer cannot appear after a Conv2D layer without an intervening Flatten layer. The proposed grammar further includes the set of possible values for each hyper-parameter of a layer, as explained in Table 2. While hyper-parameter values beyond the defined bounds are possible, the table values indicate the set of values assumed in the model simulation process.
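A next-layer grammar of this kind can be sketched as a lookup table plus a validity check. The specific next-layer sets below are assumptions for a convolutional, image-input subset of the layers, since the full contents of Table 1 are not reproduced here; the only rule taken directly from the table is that Dropout inherits the options of the layer before it.

```python
# Assumed next-layer sets for a CNN-only subset of the grammar.
SPATIAL_NEXT = {"Conv2D", "MaxPool", "AvgPool", "Dropout", "Flatten"}
VECTOR_NEXT = {"Dense", "Dropout"}
NEXT_LAYERS = {
    "Input": SPATIAL_NEXT,   # assumes an image input such as MNIST
    "Conv2D": SPATIAL_NEXT,
    "MaxPool": SPATIAL_NEXT,
    "AvgPool": SPATIAL_NEXT,
    "Flatten": VECTOR_NEXT,
    "Dense": VECTOR_NEXT,
}


def allowed_next(history):
    """Allowed next layers; Dropout defers to the layer that precedes it."""
    for layer in reversed(history):
        if layer != "Dropout":
            return NEXT_LAYERS[layer]
    raise ValueError("a model must start with a non-Dropout layer")


def is_valid_model(layers):
    """Check that a layer sequence starts at Input, ends at Dense, and obeys the grammar."""
    if not layers or layers[0] != "Input" or layers[-1] != "Dense":
        return False
    return all(layers[i] in allowed_next(layers[:i]) for i in range(1, len(layers)))
```

Under these assumed sets, `["Input", "Conv2D", "Dense"]` is rejected while `["Input", "Conv2D", "MaxPool", "Flatten", "Dense"]` is accepted, mirroring the Flatten example in the text.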

Simulated Dataset

A model simulation starts with an Input layer, for which there are four possible options: MNIST, CIFAR, ImageNet, and IMDBText. From the set of all possible next layers for the given Input layer, a layer is chosen at random. For the chosen layer, a random value is picked for every hyper-parameter. For example, with MNIST as the input layer, Conv2D could be picked as the random next layer. Then, for Conv2D, the hyper-parameters (number of filters, filter size, and stride) are determined randomly. The model design always ends with a Dense layer whose number of nodes equals the number of classes of the corresponding Input layer.

The number of layers in between the Input layer and the final Dense layer denotes the depth of the DL model. For our simulation, we generated DL models over a range of depths. Each model contains the Keras JSON representation, Keras image visualization, Caffe protobuf files, and Caffe image visualization, resulting in a total of over 216,000 DL model design visualizations. These models are valid by construction since they follow a well-defined grammar. However, these models need not be the best from an execution perspective, or with respect to their training performance.
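The simulation loop can be sketched as random sampling constrained by a grammar. The sketch below covers only an image-input, convolutional subset; the two-state grammar (`spatial` before Flatten, `flat` after it), the forced Flatten in the last slot, and the class counts are assumptions for illustration, while the hyper-parameter ranges follow Table 2. It does not check tensor shapes, which the paper delegates to the Keras and Caffe validators.

```python
import random

# Hyper-parameter options, following the ranges listed in Table 2.
HYPER_PARAMETERS = {
    "Dense": {"nodes": range(5, 501, 5)},
    "Dropout": {"probability": [i / 10 for i in range(11)]},
    "Conv2D": {"filters": range(16, 257, 16), "filter_size": range(1, 12, 2)},
    "MaxPool": {"stride": range(2, 6), "filter_size": range(1, 12, 2)},
    "AvgPool": {"stride": range(2, 6), "filter_size": range(1, 12, 2)},
    "Flatten": {},
}
# Assumed simplified grammar: which layers may follow in each tensor state.
CHOICES = {
    "spatial": ["Conv2D", "MaxPool", "AvgPool", "Dropout", "Flatten"],
    "flat": ["Dense", "Dropout"],
}
# Assumed class counts for the image datasets.
NUM_CLASSES = {"MNIST": 10, "CIFAR-10": 10, "ImageNet": 1000}


def simulate_model(dataset, depth, rng):
    """Sample `depth` intermediate layers between Input and the final Dense layer."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    model = [{"type": "Input", "params": {"dataset": dataset}}]
    state = "spatial"
    for position in range(depth):
        last_slot = position == depth - 1
        if state == "spatial" and last_slot:
            layer_type = "Flatten"  # the final Dense layer needs a vector input
        else:
            layer_type = rng.choice(CHOICES[state])
        params = {name: rng.choice(values)
                  for name, values in HYPER_PARAMETERS[layer_type].items()}
        model.append({"type": layer_type, "params": params})
        if layer_type == "Flatten":
            state = "flat"
    model.append({"type": "Dense", "params": {"nodes": NUM_CLASSES[dataset]}})
    return model


example = simulate_model("MNIST", 5, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible, which is convenient when the sampled designs serve as ground truth.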

Observation Train Validation Test
Naive Bayes
Decision Tree
Logistic Regression
SVM (RBF Kernel)
Neural Network
Table 3: The performance of various binary classifiers in distinguishing Keras and Caffe visualizations from other frequently occurring images in research papers.

Figure Type Classification Performance

In this experiment, a binary NNet classifier with two hidden layers is trained on the fc2 features of the VGG19 model to differentiate simulated DL visualizations from a set of other kinds of diagrams often found in research papers (scraped from PDFs). The whole dataset is split into training, validation, and test sets. The performance of the NNet classifier is compared with six different classifiers, as shown in Table 3. The results indicate that, from the set of figures obtained from a research paper, it is possible to single out the deep learning design flow diagrams. All the classifiers use the default parameters provided by the scikit-learn package.

Computational Graph Extraction Performance

In this experiment, the performance of flow and content extraction from the Keras and Caffe visualizations is evaluated against the ground truth. By performing OCR on the extracted flow, the unique layer names are obtained, and two detection accuracies are reported:

  1. blob (or layer) detection accuracy: evaluates the performance of blob detection and layers identified using OCR and is computed as the ratio of correct blobs detected per model (in percent)

  2. edge detection accuracy: evaluates the performance of the detected flow and is computed as the ratio of correct arrows detected per model (in percent)
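Both measures can be computed per model with the same helper. The sketch assumes that layers are identified by unique names, that edges are directed `(source, target)` pairs, and that the denominator is the number of ground-truth items; the paper does not spell out these details.

```python
def detection_accuracy(detected, ground_truth):
    """Percentage of ground-truth items that also appear among the detections."""
    truth = set(ground_truth)
    if not truth:
        return 100.0
    return 100.0 * len(truth & set(detected)) / len(truth)


true_layers = ["input", "conv", "pool", "dense"]
found_layers = ["input", "conv", "dense"]
true_edges = [("input", "conv"), ("conv", "pool"), ("pool", "dense")]
found_edges = [("input", "conv"), ("pool", "conv"), ("pool", "dense")]

blob_accuracy = detection_accuracy(found_layers, true_layers)   # one layer missed
edge_accuracy = detection_accuracy(found_edges, true_edges)     # one arrow reversed
```

Treating edges as ordered pairs means that an arrow detected with the wrong direction counts as an error, which matches the intent of evaluating the flow rather than mere connectivity.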

Figure 6 is a box plot showing the performance of the proposed figure extraction pipeline on both Keras and Caffe visualizations. As can be observed, the proposed pipeline extracts layers more accurately than edges. As the edges can be curved and can be of any length, even connecting the first layer with the last, these variations reduce edge detection performance. For further details on extracting flow information from Keras and Caffe visualizations, and for additional results, kindly refer to the supplementary material.

Figure 6: Box plots showing the performance accuracy of flow detection in Keras and Caffe visualizations.

Results on Deep Learning Scholarly Papers

Papers were downloaded from arXiv.org using “deep learning” as the input query. Figures were extracted from these downloaded papers; some of them contained a DL design flow, while the rest did not. The latter represent the usual figures found in a deep learning research paper that do not contain a design flow.

Figure Type Classification Accuracy

To evaluate the coarse level binary classification, a two hidden layer NNet was trained on the fc2 features obtained from the images extracted from research papers. The whole dataset is split into training, validation, and test sets, and the results are computed for seven different classifiers, as shown in Table 4.

Further, to evaluate the fine level, five-class figure type classification, the DL design flow diagrams were manually labelled with one of the five types: (i) Neurons plot, (ii) 2D box, (iii) Stacked2D box, (iv) 3D box, and (v) Pipeline plot. A train, validation, and test split is used to train the NNet classifier and compare it with the six other classifiers on this five-class classification. The results are tabulated in Table 5. They show that the type of DL flow can be identified even on highly varying DL flow design images. For more details on the experimental analysis, please refer to the supplementary material.

Observation Train Validation Test
Naive Bayes
Decision Tree
Logistic Regression
SVM (RBF Kernel)
Neural Network
Table 4: The performance of coarse level binary classifier to distinguish DL design flow figures from other figures that usually appear in a research paper.
Observation Train Validation Test
Naive Bayes
Decision Tree
Logistic Regression
SVM (RBF Kernel)
Neural Network
Table 5: The performance of fine level, five class classifier to identify the type of DL design flow figure obtained from the research paper.
Figure 7: An arXiv-like website where DL papers, along with their extracted designs and generated source code in Caffe and Keras, are made available.
Figure 8: An intuitive drag-and-drop UI based framework to edit the extracted designs and make them publicly available.

Crowdsourced Improvement of Extracted Designs

Using the proposed DLPaper2Code framework, we extracted the DL design models for all the downloaded papers. However, quantitatively evaluating the extracted design flow is challenging due to the lack of a ground truth. Hence, we created an arXiv-like website, as shown in Figure 7, where the papers, the corresponding designs, and the generated source code are available. The DL community can rate the extracted designs, which acts as feedback on, or a performance measure of, our automated approach.

Further, an intuitive drag-and-drop UI framework was built for the community to edit the generated DL flow design, as shown in Figure 8. Ideally, the respective paper’s authors or the DL community would edit the generated designs wherever an error is found. The edited design can then be made publicly available for other researchers to reproduce the design. Further, our system can generate the source code of the edited design in both Keras and Caffe, in real-time. Thus, we gain a two-fold advantage through this UI system: (i) the public system can act as a one-stop repository for any DL paper and its corresponding design flow and source code, and (ii) the feedback the community provides enables us to continuously learn and improve the system. (To adhere to the double-blind submission policy of AAAI, the system URL is not provided in this version of the paper.)

Conclusion and Discussion

Thus, researchers and developers need not struggle any further to reproduce research papers in deep learning. Using this research, the DL design explained in a research paper can be extracted automatically. Using the intuitive drag-and-drop UI editor that we designed as a part of this research, the extracted design can be manually edited and perfected. Further, for an extracted DL design, the source code can be generated in Keras (Python) and Caffe (prototxt), in real-time. The proposed DLPaper2Code framework extracts both figure and table information from a research paper and converts it into source code. Currently, an arXiv-like website has been created that contains the DL design and the source code for 5,000 research papers. To evaluate our approach, we simulated a dataset of unique deep learning designs validated by a proposed grammar, together with their corresponding Keras and Caffe visualizations. On a combined dataset of deep learning model visualization diagrams and diagrams that appeared in deep learning research papers but did not contain a model visualization, we evaluated the proposed binary classification using the NNet classifier. The accuracy of extracting the generic computational graph from figures using the proposed pipeline is more than 93%. While this research could have a high impact on the reproducibility of DL research, we have planned several possible extensions of the proposed pipeline:

  1. The proposed pipeline detects only the layers (blobs) and the edges from the diagram. It could be extended to detect and extract the hyper-parameter values of each layer, to make the computational graph more content rich.

  2. Currently, we have two independent pipelines for generating abstract computational graphs from tables and figures. Combining the information obtained from the multi-modal sources could enhance the accuracy of the extracted DL design flow.

  3. The entire DLPaper2Code framework could be extended to support additional libraries, apart from Keras and Caffe, such as Torch and TensorFlow.

  4. The broader aim would be to propose a standard definition for representing DL model designs in research papers, achieving uniformity and better readability. Further, authors of future papers could also release their designs on the created website for easy accessibility to the community.