ChemGrapher: Optical Graph Recognition of Chemical Compounds by Deep Learning

02/23/2020
by Martijn Oldenhof, et al.

In drug discovery, knowledge of the graph structure of chemical compounds is essential. Many thousands of scientific articles in chemistry and pharmaceutical sciences have investigated chemical compounds, but in many cases the structure of these compounds is published only as an image. A tool to analyze such images automatically and convert them into a chemical graph structure would be useful for many applications, such as drug discovery. A few such tools are available, mostly derived from optical character recognition. However, our evaluation of these tools reveals that they often make mistakes in detecting the correct bond multiplicity and stereochemical information; errors sometimes even lead to missing atoms in the resulting graph. In our work, we address these issues by developing a compound recognition method based on machine learning. More specifically, we develop a deep neural network model for optical compound recognition. The deep learning solution presented here consists of a segmentation model followed by three classification models that predict atom locations, bonds and charges. Furthermore, this model not only predicts the graph structure of the molecule but also produces all information necessary to relate each component of the resulting graph to the source image. The solution is scalable and can rapidly process thousands of images. Finally, we compare the proposed method empirically to a well-established tool and observe significant error reductions.


1 Introduction

Knowledge of the chemical structure of compounds is central in drug discovery because this structure determines the properties of the compound. It is for example used for drug candidate selection. Because billions of euros of research and development investment are needed to successfully bring a new drug to the market, any tool that improves the drug candidate selection process would have a significant pharmaceutical impact.

Although chemical structures, which are the familiar graph drawings of molecules, do lose some information about the electronic structure of a molecule (which is actually responsible for its chemical properties), they are powerful and effective abstractions. To query such structures or apply machine learning, we need to start from a well-structured data set encoding the graph representation of the chemical structure. This encoding step, which is usually less flexible than an arbitrary drawing, might also lose some information about the chemical structure, but it provides a solid starting point for further automated processing. Useful formats for representing chemical compounds are, for example, SMILES [27] and MOLfile [6], which contain all necessary information to rebuild the complete molecular graph structure. Using these formats, it would for example be possible to query documents for specific patterns in chemical compounds. However, such encodings remain somewhat cumbersome and are not yet systematically available, in particular for unstructured legacy data.
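To make this concrete, the following minimal sketch uses RDKit (a toolkit we also use later for data generation) to show how a SMILES string already encodes the complete graph; the example molecule (aspirin) and variable names are ours:

from rdkit import Chem

# Parse a SMILES string into a molecule object.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# The encoding contains everything needed to rebuild the graph:
# atoms are the nodes, bonds are the edges.
nodes = [(a.GetIdx(), a.GetSymbol(), a.GetFormalCharge()) for a in mol.GetAtoms()]
edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx(), b.GetBondTypeAsDouble())
         for b in mol.GetBonds()]

print(nodes[:3])  # e.g. [(0, 'C', 0), (1, 'C', 0), (2, 'O', 0)]
print(edges[:3])  # e.g. [(0, 1, 1.0), (1, 2, 2.0), (1, 3, 1.0)]

Recognizing a compound from an image amounts to recovering exactly these node and edge lists.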

Thousands of scientific publications describe new chemical compounds and investigate their properties. However, the structure of these compounds is usually described in the publication only as an image. This means that a rich source of data, which would be extremely valuable to develop novel machine learning approaches or simply to query documents more accurately, is today largely under-exploited. It is therefore important to convert images of chemical structures into these formats. A few tools for recognizing graph structures from chemical compound images are available, such as OSRA [8], ChemReader [18], Kekule [17], CLiDE Pro [25], and the work of Sadawi et al. However, we observe that these tools sometimes lose bond multiplicity and stereochemical information. They are mainly expert systems using techniques such as image processing, optical character recognition, hand-coded rules and sophisticated algorithms, so modifying or further improving these tools requires a lot of effort. A tool based on machine learning, which learns directly from training data, would be more valuable: it could potentially become more accurate than existing methods, and its performance could be improved by increasing the size and diversity of the data sets instead of having to modify its code.

Therefore, we propose a new data-driven, machine learning based tool that learns, from image data alone, to recognize the chemical structure graph in an image of a chemical structure. The core of the tool is a deep learning model. Staker et al. also proposed a deep learning model, but there the output is only a text sequence representing the graph. In our approach, we focus on directly predicting the graph structure, i.e., identifying all the nodes and edges and their labels. The positions of the nodes and edges in the resulting graph correspond to positions in the original image of the chemical structure. The resulting graph can later be translated to any format (e.g., SMILES).

In the next sections, we will describe the method and the neural networks used, and also how the different networks interact. Next, we describe the data sets used for training. Then we focus on the performance and scalability of our method, and conclude with possible future work.

2 Related work and Background

A deep convolutional neural network [15] is the type of network most often used for image recognition. These convolutional neural networks can be split into two main types: (1) image segmentation networks and (2) classification networks. We combine both approaches in our graph recognition tool.

2.1 Image segmentation

Our work builds upon the recent developments in image segmentation. Different machine learning approaches can be used for the segmentation of images. One well-established approach is U-Net [20], which uses a network that combines a contracting path and an expanding path. Several other works are based on the U-Net approach, such as Jansson et al., where a U-Net is used to extract the vocal component from music. Other works extended the U-Net approach, such as Çiçek et al., which generalizes it to 3D images. Another approach is to stack dilated convolutions [30] without loss of resolution. An advantage of dilated convolutions is that the receptive field can grow exponentially by increasing the dilation in the dilated convolutional operator, as sketched below. This is computationally cheaper than stacking many ordinary convolutions or using larger kernels.
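The following sketch illustrates the exponential growth of the receptive field (in PyTorch; the channel widths are illustrative and not those of our networks):

import torch
import torch.nn as nn

# A stack of 3x3 convolutions whose dilation doubles per layer. With dilations
# 1, 1, 2, 4, 8 the receptive field grows to 33x33 pixels after only five
# layers, while the padding keeps the spatial resolution unchanged.
layers = [nn.Conv2d(1, 8, 3, padding=1), nn.ReLU()]
for dilation in (1, 2, 4, 8):
    layers += [nn.Conv2d(8, 8, 3, padding=dilation, dilation=dilation), nn.ReLU()]
net = nn.Sequential(*layers)

x = torch.randn(1, 1, 128, 128)  # dummy single-channel image
print(net(x).shape)              # torch.Size([1, 8, 128, 128]) -- resolution preserved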

2.2 Image classification

There has been a trend to create deeper and deeper neural networks to improve performance for the classification of images. However, deeper networks are more difficult to train. To improve the trainability of such deep networks, methods such as residual neural networks (ResNet) [11] were developed. It has also been shown that these residual neural networks are comparable in performance and behavior to an ensemble of more shallow networks [26]. It is also worth mentioning the work of Zhang et al., where the concept of ResNet is combined with the concept of U-Net for the localization of roads in aerial images.

2.3 Drug discovery and machine learning

There are several stages in the process of drug discovery, going from basic research and drug candidate selection to the development phase, clinical trials and, finally, production. As development progresses, sunk costs accumulate, so the cost of a project's failure increases. Failing early is thus important to contain the costs of drug discovery. Predicting risks of failure later in the discovery process (for example, by predicting the toxicity of a compound) without draining the pipeline (enough candidate compounds need to remain available) is essential. Machine learning techniques can be used in all stages of drug discovery. Chen et al. give a good overview of recent uses of deep learning in drug discovery. We highlight some of these recent applications that are interesting in the context of our graph recognition tool.

In the first place, there is the work of Xu et al. and Gómez-Bombarelli et al., where an unsupervised method is used to extract features from SMILES input data. SMILES (Simplified Molecular Input Line Entry System) [27] is a text representation of a chemical compound following specific syntactic rules. The unsupervised learning methods in both works are based on the auto-encoder principle. The resulting feature vector can then be used as input to a supervised method that learns to predict molecular properties.

Another interesting method to predict molecular properties of a chemical compound is the neural graph fingerprint presented in Duvenaud et al., a way to represent and encode a chemical compound. Here, a convolutional neural network takes the molecular graph as input and is trained to predict molecular properties. Similarly, in Kearnes et al., Coley et al., Simm et al. and Pires et al., a machine learning model takes a molecular graph as input.

Large amounts of data are needed to use or train the models mentioned above, and such data is not always easy to find. This is where our tool is useful: it extracts graph representations of chemical compounds directly from images. It is also worth mentioning the work presented in Goh et al., where no graph representation of the chemical compound is needed: a machine learning model is trained to predict molecular properties directly from images of chemical structures.

2.4 Stereoisomerism

Stereochemical information can also be encoded in a 2D representation of a molecule. This information is important to differentiate molecules with the same molecular formula but a different spatial orientation. To encode central chirality, different types of lines are used to represent bonds in the 2D representation of a molecule: solid lines, wedge-shaped lines or dashed lines [23]. It is important that our graph recognition tool captures this information.
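As a small RDKit illustration of how chirality distinguishes molecules with the same formula (the example molecule is ours):

from rdkit import Chem

# Two SMILES with the same molecular formula but opposite chirality markers
# ('@' vs '@@'), corresponding to the wedge/dash drawing convention.
mol_1 = Chem.MolFromSmiles("C[C@H](N)C(=O)O")
mol_2 = Chem.MolFromSmiles("C[C@@H](N)C(=O)O")

# The chiral centers get opposite R/S assignments, e.g. [(1, 'R')] vs [(1, 'S')].
print(Chem.FindMolChiralCenters(mol_1))
print(Chem.FindMolChiralCenters(mol_2))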

3 Problem Statement

In this section, we formulate our learning task. The goal of the proposed method is to learn a function that maps an image $I$ to its graph representation $G$.

Definition 1

$I \in \mathbb{R}^{h \times w}$ represents a single-channel 2D image with dimensions $h \times w$.

Generalization to multiple channels is straightforward if colored inputs are available.

Definition 2

$G = (V, E)$ represents a graph with vertices $V$ and edges $E$.

For our graph recognition tool to work, we need to learn the following function:

$f_{\mathrm{graph}} : I \mapsto G$ (1)

This function maps a 2D input image of a chemical structure to the graph representation of the molecule. To learn it, we make the following assumption about the training data set:

Assumption 1

We assume we know the location $(x_i, y_i)$ of every node $i \in \{1, \dots, n\}$ in the graph, with $n$ the number of nodes, for every image in our labeled training data set. We also assume knowledge of all inter-node connections (edges) $E$ for all labeled images.

Definition 3

The training data set is then defined as $\mathcal{D} = \{(I_k, L_k, G_k)\}_{k=1}^{N}$, with $L_k$ the node locations of image $I_k$.
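A minimal sketch of how one training element of $\mathcal{D}$ could be represented in code (the field names are ours):

from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class TrainingExample:
    """One labeled element of the training set D of Definition 3."""
    image: np.ndarray                      # single-channel 2D image, shape (h, w)
    node_locations: List[Tuple[int, int]]  # (x_i, y_i) pixel location of every node
    node_labels: List[str]                 # atom type of every node, e.g. "C", "N"
    edges: List[Tuple[int, int, str]]      # (i, j, bond_type) inter-node connections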

4 Model

To solve the problem stated in the previous section, we build a machine learning model. The model is split into different learning tasks, which we define here.

4.1 First task: segment type segmentation

The first learning task is to learn to segment a 2D image of a chemical structure into different segments, where each segment represents the location of a specific atom, charge or bond type in the image. The image $I$ was already defined in the previous section; the segmentation of this image is defined here.

Definition 4

$S_a \in \mathbb{R}^{h \times w \times n_a}$, $S_b \in \mathbb{R}^{h \times w \times n_b}$ and $S_c \in \mathbb{R}^{h \times w \times n_c}$ represent the atom type, bond type and charge segmentation of an image. $h$ and $w$ are the same as in the input image, while $n_a$, $n_b$ and $n_c$ are respectively the number of atom types, bond types and charges (including the empty atom, bond and charge type) present in the compound.

To perform image segmentation, we need to learn the following function:

$f_{\mathrm{seg}} : I \mapsto (S_a, S_b, S_c)$ (2)

To learn this function, we need to label the training elements.

Definition 5

Let $Y_a \in \{1, \dots, n_a\}^{h \times w}$, $Y_b \in \{1, \dots, n_b\}^{h \times w}$ and $Y_c \in \{1, \dots, n_c\}^{h \times w}$ represent the pixelwise true labels. $h$ and $w$ are the same as in the input image; $n_a$, $n_b$ and $n_c$ are respectively the number of atom types, bond types and charges (including the empty atom, bond and charge type). The value of every element represents the atom, bond or charge type to which the corresponding pixel belongs.

Once the true labels for the training data have been defined, we can define the loss function for training. Here, we use the cross entropy loss, which is defined as

$H(p, q) = -\sum_{i=1}^{C} p_i \log q_i$ (3)

where $p$ is the true probability distribution of the labels, $q$ is the estimated probability distribution of the labels, and $C$ is the number of different classes. In the case of atom type segmentation, the cross entropy loss is calculated and summed for every pixel prediction (so fixing $x$ and $y$) in the following way, taking into account that $S_a$ is not a probability distribution:

$\mathcal{L}_a = -\sum_{x=1}^{h} \sum_{y=1}^{w} \log \big( \mathrm{softmax}(S_a[x, y, :])_{Y_a[x, y]} \big)$ (4)

The losses $\mathcal{L}_b$ and $\mathcal{L}_c$ in the case of bond type segmentation and charge segmentation are calculated similarly. The total loss is the sum of all partial losses:

$\mathcal{L}_{\mathrm{seg}} = \mathcal{L}_a + \mathcal{L}_b + \mathcal{L}_c$ (5)
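A sketch of Equations (3)-(5) in PyTorch (the framework choice and names are ours; F.cross_entropy applies the log-softmax that Equation (4) makes explicit):

import torch.nn.functional as F

def segmentation_loss(s_atom, s_bond, s_charge, y_atom, y_bond, y_charge):
    """Total loss of Equation (5) as a sum of the pixelwise losses of Equation (4).

    s_*: raw network outputs of shape (batch, n_classes, h, w); they are not
         probability distributions, so cross_entropy applies a log-softmax first.
    y_*: integer class labels of shape (batch, h, w).
    """
    loss_a = F.cross_entropy(s_atom, y_atom, reduction="sum")
    loss_b = F.cross_entropy(s_bond, y_bond, reduction="sum")
    loss_c = F.cross_entropy(s_charge, y_charge, reduction="sum")
    return loss_a + loss_b + loss_c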

4.2 Second task: segment type classification

A second learning task is necessary to build the final graph. This learning task classifies parts of the segmented image into the different possible atom, bond and charge types. One part of the input used in this learning task is defined in the following way:

Definition 6

$\hat{S}_a \in \mathbb{R}^{d \times d \times n_a}$, $\hat{S}_b \in \mathbb{R}^{d \times d \times n_b}$ and $\hat{S}_c \in \mathbb{R}^{d \times d \times n_c}$, where $d$ is 2 times the regular bond length in a 2D image of a chemical structure and $n_a$, $n_b$ and $n_c$ are respectively the number of different atom types, bond types and charges (including the empty types). The tensors $\hat{S}_a$, $\hat{S}_b$ and $\hat{S}_c$ represent cut-outs of the tensors $S_a$, $S_b$ and $S_c$.

Another part of the input used in this learning task is defined as:

Definition 7

$\hat{I}_a$, $\hat{I}_b$, $\hat{I}_c \in \mathbb{R}^{d \times d}$ represent cut-outs of the original 2D image, where $d$ is 2 times the regular bond length in a 2D image of a chemical structure.

Next, we define the output used in this learning task.

Definition 8

$p_a$, $p_b$ and $p_c$ are vectors of dimension $n_a$, $n_b$ and $n_c$, where $n_a$, $n_b$ and $n_c$ are respectively the number of different atom types, bond types and charges (including the empty types). The vectors $p_a$, $p_b$ and $p_c$ represent respectively the atom type, bond type and charge predictions.

With these definitions, we can now also define the functions to be learned in this task:

$f_{\mathrm{atom}} : (\hat{S}_a, \hat{I}_a, H_a) \mapsto p_a$ (6)
$f_{\mathrm{bond}} : (\hat{S}_b, \hat{I}_b, H_b) \mapsto p_b$ (7)
$f_{\mathrm{charge}} : (\hat{S}_c, \hat{I}_c, H_c) \mapsto p_c$ (8)

where $H_a$, $H_b$ and $H_c$ are highlights marking the candidate location to be classified (see Section 6).

To learn these functions, we need the labels of the training data.

Definition 9

Let $y_a \in \{1, \dots, n_a\}$, $y_b \in \{1, \dots, n_b\}$ and $y_c \in \{1, \dots, n_c\}$ respectively represent the true atom type, bond type and charge label of each training element, where $n_a$, $n_b$ and $n_c$ are respectively the number of atom types, bond types and charges (including the empty atom, bond and charge type).

Finally, we also define the loss function used in the training phase for learning function $f_{\mathrm{atom}}$:

$\mathcal{L}_{\mathrm{atom}} = -\log \big( \mathrm{softmax}(p_a)_{y_a} \big)$ (9)

The loss functions $\mathcal{L}_{\mathrm{bond}}$ and $\mathcal{L}_{\mathrm{charge}}$ for the functions $f_{\mathrm{bond}}$ and $f_{\mathrm{charge}}$ can be defined similarly.

5 Graph Building Algorithm

Once we have learned the functions described in the previous section, we need an algorithm that combines their outputs and builds the final graph structure. We propose the procedure defined in Algorithm 1:

Data: Image tensor $I$
Result: Graph $G = (V, E)$
$(S_a, S_b, S_c) = f_{\mathrm{seg}}(I)$; $V = [\,]$
for every candidate atom location $l$ in $\mathrm{gen}_{\mathrm{atom}}(S_a)$ do
       $\hat{S}_a = \mathrm{cut}(S_a, l)$, $\hat{S}_c = \mathrm{cut}(S_c, l)$, $\hat{I} = \mathrm{cut}(I, l)$, $H = \mathrm{highlight}(l)$
       $p_a = f_{\mathrm{atom}}(\hat{S}_a, \hat{I}, H)$, $p_c = f_{\mathrm{charge}}(\hat{S}_c, \hat{I}, H)$
       if $\arg\max p_a \neq \mathrm{empty}$ then
              append $(l, \arg\max p_a, \arg\max p_c)$ to $V$
       end if
end for
$E = [\,]$
for every candidate bond location $b$ in $\mathrm{gen}_{\mathrm{bond}}(V)$ do
       $\hat{S}_b = \mathrm{cut}(S_b, b)$, $\hat{I} = \mathrm{cut}(I, b)$, $H = \mathrm{highlight}(b)$
       $p_b = f_{\mathrm{bond}}(\hat{S}_b, \hat{I}, H)$
       if $\arg\max p_b \neq \mathrm{empty}$ then
              append $(b, \arg\max p_b)$ to $E$
       end if
end for
Algorithm 1 Graph building algorithm

Algorithm 1 first applies the segmentation function $f_{\mathrm{seg}}$ to the input image. Next, given the atom segmentation $S_a$, candidate locations are generated by $\mathrm{gen}_{\mathrm{atom}}$. Given these candidate locations, the nodes of the graph can be built iteratively. For this purpose, the segmentations $S_a$ and $S_c$ are cut ($\mathrm{cut}$) into smaller segments $\hat{S}_a$ and $\hat{S}_c$ around every candidate location $l$. At the same time, the original image is also cut into a smaller part $\hat{I}$. An extra highlight $H$ is created, which marks the candidate location to be classified. Then, the classification functions $f_{\mathrm{atom}}$ and $f_{\mathrm{charge}}$ are applied to determine which atom and charge are located at the candidate location. If the candidate location is not empty, the location, atom type and charge are added to the list of nodes $V$. Next, the algorithm uses these nodes to build the edges of the graph: it first generates ($\mathrm{gen}_{\mathrm{bond}}$) the candidate bond locations from the predicted nodes. As for the nodes, the bonds of the graph are built iteratively. The segmentation $S_b$ is cut into a smaller segment $\hat{S}_b$ around every candidate bond location $b$, the original image is cut into a smaller part $\hat{I}$, and an extra highlight $H$ marks the bond location to be classified. Finally, the classification function $f_{\mathrm{bond}}$ is applied to determine the type of bond located at the candidate bond location. If the candidate bond location is not empty, the location and bond type are added to the list of bonds $E$.
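A Python sketch of Algorithm 1; the helper names (gen_atom_candidates, cut, highlight, and so on) are placeholders for the steps described above, not a definitive implementation:

EMPTY = 0  # index of the empty class in every classifier's output

def build_graph(image, f_seg, f_atom, f_charge, f_bond,
                gen_atom_candidates, gen_bond_candidates, cut, highlight):
    """Sketch of Algorithm 1: assemble nodes V and edges E from the learned networks."""
    s_atom, s_bond, s_charge = f_seg(image)    # segmentation of the full image

    nodes = []
    for loc in gen_atom_candidates(s_atom):    # candidate atom locations
        crop_img, mark = cut(image, loc), highlight(loc)
        atom = f_atom(cut(s_atom, loc), crop_img, mark).argmax()
        charge = f_charge(cut(s_charge, loc), crop_img, mark).argmax()
        if atom != EMPTY:                      # keep only non-empty candidates
            nodes.append((loc, atom, charge))

    edges = []
    for pair in gen_bond_candidates(nodes):    # candidate bonds between node pairs
        bond = f_bond(cut(s_bond, pair), cut(image, pair), highlight(pair)).argmax()
        if bond != EMPTY:
            edges.append((pair, bond))

    return nodes, edges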

6 Deep learning implementation

The method we use in the graph recognition tool is a combination of different convolutional neural networks [15]. First, we have a semantic segmentation network based on dense prediction with dilated convolutions [21; 30], followed by three classification networks. The output of the segmentation network is part of the input of each classification network.

6.1 Semantic segmentation network

Before feeding the image to the segmentation network $f_{\mathrm{seg}}$, the image is preprocessed into a binary black-and-white image. The output of the segmentation network consists of different channels predicting, for every pixel in the image, the class it belongs to. The possible classes represent the different atom types, bond types and charges. For the implementation of this network, we build on the concept of dilated convolution described in Yu and Koltun.

6.1.1 Network architecture

The network has 8 3x3 convolutional layers, of which 6 use dilation. All convolutional layers are followed by a Rectified Linear Unit (ReLU). The last layer is a linear 1x1 layer. Padding is used so that the resolution of the channels does not change. The padding and dilation of the different convolutional layers are summarized in Table 1.

Layer Kernel Nonlinearity Padding Dilation
conv1 3x3 ReLU 1 no dilation
conv2 3x3 ReLU 2 2
conv3 3x3 ReLU 4 4
conv4 3x3 ReLU 8 8
conv5 3x3 ReLU 8 8
conv6 3x3 ReLU 4 4
conv7 3x3 ReLU 2 2
conv8 3x3 ReLU 1 no dilation
last 1x1 none no padding no dilation
Table 1: Summary of the layers of the segmentation network
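A PyTorch sketch of the architecture in Table 1 (the channel width is not specified in the table; 32 below is our assumption, as is the total class count n_classes = n_a + n_b + n_c):

import torch.nn as nn

def conv_block(c_in, c_out, padding, dilation):
    # 3x3 convolution (dilation 1 means no dilation) followed by a ReLU
    return [nn.Conv2d(c_in, c_out, kernel_size=3, padding=padding, dilation=dilation),
            nn.ReLU()]

class SegmentationNet(nn.Module):
    """Eight 3x3 convolutions with the padding/dilation listed in Table 1,
    each followed by a ReLU, and a final linear 1x1 layer."""
    def __init__(self, n_classes, width=32):
        super().__init__()
        pad_dil = [(1, 1), (2, 2), (4, 4), (8, 8), (8, 8), (4, 4), (2, 2), (1, 1)]
        layers = conv_block(1, width, *pad_dil[0])        # conv1 on the binary image
        for p, d in pad_dil[1:]:                          # conv2 .. conv8
            layers += conv_block(width, width, p, d)
        layers.append(nn.Conv2d(width, n_classes, kernel_size=1))  # "last" layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):                                 # x: (batch, 1, h, w)
        return self.net(x)                                # (batch, n_classes, h, w)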

6.2 Classification networks

For the atom location prediction, the bond prediction and the charge prediction, we use three separate classification networks. All three networks use part of the output of the segmentation network in their input.

6.2.1 Atom location prediction

For the atom location prediction, part of the output of the segmentation network ($S_a$) is used. This output contains the segmentation of only the atoms in the image. Next, the original binary image ($I$) is also used as input. Finally, candidate locations are created ($\mathrm{gen}_{\mathrm{atom}}$) to mark the part of the image we want to classify; these are also formatted as input and fed into the network. Depending on the candidate location, the inputs can be reduced ($\mathrm{cut}$) so that only the immediate region of the candidate location is included ($\hat{S}_a$, $\hat{I}$). This reduces the computational cost and speeds up the learning of the network. The process is illustrated in Figure 1.

Figure 1: To build the input for the atom classification network ($f_{\mathrm{atom}}$), the output of the segmentation network is cut ($\mathrm{cut}$). This cut-out ($\hat{S}_a$) is shown in the middle. To this, we also add part of the original image ($\hat{I}$), together with a highlight of the candidate location ($H$) of the atom type to be classified. The complete input for $f_{\mathrm{atom}}$ is shown on the right.

The output of this network is a vector ($p_a$) in which every element represents a class prediction of the network. The size of this vector is the number of different atom classes plus one for the empty class (no atom). For every image segmentation, this network has to run several times to classify all candidate locations and obtain all atom predictions in the original image.

6.2.2 Bond prediction

For the bond prediction network, we apply a similar strategy, as illustrated in Figure 2. This time, another part of the output of the segmentation network is used ($S_b$). This output contains the segmentation of only the bonds in the image. Every type of bond is represented in the segmentation as a rectangle. For stereo bonds, we use two rectangles to encode the direction of the bond. Next, as in the atom prediction network, the original binary image ($I$) is also used as input to the bond prediction network. Finally, for the bond prediction, we also need to encode candidate locations. This time, as opposed to the atom prediction network, we use two parts: one rectangle represents the first part of the bond connected to the first atom and another rectangle represents the second part connected to the second atom; the rectangles meet in the middle. By using two rectangles, we can encode the direction of the bond, which is necessary to predict the stereoisomeric bond direction. These candidate locations are generated ($\mathrm{gen}_{\mathrm{bond}}$) from the predictions of the atom location network. Moreover, depending on these locations, we can cut out ($\mathrm{cut}$) the inputs again so that only the immediate region of these candidate pairs is fed into the network.

As with the atom prediction network, the output is again a vector ($p_b$) in which every element represents a class prediction of the network. This time, the vector size is the number of different bond classes plus one for the empty class (no bond).

Figure 2: To build the input for the bond classification network ($f_{\mathrm{bond}}$), part of the segmented image ($S_b$) is cut out ($\mathrm{cut}$). This cut-out ($\hat{S}_b$) is shown in the middle of the figure. To this, we also add part of the original binary input image ($\hat{I}$) and the candidate bond location ($H$) encoded in two parts. The complete input for $f_{\mathrm{bond}}$ is shown on the right.

6.2.3 Charge prediction

As with the bond prediction and the atom prediction networks, a similar strategy is used, illustrated in Figure 3. Again, a part of the output of the segmentation network is used ($S_c$). This output contains the segmentation of only the charges in the image. Every charge is represented by a rectangle located on the atom it applies to. Depending on the candidate location generated by $\mathrm{gen}_{\mathrm{atom}}$, the inputs can again be reduced ($\mathrm{cut}$) so that only the immediate region of the candidate location is included ($\hat{S}_c$, $\hat{I}$). Again, the original image is fed together with the candidate location as input to the charge prediction network.

As with the atom and bond prediction networks, the output is again a vector ($p_c$) in which every element represents a class prediction of the network. This time, the vector size is the number of different charge classes plus one for the empty class (no charge).

Once we have the bond predictions together with the atom and charge predictions, we can build the graph structure of the complete molecule.

Figure 3: To build the input for the charge classification network ($f_{\mathrm{charge}}$), the output of the segmentation network is cut ($\mathrm{cut}$). This cut-out ($\hat{S}_c$) is shown in the middle. To this, we also add part of the original image ($\hat{I}$), together with a highlight of the candidate location ($H$) of the charge type to be classified. The complete input for $f_{\mathrm{charge}}$ is shown on the right.

6.2.4 Network architecture

The three classification networks have similar layer structures. There are 5 convolutional layers, of which 3 are dilated and the first is a depthwise separable convolution [3]. Each convolutional layer is followed by a ReLU. The last layer is a linear 1x1 layer, preceded by a max pooling layer. All layers are summarized in Table 2.

Layer Kernel Nonlinearity Padding Dilation
depthconv1 3x3 ReLU 1 no dilation
conv2 3x3 ReLU 2 2
conv3 3x3 ReLU 4 4
conv4 3x3 ReLU 8 8
conv5 3x3 ReLU 1 no dilation
maxpool 124x124 None no padding no dilation
last 1x1 None no padding no dilation
Table 2: Different layers in the classification network
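A corresponding PyTorch sketch of Table 2 (the channel width is again our assumption; the 124x124 max pool suggests the cut-outs fed to the network are 124x124 pixels):

import torch.nn as nn

class ClassificationNet(nn.Module):
    """Five convolutions following Table 2, ending in a full-area max pool
    and a linear 1x1 layer."""
    def __init__(self, in_channels, n_classes, width=32):
        super().__init__()
        self.net = nn.Sequential(
            # depthconv1: depthwise separable convolution = per-channel 3x3
            # (groups) followed by a 1x1 pointwise convolution mixing channels
            nn.Conv2d(in_channels, in_channels, 3, padding=1, groups=in_channels),
            nn.Conv2d(in_channels, width, 1),
            nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=2, dilation=2), nn.ReLU(),  # conv2
            nn.Conv2d(width, width, 3, padding=4, dilation=4), nn.ReLU(),  # conv3
            nn.Conv2d(width, width, 3, padding=8, dilation=8), nn.ReLU(),  # conv4
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),              # conv5
            nn.MaxPool2d(124),                  # pools the full cut-out to 1x1
            nn.Conv2d(width, n_classes, 1),     # "last": linear 1x1 layer
        )

    def forward(self, x):                       # x: (batch, in_channels, 124, 124)
        return self.net(x).flatten(1)           # logits: (batch, n_classes)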

7 Data Sets

To build the data sets for the segmentation network and the classification networks, we download chemical structures in SMILES format from the ChEMBL [9] database. The around 1.9 million chemical structures are split into 4 parts:

  • a training pool for the segmentation network of 1.5 million chemical structures,

  • a pool of 300K chemical structures used for the validation of the segmentation network and training of classification networks,

  • a pool of around 35K chemical structures for the validation of the classification networks and

  • another pool of 35K chemical structures for testing the overall performance.

From these pools, we sample the actual data sets for our different networks. By sampling, we can control the relative frequency of different atom types and bond types in the actual data sets. This matters because data imbalance affects the performance of our networks on the different atom and bond types, as we will see in the next section.

7.1 Segmentation Data Set

For the training of the segmentation network ($f_{\mathrm{seg}}$), we need 2D images of chemical structures together with pixelwise labeled target values ($Y_a$, $Y_b$ and $Y_c$). As far as we know, no such data set is available, so we need to construct it ourselves. Moreover, labeling thousands of 2D images of chemical structures pixelwise by hand is not feasible, so we construct an automatic procedure to generate this data set. For the training data set, we sample around 114K chemical compounds in SMILES format from the ChEMBL training pool such that every atom type is present in at least 1000 chemical compounds. Using RDKit [14] in Python, we create the images starting from the SMILES. Furthermore, to create the labeling, we make some modifications to the code of RDKit at the drawing time of the image, so that it additionally produces the labeling information needed to create our data set. The same procedure is used for the validation data set.
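A sketch of this generation step using only the public RDKit drawing API; the full pixelwise labels require the modifications to RDKit's drawing code mentioned above, so this sketch only recovers per-atom pixel coordinates:

from rdkit import Chem
from rdkit.Chem.Draw import rdMolDraw2D

def render_with_atom_coords(smiles, size=300):
    """Render a SMILES to a PNG and return the pixel coordinates of every atom.

    The authors' pixelwise labels (Y_a, Y_b, Y_c) required modifying RDKit's
    drawing code; the public API used here only exposes the draw coordinates
    of the atoms, which is the starting point of the labeling.
    """
    mol = Chem.MolFromSmiles(smiles)
    drawer = rdMolDraw2D.MolDraw2DCairo(size, size)
    drawer.DrawMolecule(mol)
    drawer.FinishDrawing()
    png_bytes = drawer.GetDrawingText()        # the rendered training image
    coords = [drawer.GetDrawCoords(i) for i in range(mol.GetNumAtoms())]
    return png_bytes, [(p.x, p.y) for p in coords]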

7.2 Atom prediction data set

Once the segmentation network is trained and validated, we sample a new data set from the ChEMBL pools and feed it through the segmentation network. The outputs of these runs are saved to create the input data set for the classification networks. As explained in the previous section, the atom prediction network ($f_{\mathrm{atom}}$) additionally expects as input the candidate locations to classify. For the training and validation data sets of $f_{\mathrm{atom}}$, we generate candidate locations based on the true atom locations, but we also add locations where no atom is located, for the prediction of the empty class. For the latter, we take the middle point of every bond in the data set: since no atom is located in the middle of a bond, these locations can serve as empty examples.

7.3 Bond prediction data set

For the bond prediction network ($f_{\mathrm{bond}}$), we apply a similar technique. In addition to the outputs of the segmentation network, the bond prediction network expects the candidate bond locations. For the training and validation data of $f_{\mathrm{bond}}$, we generate these candidates by enumerating all pairs of atoms in a molecule that are less than two times the bond length apart, as sketched below. If there is a bond between a generated pair of atoms, the data set item is labeled with the type of that bond; if there is no bond, the item is labeled as empty.
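A sketch of this candidate generation (plain Python; the names are ours):

from itertools import combinations
import math

def bond_candidates(atom_locations, bond_length):
    """All pairs of atoms less than two bond lengths apart (candidate bonds).

    When building the training set, pairs carrying a true bond are labeled
    with that bond type; all other pairs are labeled as the empty class.
    """
    pairs = []
    for (i, (x1, y1)), (j, (x2, y2)) in combinations(enumerate(atom_locations), 2):
        if math.hypot(x2 - x1, y2 - y1) < 2 * bond_length:
            pairs.append((i, j))
    return pairs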

7.4 Charge prediction data set

For the charge classification network, the same data sets as for the atom prediction can be used, except for the labels: instead of the atom type, the label is now the charge (including the empty charge) of the atom candidate.

8 Experiments and results

For validation, we sample new data sets from the ChEMBL validation pools for the different networks. For the segmentation network, we sample around 12K chemical structures. Fewer chemical structures are needed to validate the classification networks, so we sample only around 450. Starting from these 450 chemical structures, we generate atom and bond candidates. This results in around 27K atom candidates for the atom type and charge classification networks and around 55K bond candidate locations for the bond classification network. With these validation data sets, we measure the performance of the different networks.

8.1 Performance of segmentation network

For the segmentation network ($f_{\mathrm{seg}}$), we measure the F1 score [28] of all pixel predictions for the different atom, bond and charge types. The F1 score takes precision and recall into account equally. If we compare the F1 score with the frequency of the different atom, bond and charge types in the training data set, we clearly see a correlation. The results are summarized in Figure 4.

(a) Atom prediction performance ($f_{\mathrm{seg}}$ and $f_{\mathrm{atom}}$)
(b) Bond prediction performance ($f_{\mathrm{seg}}$ and $f_{\mathrm{bond}}$)
(c) Charge prediction performance ($f_{\mathrm{seg}}$ and $f_{\mathrm{charge}}$)
Figure 4: F1 scores of the segmentation and classification networks. There is a clear correlation between the performance of the networks on the different prediction types and the frequency of the specific type in the training data set. The classification networks perform significantly better than the segmentation network.

8.2 Performance of classification networks

For the classification networks, we again use the F1 score to measure the performance of the atom, bond and charge type classifications. Again, we see a correlation between the F1 score and the frequency of the different types in the training data set. We also observe empirically that the F1 score of the classification networks is significantly higher than that of the segmentation network, so the classification networks do a good job even when the segmentation is not perfect. The performance of these classification networks has to be very good: for every graph prediction, tens of bond and atom classifications have to be made, and errors would otherwise rapidly degrade the overall accuracy. The results are summarized in Figure 4.

8.3 Overall graph accuracy

Now that we know the performance of the different parts, we can combine these building blocks and measure the overall accuracy of the resulting graph predictions. As mentioned in a previous section, the segmentation and classification networks are combined as in Algorithm 1 to build the resulting graph. We generate images in 3 different styles, and for every style we generate 2 sets: one set contains only images of compounds without stereochemical information, while in the other set all compounds have stereochemical information encoded. This results in 6 sets of 1000 images each, on which we measure the performance of our tool ChemGrapher. If the resulting graph contains at least one mistake, we count the graph prediction as incorrect. We use the same sets to measure the performance of OSRA for comparison. The results are summarized in Figure 5. On all sets, we observe a higher accuracy for ChemGrapher than for OSRA.

(a) Error rate for style 1
(b) Error rate for style 2
(c) Error rate for style 3
Figure 5: The graph accuracy of our tool compared with OSRA, measured on images generated in different styles. For each style, 2 experiments are performed: once with images without stereochemical information and once with images with stereochemical information. For each style, the left side shows the error rates and the right side shows an example image in that specific style. For all styles, we measure a lower error rate for our tool ChemGrapher than for OSRA.
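The paper's "at least one mistake" criterion amounts to an exact-match check; the comparison implementation is not specified, but one way to automate it is to compare canonical SMILES, as in this sketch (function names are ours; it assumes the predicted graph has been exported to a MOL block, as discussed in the introduction):

from rdkit import Chem

def graph_is_correct(predicted_molblock, reference_smiles):
    """Exact-match check: a single wrong atom, bond or charge changes the
    canonical SMILES, so any mistake makes the prediction count as incorrect."""
    pred = Chem.MolFromMolBlock(predicted_molblock)
    ref = Chem.MolFromSmiles(reference_smiles)
    if pred is None or ref is None:        # unparseable prediction counts as incorrect
        return False
    return Chem.MolToSmiles(pred) == Chem.MolToSmiles(ref)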

8.4 Case Study: Performance on Journal Article Images

ChemGrapher is intended for images in journal articles, so we would also like to know how well the tool performs on such images. As no such data set is available, we decided to build one manually: we cut out images from journal articles about chemical compounds, preprocess them to the correct input format and feed them to our tool. We manually evaluate the correctness of the resulting graphs to measure the accuracy; if the resulting graph contains at least one mistake, we categorize the prediction as incorrect. The same procedure was executed with OSRA for comparison. The results of this experiment are summarized in Figure 6. Out of a total of 61 images, ChemGrapher predicted 46 correctly, while OSRA predicted 42 correctly. We also observe that ChemGrapher clearly performs better than OSRA on images of compounds with only carbon atoms, for which typically no letters appear in the image. Another observation is that ChemGrapher still has issues when thick lines are used to depict the bonds. We set this as a target for our future work.

Figure 6: Error rate of our tool ChemGrapher on the test set of journal article images, compared with OSRA. The errors show there is still room for improvement in future work.

9 Future Work

To train the segmentation network, we need a pixelwise labeled data set. This kind of data set is not generally available, so we created one with RDKit. A consequence, however, is that the format of the input images is somewhat biased. We have seen in the case study that ChemGrapher performs reasonably, although not equally well, on real images. To handle other kinds of image formats, it might be difficult to find a pixelwise labeled data set to retrain our networks. Therefore, future work could focus on a method that can learn from data that is not labeled pixelwise, where the data only offers a way to verify whether the resulting graph is correct. We could consider this an instance of weakly supervised learning.

10 Conclusion

We presented a method to recognize the graph structure of molecules from 2D images of chemical structures using deep learning. This method learns a model directly from data. We have seen that careful data preparation is crucial: care should be taken to have a balanced data set across the different classes of atoms and bonds. However, even with an imperfectly balanced data set, our deep learning method gives very good results. One requirement for our method to work is that the classification networks have near-perfect accuracy: while the segmentation network can tolerate some errors, for the classification networks every drop in accuracy can have dramatic effects on the overall accuracy. The performance is also clearly better than that of the well-known tool OSRA [8], and our method provides more detailed information about the resulting graph. For our deep learning method to learn accurately, we also had to implement an automatic procedure to pixelwise label 2D images of chemical structures, for which we modified the code of RDKit. This pixelwise labeling is in fact key to linking the atoms and bonds in the resulting graph back to the source image, which makes this deep learning method interpretable rather than a black box. In the context of drug discovery, such tools are important: in general, we see that machine learning is gaining importance in this area and that it contributes to improving the quality of the drug discovery process.

Acknowledgments

MO, AA, YM and JS are funded by (1) Research Council KU Leuven: C14/18/092 SymBioSys3; CELSA-HIDUCTION, (2) Innovative Medicines Initiative: MELLODY, (3) Flemish Government (ELIXIR Belgium, IWT: PhD grants, FWO 06260) and (4) Impulsfonds AI: VR 2019 2203 DOC.0318/1QUATER Kenniscentrum Data en Maatschappij. Computational resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation - Flanders (FWO) and the Flemish Government – department EWI. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

References