Automatically Extracting Action Graphs from Materials Science Synthesis Procedures

Computational synthesis planning approaches have achieved recent success in organic chemistry, where tabulated synthesis procedures are readily available for supervised learning. The syntheses of inorganic materials, however, exist primarily as natural language narratives contained within scientific journal articles. This synthesis information must first be extracted from the text in order to enable analogous synthesis planning methods for inorganic materials. In this work, we present a system for automatically extracting structured representations of synthesis procedures from the texts of materials science journal articles that describe explicit, experimental syntheses of inorganic compounds. We define the structured representation as a set of linked events made up of extracted scientific entities and evaluate two unsupervised approaches for extracting these structures on expert-annotated articles: a strong heuristic baseline and a generative model of procedural text. We also evaluate a variety of supervised models for extracting scientific entities. Our results provide insight into the nature of the data and directions for further work in this exciting new area of research.





1 Introduction

The targeted design and discovery of novel materials remain a key challenge across multiple subfields of chemistry and materials science Curtarolo et al. (2013); Jansen (2015); Sumpter et al. (2015); Gómez-Bombarelli et al. (2016). Accurate, machine-learned predictions of relations between inorganic materials structures and properties have proliferated in tandem with the vast growth of data computed through first-principles methods Meredig et al. (2014); Jain et al. (2013); Kirklin et al. (2015), but the progress in predicting and understanding inorganic materials synthesis is stagnant by comparison, due to the high cost of producing and tabulating new syntheses.

The syntheses of inorganic materials are available almost exclusively as unformatted natural language text contained within journal articles, and this domain-specific text is often non-trivial to parse Kim et al. (2017a). A broadly-applicable technique for extracting structured representations of inorganic synthesis routes is thus a critical step towards realizing a framework which links synthesis parameters to the properties and structures of produced materials.

In this work, we present a system for automatically extracting structured representations of inorganic synthesis routes. We define these representations of synthesis text based on those used by Kiddon et al. (2015). These structures, termed action graphs, consist of a set of nodes connected by edges. Nodes represent operations in the synthesis and the arguments associated with each operation. Edges represent the association of an argument with an operation, or denote an argument as having originated from a given operation. Given synthesis procedure text, we first extract individual events in the synthesis using a neural network entity tagger and a set of dependency parse-based heuristics. We then induce edges to compute the sequence of synthesis events. To accomplish this, we employ a simple yet strong baseline method of linking arguments of an event to the previous event, and a modification of the unsupervised generative model for procedural text proposed by Kiddon et al. (2015).

Our results indicate that the baseline model, which resolves every argument as having arisen in the previous operation, outperforms the generative model at inducing edges between events in all our evaluation settings. The strong performance of the baseline model hints at the strongly sequential nature of inorganic materials synthesis procedures. Our evaluation also suggests that the bottleneck in extracting high-quality action graphs is the event extraction step; our current approach is able to extract about 56% of the participants of an event (i.e., 56% of the nodes in the graphs in our test set).

In the following section, we describe related work. This is followed by a description of the action graph extraction task and the graphs themselves. We follow this with a description of our current extraction pipeline. Finally, we present our results and conclusions.

2 Related work

Materials Science and Chemistry: The rise of comprehensive materials property and reaction databases has accelerated the development of chemical synthesis planning through the use of first-principles and machine-learning computational techniques Kim et al. (2015); Hachmann et al. (2011); Jain et al. (2013); Kirklin et al. (2015); Lawson et al. (2014); Raccuglia et al. (2016). Using chemical reaction records retrieved from a database, Grzybowski et al. showed that it is possible to construct a “universal” graph of chemistry such that molecules and reactions correspond to vertices and edges, respectively Grzybowski et al. (2009): Synthesis action graphs are computed by resolving unitary chemical reactions into action vertices with input and output molecules denoted by directed graph edges. Traversals on this universal chemical reaction graph allow for the optimization of pathways (e.g., for minimizing economic cost) Grzybowski et al. (2009), but methods for predicting novel, highly-structured synthesis pathways have remained elusive until very recently. The work by Coley et al. and Segler et al. investigates two complementary problems by learning on historical chemical reaction databases Coley et al. (2017); Segler et al. (2017). Using a neural-network-driven candidate ranking approach, Coley et al. produce a model for predicting organic reaction outcomes Coley et al. (2017). Conversely, Segler et al. approach the opposite problem, using Monte Carlo tree search to predict a synthesis pathway for a given output molecule. Impressively, the results attained by Segler et al. are shown to perform at a level comparable to human-driven organic molecule synthesis planning Segler et al. (2017).

While significant strides have been made in computational synthesis planning via the prediction of synthetic action graphs in organic chemistry, these approaches have relied significantly on datasets comprised of historical chemical reactions. Despite efforts to standardize the reporting of chemical and materials science data Murray-Rust and Rzepa (1999), inorganic materials synthesis routes continue to reside in non-standard form. Several past studies have applied a variety of text extraction techniques to materials science literature using regular expressions, lexicon matches, or human-driven data labelling to extract synthesis information Hawizy et al. (2011); Jones et al. (2014); Young et al. (2017); Raccuglia et al. (2016); Ghadbeigi et al. (2015), but such approaches are inherently difficult to scale across sub-domains of materials science since new rules and lexicons must be created by domain experts for different types of materials or synthesis techniques. Accordingly, Kim et al. have previously developed automatic methods for extracting aspects of inorganic synthesis routes from natural language text to step in the direction of building comprehensive databases of the syntheses of inorganic materials Kim et al. (2017a, b).

Natural Language Processing (NLP): Extracting action graphs relates closely to the problem of extracting event chains in the NLP literature. Formalized by Schank and Abelson (1977), this line of work defines domain-specific (typically newswire) structured representations of a prototypical sequence of events. Work by Chambers and Jurafsky (2009) extracts chain-of-events structures from newswire text using co-occurrence counts of verb-argument pairs and by clustering verbs and arguments using mutual information-based similarity metrics. Many extensions to this line of work have been proposed Balasubramanian et al. (2013); Pichotta and Mooney (2014); Cheung et al. (2013). These approaches are typically either trained with supervised data, or make many domain-specific assumptions which do not carry over to materials science syntheses. Instead, our work is based on that of Kiddon et al. (2015), who introduce the notion of action graphs as formalizations of event chains for procedural text (cooking recipes) and propose a generative model to extract these structures. This work has also recently been used to extract action graphs from instructional videos and transcripts (Huang et al., 2017).

3 Task definition

(a) Example of typical synthesis procedure text.

(b) Possible partial action graph for text.
Figure 3: Example of a synthesis procedure and the shortened action graph for the synthesis procedure, adapted from Dong et al. (2009). The nodes of the graphs are the operations and arguments, and the edges represent association between event arguments or reference across events. Ellipses and missing arguments are dealt with by adding “implicit argument” nodes. Nodes in gray lack reference edges and represent “raw” nodes.

We aim to extract structured representations of synthesis procedure text, as reported in journal articles, describing inorganic (e.g., hydrothermal, sol-gel, solid state) syntheses of materials. An example synthesis procedure Dong et al. (2009), depicted in Figure 3, may be viewed as a set of events. Each event consists of an operation and a set of arguments: the arguments may be entities such as the conditions for the operation, apparatuses used, and materials involved. The structure we extract from the text is an action graph where the nodes are the operations and their sets of arguments. The edges within an event represent association of an operation with its arguments, and edges between events represent the “flow” of arguments as having originated from certain operations.

Extraction of action graph structures from synthesis text presents a two-fold challenge. First, given sentences from scientific articles, it is difficult to extract the correct set of events and arguments. This is compounded by the fact that most sentences also tend to describe multiple events. Indeed, our data indicate that each sentence contains on average two operations. Second, resolving references of arguments between events is non-trivial. As an example, it is necessary to correctly determine the origin of an intermediate material, referred to as a “black solid,” when constructing the action graph for the synthesis procedure shown in Figure 3. In the present setting, the resolution of references requires some domain knowledge since arguments change physically and chemically between events; in this case, the “black solid” turned into the “black slurry”. Another compounding factor is the presence of ellipses or missing arguments; for example, “sealed” and “maintained” both lack an explicitly mentioned argument, although it is clear from the context that the argument is an “autoclave.”

3.1 Action Graph Formalism

We define action graphs by modifying the definitions in Kiddon et al. (2015). A synthesis procedure is represented by a set of events. Each event is a tuple consisting of an operation and a set of typed arguments. Every typed argument is itself a tuple of a semantic type and the set of string spans from the text which are instances of that argument. The string spans and the operations therefore represent the nodes of the graph. There are two types of edges: “association” edges within events and “reference” edges between events (labeled in Figure 3). An association edge is represented by a 4-tuple whose elements identify the operation and the string span being linked. Similarly, a reference edge is represented by a 5-tuple. Ellipses are handled by introducing “implicit argument” nodes. Nodes for raw materials or explicit apparatus (shown in gray in Figure 3) lack reference edges.
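To make the formalism concrete, the following is a minimal sketch of how events, typed arguments and the two edge types might be held in memory. The class names, field names and the three-element edge tuples are illustrative assumptions; they are a simplification and not the 4-tuples and 5-tuples of the formal definition. The example events are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Argument:
    """A typed argument: a semantic type plus the text spans that realise it."""
    sem_type: str                 # assumed label, e.g. "intermediate" or "apparatus"
    spans: List[str]              # string spans from the text; [] for an implicit argument
    origin: Optional[int] = None  # index of the event it originated from; None for raw nodes


@dataclass
class Event:
    """One synthesis event: an operation and its typed arguments."""
    operation: str
    arguments: List[Argument] = field(default_factory=list)


def association_edges(events: List[Event]) -> List[Tuple[int, int, int]]:
    """(event index, argument index, span index) for every explicit span."""
    return [(i, j, k)
            for i, ev in enumerate(events)
            for j, arg in enumerate(ev.arguments)
            for k, _ in enumerate(arg.spans)]


def reference_edges(events: List[Event]) -> List[Tuple[int, int, int]]:
    """(origin event index, event index, argument index) for resolved arguments."""
    return [(arg.origin, i, j)
            for i, ev in enumerate(events)
            for j, arg in enumerate(ev.arguments)
            if arg.origin is not None]


# Hypothetical two-event procedure: a raw node, then an intermediate
# that refers back to the first event.
events = [
    Event("stirring", [Argument("raw-material", ["solution"])]),
    Event("washed", [Argument("intermediate", ["black solid"], origin=0)]),
]
```

In this sketch, a raw node is simply an argument whose `origin` is left unset, which mirrors the statement that raw nodes lack reference edges.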

4 Extraction pipeline

Synthesis text extraction: Our pipeline begins by extracting paragraph-demarcated raw text from PDF documents, using a method described in more detail by Kim et al. (2017a, b). The plain text articles are extracted using the WatrWorks text extraction tool (WatrWorks, Text Extraction and Annotation Suite). From this text, we extract the set of paragraphs which are likely to contain synthesis procedures. This is accomplished using a logistic regression classifier trained on word embeddings and a set of manually designed features Kim et al. (2017a). The word embeddings were obtained by training the Word2Vec algorithm on a corpus of materials science research papers Kim et al. (2017a). Multiple paragraphs in a given paper labeled by the classifier as having a synthesis procedure are treated as part of the same synthesis procedure. Sentence and token boundaries are determined with ChemDataExtractor Swain and Cole (2016), which is specifically tailored to processing chemistry-related text.

Entity extraction: Given this synthesis text, our next task is to extract entity mentions from the text. We use the term “entity mentions” to denote spans of text that will participate in the experiment, such as black slurry or stirred. We cast this as a supervised task akin to the classic NLP problem of Named Entity Recognition or Entity Extraction, generating training data by manually annotating a small set of papers for this purpose (see Section 5.1.1 for details).

We experiment with the following probabilistic models for entity extraction. Let $x = [x_1, \ldots, x_T]$ be a sentence of input text and $y = [y_1, \ldots, y_T]$ be per-token output tags. Let $D$ be the number of possible labels for each $y_t$. We predict the most likely $y$, given a conditional model $P(y \mid x)$.

We experiment with two factorizations of $P(y \mid x)$. First:

$$P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid F(x)), \tag{1}$$

where tags are conditionally independent given some features $F(x)$ for $x$. These features could be a binary vector representing each token’s membership in e.g. a lexicon, or they could be a dense vector encoded by a deep neural network which takes distributed representations of words as input. In Section 5.1.3 we present experiments on the latter, where the deep neural network is either a bidirectional LSTM (Lample et al., 2016), or a dilated CNN (Strubell et al., 2017).

We also consider a linear-chain CRF model that couples all of $y$ together, enforcing constraints between different labels during prediction:

$$P(y \mid x) = \frac{1}{Z_x} \prod_{t=1}^{T} \psi_t(y_t \mid F(x)) \, \psi_p(y_t, y_{t-1}), \tag{2}$$

where $\psi_t$ is a local factor, $\psi_p$ is a pairwise factor that scores adjacent tags, and $Z_x$ is the partition function (Lafferty et al., 2001). Prediction in this model is performed using the Viterbi algorithm. In Section 5.1.3 we experiment with models where $F(x)$ is encoded by a bidirectional long short-term memory (LSTM), a pre-trained word embedding, a binary vector constructed from hand-engineered features based on linguistic analysis such as syntactic parses of the sentence and part-of-speech tags, and combinations thereof.
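As an illustration of Viterbi prediction in a linear-chain model, the sketch below finds the highest-scoring tag sequence given per-token local scores and a tag-to-tag pairwise score matrix, both in log space. The function name and the list-of-lists inputs are assumptions made for this example; it is a generic textbook decoder, not the implementation used in our experiments.

```python
from typing import List


def viterbi(local: List[List[float]], pairwise: List[List[float]]) -> List[int]:
    """Return the highest-scoring tag sequence.

    local[t][d]    : log-score of tag d at position t (the local factor)
    pairwise[p][d] : log-score of tag d following tag p (the pairwise factor)
    """
    T, D = len(local), len(local[0])
    score = list(local[0])      # best score of any path ending in each tag
    back: List[List[int]] = []  # back[t-1][d] = best previous tag for tag d at t
    for t in range(1, T):
        new_score, ptr = [], []
        for d in range(D):
            best_p = max(range(D), key=lambda p: score[p] + pairwise[p][d])
            ptr.append(best_p)
            new_score.append(score[best_p] + pairwise[best_p][d] + local[t][d])
        score = new_score
        back.append(ptr)
    # Trace back from the best final tag.
    tag = max(range(D), key=lambda d: score[d])
    path = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        path.append(tag)
    return path[::-1]


# Toy scores for three tokens and two tags.
local_scores = [[2.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
no_coupling = [[0.0, 0.0], [0.0, 0.0]]
sticky = [[0.0, -5.0], [-5.0, 0.0]]  # penalise switching tags
```

With `no_coupling` the decoder reduces to picking the best tag per token independently, as in the first factorization; with `sticky` the pairwise scores override the local preference at the middle token.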

Event extraction: Once we have extracted entities, we must combine them into events. Since sentences from the synthesis text often describe multiple events, we break each sentence into separate event phrases by applying heuristic rules to its dependency parse. We obtain dependency parses from the Stanford CoreNLP dependency parser (Manning et al., 2014). The most important of these heuristics splits off every phrase whose head token links to the sentence root with a conj relation. All other tokens are associated with the root and constitute the main phrase. Each split phrase and the main phrase are considered to be events. If no token has a conj relation to the root, the whole sentence is considered a single event. We apply these heuristics only to sentences with multiple operations. Once the sentence segments representing events have been identified, all tokens tagged as operation, raw-material, intermediate or apparatus are extracted and treated as nodes of the corresponding event. Figure 6 illustrates an example. Implicit argument nodes are added to events lacking an argument of the apparatus or intermediate type.

(a) Example result of running Stanford CoreNLP dependency parser.
(b) Tokens grouped by relations to the root token
Figure 6: An example illustrating use of dependency graphs to split a sentence consisting of multiple events into a set of separate phrases for each event. Phrases with the conj relation to the root word are broken off as potentially separate events.
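The conj-based splitting heuristic can be sketched as follows. The input is a hand-written dependency parse given as (token index, head index, relation) triples rather than real parser output, and the function name and data layout are assumptions for illustration only; the other heuristics in the pipeline are not shown.

```python
from typing import Dict, List, Tuple

# Each token is (index, head index, relation); the root token has head -1.
Token = Tuple[int, int, str]


def split_events(tokens: List[Token]) -> List[List[int]]:
    """Group token indices into event phrases using conj links to the root."""
    head = {i: h for i, h, _ in tokens}
    root = next(i for i, h, _ in tokens if h == -1)
    # Heads of the phrases to split off: tokens attached to the root by conj.
    conj_heads = {i for i, h, r in tokens if h == root and r == "conj"}

    def phrase_head(i: int) -> int:
        # Walk up the tree until reaching the root or a conj head.
        while i != root and i not in conj_heads:
            i = head[i]
        return i

    groups: Dict[int, List[int]] = {}
    for i, _, _ in tokens:
        groups.setdefault(phrase_head(i), []).append(i)
    # Order phrases by their first token so events follow text order.
    return sorted(groups.values(), key=lambda g: g[0])


# Hand-written parse of "The mixture was stirred and heated overnight".
parse = [
    (0, 1, "det"),        # The
    (1, 3, "nsubjpass"),  # mixture
    (2, 3, "auxpass"),    # was
    (3, -1, "root"),      # stirred
    (4, 3, "cc"),         # and
    (5, 3, "conj"),       # heated
    (6, 5, "advmod"),     # overnight
]
```

Here `split_events(parse)` yields a main phrase headed by "stirred" and a split phrase headed by "heated"; a sentence without any conj link to the root comes back as a single group.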

Edge induction: Following the extraction of individual events, we induce the reference edges from argument nodes of intermediate materials to operations. These edges denote the operation that gave rise to a given intermediate material. We attempted two approaches to inducing these edges. The first assumes that every intermediate material is derived from the previous operation. This approach forms our baseline (Sequential model) and, as we elaborate in Section 5.2.3, turns out to be a very strong baseline. For example, in Figure 3, the sequential model would link “black solid” to “stirring” and link “black slurry” to “appeared”. The baseline therefore assumes a strong sequential structure for the events and takes text order to be the correct order in resolving entity origins. The second approach is the unsupervised generative model proposed by Kiddon et al. (2015). This model, trained with a hard-EM algorithm, explicitly models the structure of procedural text and attempts to connect each argument to the operation of one of the previous events. However, as we present in Section 5.2.3, the generative model did not perform as well on our dataset, which we attribute mainly to the strongly sequential nature of the data itself. Next, we describe our approach to evaluating the action graphs and the data that we use in the analysis of our results.
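The sequential baseline is simple enough to state in a few lines. The sketch below assumes each event has already been reduced to an (operation, intermediate spans) pair in text order; that representation, the function name and the third event in the example are hypothetical.

```python
from typing import List, Tuple

Event = Tuple[str, List[str]]  # (operation, intermediate-material spans)


def sequential_edges(events: List[Event]) -> List[Tuple[str, str]]:
    """Return (intermediate, source operation) pairs under the sequential assumption."""
    edges = []
    for prev, cur in zip(events, events[1:]):
        prev_operation = prev[0]
        for intermediate in cur[1]:
            # Every intermediate is attributed to the immediately preceding operation.
            edges.append((intermediate, prev_operation))
    return edges


# Illustrative events; "washed" is a made-up third operation.
events = [
    ("stirring", []),
    ("appeared", ["black solid"]),
    ("washed", ["black slurry"]),
]
```

Intermediates in the first event have no preceding operation and therefore receive no reference edge in this sketch.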

5 Experimental Results

We present experimental results on two components of our action graph extraction pipeline: entity extraction, in which we identify mentions of scientific entities in the text, and action graph extraction, the final output of our pipeline.

5.1 Entity Extraction

We evaluate six models for entity extraction: three baseline linear-chain CRF models using different sets of linguistically motivated features, two models which perform logistic regression with token representations encoded by different neural network architectures, and a combined structured neural network model which takes the token representations encoded by a neural network as the logits for a linear-chain CRF (Eqn. 2).


Specifically, the CRF-ling model is a CRF where the feature representation for each token is a binary feature vector encoding various linguistic features of the token and lexicon membership. Linguistic features include: part-of-speech tag, lemma, stem, syntactic dependency label, and whether the token contains numbers and its capitalization. The lexicons (dictionaries) include: stop words (non-content words such as the, or that), entries in ChemDB (J. et al., 2005), and hand-built lists of typical operations, conditions, entities, amounts, descriptors and apparatuses. Annotations were extracted using Stanford CoreNLP and NLTK. The CRF-hand model is a CRF which takes as per-token features the concatenation of its pre-trained word embedding with a binary feature vector indicating whether the token matched any of a set of hand-crafted rules over parse trees. CRF-both uses as features the concatenation of all the features from the CRF-ling and CRF-hand models.

We also evaluate three neural network-based models: a three-layer dilated CNN (DCNN), a bidirectional LSTM (Bi-LSTM), and a bidirectional LSTM used in conjunction with inference in a CRF (Bi-LSTM-CRF). All three models take the same input features for each token: its pre-trained word embedding and a learned embedding representing the word shape (capitalized, all caps, all lowercase, etc.). Rather than performing structured prediction in a CRF, the DCNN and Bi-LSTM models both predict tags using a logistic regression classifier (Eqn. 1) whose features are the token representations encoded by the neural network. Each neural network is trained end-to-end with the classifier. The Bi-LSTM-CRF performs structured inference in a CRF with inputs encoded by a Bi-LSTM. This model is also trained end-to-end, with the partition function and its gradient computed using the forward-backward algorithm.

5.1.1 Data

We manually annotated text extracted using the PDF processing pipeline described in Section 4 from the methods section of 42 materials science papers from a variety of experimental materials science journals. We define a set of 18 entity type labels with types reflecting typical roles present in a synthesis route, such as materials, operations, targets and amounts: target, material, descriptor, amt_unit, cnd_misc, cnd_unit, intermed, operation, number, amt_misc, prop_unit, prop_type, prop_misc, synth_aprt, char_aprt, brand, meta, and ref. See Kim et al. (2017a) for more details. In all, including tokens labeled as non-entities, the dataset at the time of writing consists of just under 10,000 tokens. We divided the 42 papers into 29 training (70%) and 13 test documents. In all models that use optimization or other hyperparameters, we simply use default settings (CRF models) or settings derived from tuning on data from a different domain (neural models).

We use the BILOU (Begin, Inside, Last, Outside, Unit) segment boundary encoding, augmenting each token’s label to indicate its position in the span of tokens making up the entity (for example, the two tokens in the span black solid would have the labels B-intermed, L-intermed). Previous work has found this encoding to result in improved performance (Ratinov and Roth, 2009).
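The encoding can be sketched as a small function that turns entity spans into per-token BILOU tags. The span representation (start token, exclusive end token, entity type) and the function name are assumptions for this example.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, entity type)


def bilou_encode(num_tokens: int, spans: List[Span]) -> List[str]:
    """Convert non-overlapping entity spans into per-token BILOU tags."""
    tags = ["O"] * num_tokens               # Outside by default
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"      # Unit: single-token entity
        else:
            tags[start] = f"B-{label}"      # Begin
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"      # Inside
            tags[end - 1] = f"L-{label}"    # Last
    return tags


# Tokens: The / black / solid / was / stirred
tags = bilou_encode(5, [(1, 3, "intermed"), (4, 5, "operation")])
```

For this sentence the result is `O, B-intermed, L-intermed, O, U-operation`; an Inside tag only appears for entities of three or more tokens.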

5.1.2 Evaluation

We evaluate our entity extraction models using segment-wise precision, recall and F1, the harmonic mean of precision and recall. Since a single entity mention may consist of multiple tokens (for example, the entity black solid consists of the tokens black and solid), we mark a prediction as correct only if every token in the entity is correctly labeled. In other words, there is no partial credit. We compute true positives as correctly predicted entities; false positives are entities whose first token predicted by our system does not align with the first token of a labeled entity; and similarly, false negatives are labeled entities whose first token our system failed to predict.
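A minimal sketch of segment-wise scoring follows. It simplifies the definitions above by treating an entity as correct only on an exact match of boundaries and type, and counting every other predicted or labeled entity as a false positive or false negative respectively; the span representation and function name are assumptions.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, entity type)


def segment_prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Segment-wise precision, recall and F1 with no partial credit."""
    tp = len(gold & pred)                           # exact-match entities
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = {(1, 3, "intermed"), (4, 5, "operation")}
# The system found "stirred" but only tagged "black" instead of "black solid".
pred = {(1, 2, "intermed"), (4, 5, "operation")}
p, r, f = segment_prf(gold, pred)
```

The truncated span earns nothing, so precision, recall and F1 are all 0.5 in this example.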

5.1.3 Results

Table 3 presents our entity extraction results. Table 3(a) lists precision, recall and F1 achieved by each of the models. Among the CRF models using linguistic and hand-engineered features, we find that the CRF-hand model, which uses hand-engineered features over parse trees combined with word embeddings, outperforms the CRF-ling model, which is given a wide array of linguistic features, including parse information. CRF-both, which combines both sets of features, outperforms each individual model, achieving an overall F1 of 73.52. Whereas the individual models are more precision-biased, CRF-both achieves higher recall than either of the individual models though its precision suffers slightly, leading to the best overall F1 among these models.

All of the neural network-based models outperform the best feature-based CRF model, even those that predict using simple logistic regression. Among the models using logistic regression, the DCNN performs the best with an F1 of 77.50. The best performing model of all, by an insignificant margin, is the Bi-LSTM-CRF with an F1 score of 77.57, though its performance is comparable with the DCNN. This result is consistent with those reported in Strubell et al. (2017), which show that the DCNN is a higher quality neural model for encoding rich token representations which incorporate wide context, and which thus do not require the structure from a CRF in order to form accurate predictions.

Model          Precision   Recall   F1
CRF-ling       76.98       67.41    71.88
CRF-hand       75.59       69.32    72.48
CRF-both       74.97       72.12    73.52
DCNN           77.85       77.16    77.50
Bi-LSTM        74.25       77.83    76.00
Bi-LSTM-CRF    74.64       80.74    77.57
(a) Precision, recall and F1 scores of the entity extraction models. Neural network models outperformed feature-engineered CRFs across the board.

Label        CRF-both   Bi-L-CRF   DCNN    Support
target       27.91      32.65      36.36   36
material     74.19      80.16      82.11   95
descriptor   57.58      62.03      62.64   64
amt_unit     83.48      83.48      83.48   45
cnd_misc     69.23      74.63      72.73   42
cnd_unit     95.10      94.52      93.06   91
intermed     64.58      73.91      75.12   29
operation    80.95      82.76      82.55   146
number       87.71      91.89      88.67   114
(b) Breakdown of F1 score by label in the CRF-both, DCNN and Bi-LSTM-CRF models (abbreviated Bi-L-CRF) for labels with more than 10 annotated entities.
Table 3: Evaluation of the entity extraction models

Table 3(b) lists F1 scores for the CRF-both, Bi-LSTM-CRF and DCNN models broken down by label, for labels which occurred more than ten times in the test data, along with the occurrence count for each label. There is a clear trend: all the models tend to perform better on labels which have more support (we expect the distribution in the test data to mimic that of the training data). For labels with fewer examples, such as target and intermed, the DCNN tends to perform better.

5.2 Action Graph Extraction

5.2.1 Data

Our current dataset for this task consists of 240 materials science journal articles which were acquired using publisher-approved APIs and text-mining interfaces. Details of this process are described in Kim et al. (2017a). Fifteen of these articles were annotated with the action graph structures. The 15 articles were selected to evenly represent different sub-types of inorganic syntheses (e.g., hydrothermal, sol-gel, solid state), and to ensure that annotated articles contained explicit synthesis descriptions of inorganic materials. Annotation was performed using an in-house web application.

The extraction pipeline was run on the entire set of 240 documents. Edges were induced using the generative model described by Kiddon et al. (2015) and using the sequential baseline model. We evaluate both approaches on the predicted graphs for the 15 annotated test articles. Next we describe our evaluation strategy for the action graphs.

5.2.2 Evaluation

Evaluating the extracted graphs involves comparing predicted and ground truth graphs. Our evaluation strategy is a two-step process. Given the two graphs, we first align the sets of nodes in both graphs. Once aligned, we score each predicted edge by checking for its presence in the annotation, and compute an F1 score.

The alignments of nodes are made by means of exact token index matches between the nodes in the predicted graph and the annotated graphs. Using these alignments we report the fraction of nodes in the predicted graph which were aligned and unaligned as alignment scores. Since we match nodes based on token indices in the source text, we currently do not align implicit argument nodes, marking them as unaligned.

Given aligned nodes, we find true positive, false positive and false negative edges. We define these metrics as follows: True positives are edges present in the predicted graph and the annotated graph; false positives are edges present in the predicted graph but not in the annotated graph; and false negatives are edges not present in the predicted graph but present in the annotated graph. Based on these metrics we compute micro-averaged precision, recall and F1 on our test set.
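The edge-level scoring can be sketched as below. The example assumes nodes are identified by their token index in the source text, so that alignment amounts to equality of indices, and represents an edge as a pair of such indices; these representations, the function name and the toy graphs are hypothetical.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[int, int]  # (token index of one node, token index of the other)


def micro_prf(
    graph_pairs: Iterable[Tuple[Set[Edge], Set[Edge]]],
) -> Tuple[float, float, float]:
    """Micro-averaged edge precision, recall and F1 over (predicted, annotated) pairs."""
    tp = fp = fn = 0
    for predicted, annotated in graph_pairs:
        tp += len(predicted & annotated)   # in both graphs
        fp += len(predicted - annotated)   # predicted only
        fn += len(annotated - predicted)   # annotated only
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two toy documents: the first has one wrong and one missed edge.
pairs = [
    ({(3, 7), (7, 12)}, {(3, 7), (7, 15)}),
    ({(2, 5)}, {(2, 5)}),
]
p, r, f = micro_prf(pairs)
```

Because the counts are pooled across documents before dividing, documents with more edges carry more weight, which is what distinguishes micro-averaging from averaging per-document scores.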

5.2.3 Results

We evaluate the induced edges when the nodes have been generated from running our entire pipeline end to end (End-to-end evaluation in Table 4) and also in the case where the nodes have been induced from the annotations (Perfect node segmentation in Table 4). The second case simulates the case of having a perfect set of operation and argument nodes. We do this so as to analyze errors made by different stages of the pipeline (i.e., the event extraction and the edge induction models).

For both of the above cases, we also perform evaluations under two settings. In the first setting we ignore edges between unaligned nodes; we call this Setting 1 in Table 4. In the second setting we systematically penalize all edges which have one or both nodes unaligned in the predicted graph by counting them as false positives; we call this Setting 2 in Table 4. We do not distinguish between reference and association edges.

                                     Setting 1                      Setting 2
Model           Aligned   Unaligned  Precision   Recall   F1       Precision   Recall   F1
End-to-end evaluation
Sequential      39.85%    30.95%     73.04       94.38    82.35    27.10       27.91    27.50
Probabilistic   39.85%    30.95%     68.38       89.89    77.67    25.81       26.58    26.19
Perfect node segmentation
Sequential      63.80%    0%         99.29       99.29    99.29    99.29       92.36    95.70
Probabilistic   63.80%    0%         95.36       95.36    95.36    95.36       88.70    91.91
Table 4: Evaluation of action graph extraction in terms of precision, recall and F1. End-to-end evaluation uses entities as predicted by our system, while perfect node segmentation uses the annotated entities. Setting 1 ignores all edges where at least one node is unaligned. Setting 2 penalizes edges where at least one node is unaligned as false positives.

The strong performance of the sequential model in both evaluation settings, in the end-to-end and the perfect node segmentation cases, indicates the strongly sequential nature of the data: almost all intermediates derive directly from the previous operation. This is particularly apparent in the perfect node segmentation cases, where the baseline has F1 scores greater than 0.95.

The alignment scores in the end-to-end case in both evaluation settings hint at the problems in the extraction of the action graph structures. The results above indicate that we currently extract about 56% of all non-implicit argument nodes in the predicted graphs (explicit nodes aligned: 39.85 / (39.85 + 30.95) ≈ 0.56). The major challenge in this extraction task seems to be identifying individual operations and their arguments. Next we present some conclusions and approaches we plan to pursue in future work.

6 Conclusions

In this work, we present models for extracting action graph structures from materials science synthesis procedures without access to any labeled target structures. Our experimental results highlight that (1) neural network models with word embedding features outperform classic linear-chain CRF models with manually designed features by roughly 2.5 to 4 F1 points on the NER task on materials science text, despite a small training dataset; (2) merely resolving every intermediate as having arisen from the previous operation leads to very strong scores on our current dataset and evaluation metrics in the action graph extraction task; and (3) the major hurdle to extracting action graphs from the synthesis text is the accurate identification of its operations and arguments. Future work will explore improving entity and event extraction despite data scarcity, modeling event extraction as e.g. unsupervised n-ary relation extraction (Verga et al., 2016), and end-to-end training of a single neural network which jointly models entities, events and actions.


Acknowledgments

We thank Kathryn Ricci and Zach Jensen for help with annotation and pipeline engineering. The authors would also like to acknowledge funding from the National Science Foundation Awards 1534340 (DMREF) and IIS-1514053, support from the Office of Naval Research (ONR) under Contract N00014-16-1-2432, the MIT Energy Initiative, the UMass Center for Data Science and the Center for Intelligent Information Retrieval, and partial support from the Chan Zuckerberg Initiative under the Scientific Knowledge Base Construction project. E.K. was partially supported by NSERC, and E.S. was supported in part by an IBM Ph.D. Fellowship Award. The authors would also like to acknowledge support from seven major publishers who provided the substantial content required for our analysis.


References

  • Curtarolo et al. [2013] Stefano Curtarolo, Gus LW Hart, Marco Buongiorno Nardelli, Natalio Mingo, Stefano Sanvito, and Ohad Levy. The high-throughput highway to computational materials design. Nature materials, 12(3):191–201, 2013.
  • Jansen [2015] Martin Jansen. Conceptual inorganic materials discovery–a road map. Advanced Materials, 27(21):3229–3242, 2015.
  • Sumpter et al. [2015] Bobby G Sumpter, Sergei V Kalinin, and Richard K Archibald. Big-deep-smart data in imaging for guiding materials design. Nature materials, 14(10), 2015.
  • Gómez-Bombarelli et al. [2016] Rafael Gómez-Bombarelli, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, David Duvenaud, Dougal Maclaurin, Martin A Blood-Forsythe, Hyun Sik Chae, Markus Einzinger, Dong-Gwang Ha, Tony Wu, et al. Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach. Nature materials, 15(10):1120–1127, 2016.
  • Meredig et al. [2014] Bryce Meredig, Ankit Agrawal, Scott Kirklin, James E Saal, JW Doak, Alan Thompson, Kunpeng Zhang, Alok Choudhary, and Christopher Wolverton. Combinatorial screening for new materials in unconstrained composition space with machine learning. Physical Review B, 89(9):094104, 2014.
  • Jain et al. [2013] Anubhav Jain, Shyue Ping Ong, Geoffroy Hautier, Wei Chen, William Davidson Richards, Stephen Dacek, Shreyas Cholia, Dan Gunter, David Skinner, Gerbrand Ceder, et al. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Materials, 1(1):011002, 2013.
  • Kirklin et al. [2015] Scott Kirklin, James E Saal, Bryce Meredig, Alex Thompson, Jeff W Doak, Muratahan Aykol, Stephan Rühl, and Chris Wolverton. The Open Quantum Materials Database (OQMD): assessing the accuracy of DFT formation energies. npj Computational Materials, 1:15010, 2015.
  • Kim et al. [2017a] Edward Kim, Kevin Huang, Alex Tomala, Sara Matthews, Emma Strubell, Adam Saunders, Andrew McCallum, and Elsa Olivetti. Machine-learned and codified synthesis parameters of oxide materials. Nature Scientific Data, 4, 2017a.
  • Kiddon et al. [2015] Chloé Kiddon, Ganesa Thandavam Ponnuraj, Luke S. Zettlemoyer, and Yejin Choi. Mise en place: Unsupervised interpretation of instructional recipes. In EMNLP, 2015.
  • Kim et al. [2015] Sunghwan Kim, Paul A Thiessen, Evan E Bolton, Jie Chen, Gang Fu, Asta Gindulyte, Lianyi Han, Jane He, Siqian He, Benjamin A Shoemaker, et al. PubChem substance and compound databases. Nucleic acids research, 44(D1):D1202–D1213, 2015.
  • Hachmann et al. [2011] Johannes Hachmann, Roberto Olivares-Amaya, Sule Atahan-Evrenk, Carlos Amador-Bedolla, Roel S Sánchez-Carrera, Aryeh Gold-Parker, Leslie Vogt, Anna M Brockway, and Alán Aspuru-Guzik. The Harvard Clean Energy Project: large-scale computational screening and design of organic photovoltaics on the World Community Grid. The Journal of Physical Chemistry Letters, 2(17):2241–2251, 2011.
  • Lawson et al. [2014] Alexander J Lawson, Jürgen Swienty-Busch, Thibault Géoui, and David Evans. The making of reaxys—towards unobstructed access to relevant chemistry information. In The Future of the History of Chemical Information, pages 127–148. ACS Publications, 2014.
  • Raccuglia et al. [2016] Paul Raccuglia, Katherine C Elbert, Philip DF Adler, Casey Falk, Malia B Wenny, Aurelio Mollo, Matthias Zeller, Sorelle A Friedler, Joshua Schrier, and Alexander J Norquist. Machine-learning-assisted materials discovery using failed experiments. Nature, 533(7601):73–76, 2016.
  • Grzybowski et al. [2009] Bartosz A Grzybowski, Kyle JM Bishop, Bartlomiej Kowalczyk, and Christopher E Wilmer. The 'wired' universe of organic chemistry. Nature Chemistry, 1(1):31–36, 2009.
  • Coley et al. [2017] Connor W Coley, Regina Barzilay, Tommi S Jaakkola, William H Green, and Klavs F Jensen. Prediction of organic reaction outcomes using machine learning. ACS central science, 3(5):434–443, 2017.
  • Segler et al. [2017] Marwin HS Segler, Mike Preuss, and Mark P Waller. Learning to plan chemical syntheses. arXiv preprint arXiv:1708.04202, 2017.
  • Murray-Rust and Rzepa [1999] Peter Murray-Rust and Henry S Rzepa. Chemical markup, XML, and the World-Wide Web. 1. Basic principles. Journal of Chemical Information and Computer Sciences, 39(6):928–942, 1999.
  • Hawizy et al. [2011] Lezan Hawizy, David M Jessop, Nico Adams, and Peter Murray-Rust. ChemicalTagger: A tool for semantic text-mining in chemistry. Journal of cheminformatics, 3(1):17, 2011.
  • Jones et al. [2014] David E Jones, Sean Igo, John Hurdle, and Julio C Facelli. Automatic extraction of nanoparticle properties using natural language processing: NanoSifter an application to acquire PAMAM dendrimer properties. PLoS One, 9(1):e83932, 2014.
  • Young et al. [2017] Steven R Young, Artem Maksov, Maxim Ziatdinov, Ye Cao, Matthew Burch, Janakiraman Balachandran, Linglong Li, Suhas Somnath, Robert M Patton, Sergei V Kalinin, et al. Data mining for better material synthesis: the case of pulsed laser deposition of complex oxides. arXiv preprint arXiv:1710.07721, 2017.
  • Ghadbeigi et al. [2015] Leila Ghadbeigi, Jaye K Harada, Bethany R Lettiere, and Taylor D Sparks. Performance and resource considerations of Li-ion battery electrode materials. Energy & Environmental Science, 8(6):1640–1650, 2015.
  • Kim et al. [2017b] Edward Kim, Kevin Huang, Adam Saunders, Andrew McCallum, Gerbrand Ceder, and Elsa Olivetti. Materials synthesis insights from scientific literature via text extraction and machine learning. Chemistry of Materials, 2017b.
  • Schank and Abelson [1977] Roger C. Schank and Robert P. Abelson. Scripts, Plans, Goals and Understanding: an Inquiry into Human Knowledge Structures. L. Erlbaum, 1977.
  • Chambers and Jurafsky [2009] Nathanael Chambers and Dan Jurafsky. Unsupervised learning of narrative schemas and their participants. In Association for Computational Linguistics (ACL), 2009.
  • Balasubramanian et al. [2013] Niranjan Balasubramanian, Stephen Soderland, Mausam, and Oren Etzioni. Generating coherent event schemas at scale. In EMNLP, pages 1721–1731, 2013.
  • Pichotta and Mooney [2014] Karl Pichotta and Raymond J. Mooney. Statistical script learning with multi-argument events. In European Chapter of the Association for Computational Linguistics (EACL), April 2014.
  • Cheung et al. [2013] Jackie Chi Kit Cheung, Hoifung Poon, and Lucy Vanderwende. Probabilistic frame induction. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), June 2013.
  • Huang et al. [2017] De-An Huang, Joseph J. Lim, Fei-Fei Li, and Juan Carlos Niebles. Unsupervised visual-linguistic reference resolution in instructional videos. CoRR, abs/1703.02521, 2017.
  • Dong et al. [2009] Yuming Dong, Hongxiao Yang, Kun He, Shaoqing Song, and Aimin Zhang. β-MnO2 nanowires: a novel ozonation catalyst for water treatment. Applied Catalysis B: Environmental, 85(3):155–161, 2009.
  • Swain and Cole [2016] Matthew C. Swain and Jacqueline M. Cole. ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature. Journal of Chemical Information and Modeling, 56(10):1894–1904, 2016.
  • Lample et al. [2016] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition. In NAACL, 2016.
  • Strubell et al. [2017] Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. Fast and Accurate Entity Recognition with Iterated Dilated Convolutions. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, September 2017.
  • Lafferty et al. [2001] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 282–289, 2001.
  • Manning et al. [2014] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, 2014.
  • Chen et al. [2005] J. Chen, S. J. Swamidass, Y. Dou, J. Bruand, and P. Baldi. ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics, 21(22):4133–9, 2005.
  • Ratinov and Roth [2009] Lev Ratinov and Dan Roth. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 147–155. Association for Computational Linguistics, 2009.
  • Verga et al. [2016] Patrick Verga, David Belanger, Emma Strubell, Benjamin Roth, and Andrew McCallum. Multilingual relation extraction using compositional universal schema. In HLT-NAACL, 2016.