Automated scientific document processing (SDP) deploys natural language processing on scholarly articles, which are long-form, complex documents with conventional structure and cross-references to external resources. Representative SDP tasks include: parsing embedded reference strings; identifying the importance, sentiment and provenance of citations; identifying logical sections and markup; parsing equations, figures and tables; and summarization. SDP tasks, in turn, help downstream systems and assist scholars in finding relevant documents and managing their knowledge discovery and utilization workflows. Next-generation introspective digital libraries such as Semantic Scholar Ammar et al. (2018) have begun to incorporate such services.
While natural language processing (NLP) in general has seen tremendous progress with the introduction of neural network architectures and of general toolkits and datasets to leverage them, their deployment for SDP is still limited. Over the past few years, many open-source software packages have accelerated the development of state-of-the-art NLP models. Hugging Face’s transformer models Wolf et al. (2019) and AllenNLP Gardner et al. (2017) are general-purpose frameworks that have produced state-of-the-art models for natural language understanding tasks. However, these and many other tools do not provide comprehensive access to pre-trained scientific document processing models.
A key barrier to entry is accessibility: a non-trivial amount of expertise in NLP and machine learning methodologies is a prerequisite, which many scholars who wish to deploy SDP lack and have no interest in obtaining. Thus there is a clear need for a toolkit that packages easy access to pre-trained, state-of-the-art models, while also allowing researchers to experiment with models rapidly to create deployable applications.
We introduce SciWING to close this gap. Built on top of PyTorch and under active development, it provides easy access to modern neural network models trained for a growing number of SDP tasks, which practitioners can easily deploy on their documents. For researchers, these models serve as baselines for experimentation and as the basis for constructing more complex architectures in a modular manner. SciWING supports swapping different neural network modules, allowing researchers to declare models in a configuration file without having to write programming code.
2 System Overview
Our view is that SDP-specific considerations are best embodied as an abstract layer over existing NLP frameworks. SciWING incorporates the generic NLP pipeline AllenNLP Gardner et al. (2017), developing models on top of it, while using the transformers package to enable transfer learning via its pre-trained general-purpose transformers such as BERT Devlin et al. (2019) and SDP-specific ones such as SciBERT Beltagy et al. (2019).
SciWING builds with Python 3.7 and is provisioned as a package available on the Python Packaging Index (PyPI), supporting installation via pip install sciwing. This downloads its PyTorch and AllenNLP library dependencies. For users aiming to develop the library further, SciWING comes with installation tools that set up the system, alongside extensive in-lined code documentation, and explanatory tutorials.
SciWING separates Dataset, Model and Engine components (Fig. 1), facilitating their flexible re-configuration:
Datasets: There are many challenges for the researcher-practitioner who wants to experiment with different SDP tasks. First, researchers often face the challenge of handling various dataset formats: for reference string parsing, the CoNLL format is most common; for text classification, the CSV format is most common. SciWING enables reading of dataset files in different formats and also facilitates the easy download of open datasets using command-line interfaces. For example, sciwing download scienceie downloads the official openly available dataset for the ScienceIE task. Additionally, pre-processing is cumbersome and error-prone. It becomes complex when different models require different tokenisation and numericalisation methods. SciWING unifies these various input formats through a pipeline of pre-processing, tokenisation and numericalisation, providing automatic means to pre-process via Tokenisers and Numericalisers. SciWING also handles batching and padding of examples.
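The pre-processing steps named above (tokenisation, numericalisation, padding of a batch) can be illustrated with a minimal, standard-library sketch. The function names and the `<pad>`/`<unk>` conventions below are assumptions made for illustration; they are not SciWING's Tokeniser or Numericaliser classes.

```python
from typing import Dict, List

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def tokenise(text: str) -> List[str]:
    """Whitespace tokenisation; real tokenisers are usually more careful."""
    return text.split()


def build_vocab(corpus: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to each token, reserving 0 for padding and 1 for unknowns."""
    vocab = {PAD: 0, UNK: 1}
    for tokens in corpus:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def numericalise(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Replace each token by its id, falling back to the unknown id."""
    return [vocab.get(token, vocab[UNK]) for token in tokens]


def pad_batch(batch: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch]


lines = ["Lample et al. 2016", "Devlin 2019"]
tokenised = [tokenise(line) for line in lines]
vocab = build_vocab(tokenised)
batch = pad_batch([numericalise(tokens, vocab) for tokens in tokenised])
print(batch)
```

Keeping the three steps as separate functions mirrors the idea that different models can swap in their own tokenisation or numericalisation without touching the rest of the pipeline.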
Models: The following two subcomponents are combined to build an instance of a neural network model. The models are PyTorch-based classes.
Embedders: Modern NLP models represent natural language tokens as continuous vectors, called embeddings. SciWING abstracts this concept via Embedders. Generic (non-SDP specific) embeddings such as GloVe Pennington et al. (2014) are natively provided. Tokens in scientific documents can benefit from special attention, as most are missing from pre-trained embeddings. SciWING includes task-specific trained embeddings for reference strings Prasad et al. (2018). Via the transformers package, SciWING supports contextual word embeddings: BERT Devlin et al. (2019), SciBERT Beltagy et al. (2019) and ELMo Peters et al. (2018). State-of-the-art models are easily built by concatenating multiple representations, via SciWING’s ConcatEmbedders module. For example, word and character embeddings are combined in NER models Lample et al. (2016), and multiple contextual word embeddings are combined in various clinical and Bio-NLP tasks Zhai et al. (2019).
Neural Network Encoders: SciWING provides commonly used neural network components that can be composed to form neural architectures for different tasks. For example, in text classification, encoding an input sentence as a vector using an LSTM is a common operation (SciWING’s seq2vecencoder). Another common operation is obtaining a sequence of hidden states for a set of tokens, often used in sequence labelling tasks (SciWING’s Lstm2seq).
SciWING has generic linear classification and sequence labelling with CRF heads that can be attached to the encoders to build the final model. It provides pretrained state-of-the-art models for particular SDP tasks that work out-of-the-box for prediction or which can be further fine-tuned.
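To make the idea of concatenating multiple representations concrete, here is a minimal sketch using plain Python lists in place of tensors. The two toy embedders and the `concat_embedders` helper are hypothetical and only illustrate the concatenation along the feature dimension; they do not reproduce the ConcatEmbedders module.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Embedder = Callable[[Sequence[str]], List[Vector]]


def word_embedder(tokens: Sequence[str]) -> List[Vector]:
    """Toy 2-d 'word' features: token length and a capitalisation flag."""
    return [[float(len(t)), float(t[:1].isupper())] for t in tokens]


def char_embedder(tokens: Sequence[str]) -> List[Vector]:
    """Toy 1-d 'character' feature: fraction of digits in the token."""
    return [[sum(c.isdigit() for c in t) / len(t)] for t in tokens]


def concat_embedders(embedders: Sequence[Embedder]) -> Embedder:
    """Build an embedder whose output is the per-token concatenation of its parts."""

    def embed(tokens: Sequence[str]) -> List[Vector]:
        per_embedder = [embedder(tokens) for embedder in embedders]
        # zip(*...) groups the vectors belonging to the same token together
        return [sum(vectors, []) for vectors in zip(*per_embedder)]

    return embed


embed = concat_embedders([word_embedder, char_embedder])
vectors = embed(["Vol.", "19"])
print(vectors)
```

Each token ends up with a single vector whose dimensionality is the sum of the dimensionalities of the individual embedders, which is what a downstream encoder would consume.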
Engine: SciWING handles all the boilerplate code to train the model, monitor the loss and metrics, and checkpoint parameters at different stages of training, validation and testing. It helps researchers adopt best practices, such as clipping gradient norms and saving and deploying the best-performing models. SciWING’s experimentation framework is flexible, and users can choose different options for the following:
Optimisers: SciWING supports all the optimisers supported by PyTorch. It also supports various learning rate schedulers which dynamically manage learning rates, based on performance on the validation dataset.
Experiment Logging: Logging the hyper-parameters of the model is a tedious task. However, many online logging tools have alleviated the difficulty of managing experiments. SciWING writes logs for every experiment run and also facilitates cloud-based experiment logging and corresponding charting of relevant metrics via the third-party API service Weights and Biases. Integrating alternative services for experiment logging is work in progress.
Metrics: Different SDP tasks require their respective metrics. SciWING abstracts a separate Metrics module to select appropriate metrics for each task. SciWING includes PrecisionRecallFMeasure, suitable for text classification tasks, TokenClassificationAccuracy, suitable for sequence labelling, and official shared task metrics such as the CoNLL-2003 evaluation metric (based on the official script from CoNLL) for sequence labelling.
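As an illustration of what a precision, recall and F-measure metric computes, the sketch below derives per-class scores from gold and predicted labels. It is a generic, standard-library formulation; the function name and label values are assumptions and do not reflect the interface of SciWING's PrecisionRecallFMeasure.

```python
from collections import Counter
from typing import Dict, List, Tuple


def precision_recall_f1(
    gold: List[str], predicted: List[str]
) -> Dict[str, Tuple[float, float, float]]:
    """Return (precision, recall, F1) for every label seen in gold or predicted."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g should have been predicted but was missed

    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


gold = ["author", "author", "title", "year"]
predicted = ["author", "title", "title", "year"]
scores = precision_recall_f1(gold, predicted)
for label, (p, r, f) in scores.items():
    print(f"{label}: P={p:.2f} R={r:.2f} F={f:.2f}")
```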
With these components given, SciWING’s Inference middleware provides clear abstractions to perform inference once models are trained. The layer runs predictions on the test dataset, user inputs and files. Such abstractions also act as an interface for the development of upstream REST APIs and command-line applications.
2.1 Configuration using TOML files
A defining feature of SciWING is its use of a declarative TOML configuration file (a format commonly used for Python applications). This enables users to declare the dataset, model architecture and experiment hyper-parameters in a single place. SciWING parses the TOML file and creates appropriate instances of the dataset, model and other experiment hyper-parameters.
A simple configuration file for reference string parsing is shown in Figure 2 along with its equivalent model declaration in Python. The class declaration in Python and the configuration file have a one-to-one correspondence. As deep learning models are made of multiple modules, SciWING adopts a strategy to automatically instantiate these submodules as needed. SciWING constructs a Directed Acyclic Graph (DAG) from the model definition to achieve this. The DAG’s topological ordering is used to instantiate the different submodules to form the final model.
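The DAG-based instantiation can be sketched with Python's standard `graphlib` module (Python 3.9+). The dictionary below stands in for what a TOML parser might return; its keys (`class`, `needs`) and the class names are hypothetical and are not SciWING's configuration schema. The point is only that a topological order guarantees every submodule is built before the module that depends on it.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Tuple

# Hypothetical parsed configuration: each entry names the submodules it needs.
config = {
    "model": {"class": "Tagger", "needs": ["encoder"]},
    "encoder": {"class": "SequenceEncoder", "needs": ["embedder"]},
    "embedder": {"class": "WordEmbedder", "needs": []},
}


def instantiate(config: Dict[str, dict]) -> Tuple[List[str], Dict[str, dict]]:
    """Build every declared module after all of its dependencies."""
    # TopologicalSorter expects a mapping from node to its predecessors.
    graph = {name: set(spec["needs"]) for name, spec in config.items()}
    order = list(TopologicalSorter(graph).static_order())

    built: Dict[str, dict] = {}
    for name in order:
        spec = config[name]
        # Dependencies are guaranteed to exist already because of the ordering.
        children = {dep: built[dep] for dep in spec["needs"]}
        # A plain dict stands in for a real object constructed from spec["class"].
        built[name] = {"class": spec["class"], "children": children}
    return order, built


order, built = instantiate(config)
print(order)
```

If the declaration contained a cycle, `TopologicalSorter` would raise `graphlib.CycleError`, which is a convenient way to reject ill-formed model definitions early.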
2.2 Command Line Interface
Qualitatively analysing the results of the model by drilling down to certain training and development instances can be telling and help to diagnose performance issues. SciWING provides an interactive inspection of the model for this reason through a command-line interface (CLI). Consider the task of reference string parsing: the confusion matrix for the different classes can be displayed through the provided CLI utility, which also allows finer-grained introspection of (Precision, Recall, F-measure) metrics and the viewing of error instances where one class is confused for another.
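The kind of drill-down described here can be approximated in a few lines: build a confusion matrix from gold and predicted labels, then list the instances where one class was mistaken for another. This is a minimal sketch with made-up tokens and labels; the helper names are hypothetical and unrelated to the actual CLI implementation.

```python
from collections import defaultdict
from typing import Dict, List


def confusion_matrix(gold: List[str], predicted: List[str]) -> Dict[str, Dict[str, int]]:
    """matrix[g][p] counts how often gold label g was predicted as p."""
    matrix: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for g, p in zip(gold, predicted):
        matrix[g][p] += 1
    return matrix


def confused_instances(
    tokens: List[str],
    gold: List[str],
    predicted: List[str],
    gold_label: str,
    predicted_label: str,
) -> List[str]:
    """Return the tokens whose gold label was gold_label but were predicted as predicted_label."""
    return [
        token
        for token, g, p in zip(tokens, gold, predicted)
        if g == gold_label and p == predicted_label
    ]


tokens = ["Lample", "2016", "Neural", "architectures"]
gold = ["author", "year", "title", "title"]
predicted = ["author", "year", "author", "title"]

matrix = confusion_matrix(gold, predicted)
errors = confused_instances(tokens, gold, predicted, "title", "author")
print(dict(matrix["title"]), errors)
```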
SciWING provides commands to run experiments from the configuration file, aiding replication. For example, if the experiments are declared in a file named experiment.toml, then they can be run with the command sciwing run experiment.toml. This runs the experiment, saving the best model. Inference is then invoked via sciwing test experiment.toml, which deploys the best model against the test dataset and displays the resultant metrics.
2.3 End User Interfaces
An API service enables easy development of various graphical user interfaces. SciWING currently exposes APIs for reference string parsing and citation intent classification using fastapi (https://fastapi.tiangolo.com/).
The API enables the following application families downstream:
Web demonstrations provide quick access to predictions from state-of-the-art models, fulfilling one key aim of SciWING. Prespecified data can be chosen or user data can be entered and quickly processed using the distributed models (as in Figure 6).
Programmatic Interfaces in SciWING provision more advanced usage. Users can make predictions for data stored in a file or fine-tune a model on their own data. For example, if the user wants to parse citations stored in a text file, SciWING provides a NeuralParscit class with easy methods to parse all the strings in the file and store the results in a new file.
3 Example Tasks
SciWING includes examples for various tasks which find widespread use in scientific document processing. The examples demonstrate how to use the framework effectively. The models have verified performance levels that closely match the originally reported results. They can be used as baselines for further research.
Reference String Parsing identifies the components of a reference string that corresponds to an in-document citation: author, journal and year of publication, among 13 classes. Neural network methods that treat reference string parsing as a sequence labelling task, combining a bidirectional LSTM with a CRF, show state-of-the-art performance Prasad et al. (2018). SciWING’s distributed model implements the same architecture and additionally uses ELMo embeddings.
ScienceIE identifies typed keyphrases, originally from chemical documents: Task keyphrases denote the end task or goal, Material keyphrases indicate any chemical or dataset used by the scientific work, and Process keyphrases include any scientific model or algorithm. The state-of-the-art system from 2017 includes a bidirectional LSTM with a CRF and uses language model embeddings Ammar et al. (2017). SciWING includes a reference implementation substituting modern ELMo embeddings for the original language model representation.
Logical Structure Recovery identifies the logical sections of a document: introduction, related work, methodology, and experiments. This routes the relevant, targeted text to downstream tasks such as summarization and citation intent classification. Currently, there are no neural network methods for this task, so SciWING’s models can serve as baselines for future research.
Other tasks and datasets are being actively developed. We envision the development of additional models will be swift and easy.
4 Use Case: Reference String Parsing
We illustrate how to construct SciWING models, building up to the state-of-the-art model through simple modifications. This also facilitates ablation studies, a common part of empirical studies.
Bi-LSTM Tagger: Our base model is a bidirectional LSTM model. It uses a GloVe embedder. Every input token is classified into one of 13 different classes.
Bi-LSTM Tagger adding a CRF layer: We then modify the above code, swapping the simple tagger with one that uses a CRF. The rest of the code is identical.
Bi-LSTM tagger with character and ELMo Embeddings: We modify the code to include a bidirectional LSTM character embedder. We use the ConcatEmbedders module to create the final word embeddings (Line 16), which concatenates the character embeddings with those from the previous word embedding and a pretrained ELMo contextual word embedding. This final model is the provisioned model for the reference string parsing task provided in SciWING.
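To show why adding a CRF layer changes the tagger's output compared with classifying each token independently, the following sketch implements standard Viterbi decoding over per-token emission scores and label-to-label transition scores. It is a generic illustration in plain Python with made-up scores and a two-label tag set; it is not the CRF implementation used by SciWING.

```python
from typing import Dict, List, Tuple


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
) -> List[str]:
    """Return the label sequence with the highest total emission + transition score."""
    labels = list(emissions[0])
    best = [dict(emissions[0])]  # best[i][label]: best score of a path ending in label
    back: List[Dict[str, str]] = []  # back[i][label]: previous label on that best path

    for scores in emissions[1:]:
        current, pointers = {}, {}
        for label in labels:
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transitions.get((p, label), 0.0),
            )
            current[label] = (
                best[-1][prev] + transitions.get((prev, label), 0.0) + scores[label]
            )
            pointers[label] = prev
        best.append(current)
        back.append(pointers)

    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda label: best[-1][label])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


emissions = [
    {"author": 2.0, "title": 0.0},
    {"author": 0.4, "title": 0.5},  # slightly favours "title" in isolation
    {"author": 2.0, "title": 0.0},
]
transitions = {
    ("author", "author"): 0.5,
    ("title", "title"): 0.5,
    ("title", "author"): -2.0,  # returning to "author" after "title" is penalised
}

greedy = [max(scores, key=scores.get) for scores in emissions]
decoded = viterbi(emissions, transitions)
print(greedy, decoded)
```

In this toy example the independent per-token choice labels the middle token as a title, while the jointly decoded sequence keeps all three tokens as author because the transition scores discourage switching back and forth.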
5 Related Work
Grobid (2008–2020) is the closest to a general workbench for scientific document processing. Similarly to SciWING, Grobid performs document structure classification and reference string parsing, among other tasks. However, Grobid is not a deep learning framework for scientific document processing. SciSpacy Neumann et al. (2019) focuses on biomedical tasks such as POS-tagging, syntactic parsing and biomedical span extraction. However, SciSpacy is primarily for deployment, as it does not allow development and testing of new models and architectures.
In contrast, task- and domain-agnostic frameworks also exist. NCRF++ Yang and Zhang (2018) is a tool for performing sequence tagging using neural networks and Conditional Random Fields. SciWING interfaces with the general-purpose NLP framework AllenNLP Gardner et al. (2017) and allows easy access to pre-trained neural network models for scientific document processing. The Transformers framework Wolf et al. (2019) enables simple access to pre-trained transformer architectures such as BERT, XLNet, ALBERT and RoBERTa. SciWING also builds on top of the transformers package to give researchers easy access to general-purpose and SDP-specific contextual word embeddings, such as SciBERT, for its tasks.
6 Conclusion
We introduce SciWING, an open-source scholarly document processing (SDP) toolkit, targeted at practitioners and researchers interested in rapid experimentation. It provisions pre-trained models for key SDP tasks — such as citation string parsing and citation intent classification — that achieve state-of-the-art performance.
SciWING’s modular design greatly facilitates model architecture development, speedy train/test cycles for architecture search, and supports transfer learning for use cases with limited annotated data. Configuration driven, SciWING allows the user to declare models, datasets and experiment parameters in a single configuration file.
SciWING is actively being developed. Our current development target is to incorporate models for sequence to sequence (seq2seq) generation, and multi-task learning (to ameliorate challenges with sparse data), alongside the implementation of additional SDP tasks and models. We hope that SciWING fosters collaboration among the SDP community, and encourage assistance with these goals through contributions on our Github repository.
References
- Ammar et al. (2018). Construction of the literature graph in Semantic Scholar. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), New Orleans, Louisiana, pp. 84–91.
- Ammar et al. (2017). The AI2 system at SemEval-2017 task 10 (ScienceIE): semi-supervised end-to-end entity and relation extraction. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 592–596.
- Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
- Gardner et al. (2017). AllenNLP: a deep semantic natural language processing platform.
- GROBID (2008–2020). GitHub. https://github.com/kermitt2/grobid
- Lample et al. (2016). Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 260–270.
- Neumann et al. (2019). ScispaCy: fast and robust models for biomedical natural language processing.
- Pennington et al. (2014). GloVe: global vectors for word representation. In EMNLP, Doha, Qatar.
- Peters et al. (2018). Deep contextualized word representations. In NAACL-HLT.
- Prasad et al. (2018). Neural ParsCit: a deep learning-based reference string parser. Int. J. Digit. Libr. 19 (4), pp. 323–337.
- Wolf et al. (2019). HuggingFace’s Transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771.
- Yang and Zhang (2018). NCRF++: an open-source neural sequence labeling toolkit. In Association for Computational Linguistics.
- Zhai et al. (2019). Improving chemical named entity recognition in patents with contextualized word embeddings. In Proceedings of the 18th BioNLP Workshop and Shared Task, Florence, Italy, pp. 328–338.