The open source community has contributed many natural language processing (NLP) toolkits to research and industry organizations, lowering the barrier of entry to computational language analysis. Despite their wide usage, many NLP toolkits suffer from a major limitation: their architectures are bound by a pipeline design (Manning et al., 2014; Straka and Straková, 2017; Gardner et al., 2018; Akbik et al., 2019; Qi et al., 2020), which leads to error propagation, large memory consumption, and high latency. Although toolkits from industry, such as spaCy, have started to exploit Multi-Task Learning (MTL), many key components such as semantic role labeling, constituency parsing, and Abstract Meaning Representation parsing are not generally available. Additionally, they lack the ability to serve massive concurrent requests from the web, due to the absence of an efficient request-batching and multi-processing mechanism.
In the face of these challenges, we introduce ELIT, an efficient, accurate, and fast NLP toolkit that supports the largest number of core NLP tasks with the boost of state-of-the-art transformer encoders. Compared to existing popular NLP toolkits, ELIT excels with the following advantages:
MTL Framework ELIT is powered by an efficient MTL framework that accommodates many NLP tasks, spanning surface, syntactic, and semantic tasks.
State-of-the-Art Performance Backed by the latest transformer encoders and decoders, ELIT establishes state-of-the-art results on most tasks and yields comparable results on the others.
Requests Batching for Concurrency To scale up its inference, ELIT provides a built-in multi-worker web server featuring an efficient request-batching mechanism.
ELIT embraces the NLP community with a fully open source license and many pre-trained models for public download. We hope ELIT facilitates NLP research and applications, and brings the benefits of NLP techniques to broader audiences.
2 System Architecture
From the user’s view, ELIT provides two sets of APIs: (1) a native multi-task learning framework coupled with a rule-based tokenizer (see Figure 1); (2) a RESTful client interface to the MTL server. In this section, we introduce their architectural designs.
Multi-Task Learning (MTL)
Given a tokenized sentence (or document), ELIT employs a sub-word tokenizer to further split each token into sub-tokens. These sub-tokens are fed into a fine-tuned Transformer Encoder (TE) to obtain their contextualized embeddings. For sentences longer than the maximum input length of the TE, a sliding-window routine is invoked to mitigate the issue.
As in Algorithm 1, the sliding_window sub-routine takes as input a list of sub-tokens and slices them into windows of a maximal size, which are fed into the Transformer Encoder. The hidden states are then restored to match the original sequence using the restore sub-routine, such that in overlapping regions the inner part of each window is used (Figure 2). Then, average pooling is applied over sub-token embeddings to obtain the corresponding token embeddings. Finally, the token embeddings are fed into the decoder of every task in parallel. This MTL architecture apportions the cost of the TE across decoders, yielding lower overall latency than running the TE individually for each decoder. It also reduces the training cost and deployment effort due to its compact structure.
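The sliding_window/restore pair can be sketched as follows. This is a minimal illustration, not ELIT's actual implementation: the window size, stride, and the rule for preferring the most interior window at overlapping positions are our assumptions.

```python
def sliding_window(subtokens, max_len, stride):
    # Slice a long sub-token sequence into overlapping windows of at most
    # max_len items, stepping by `stride` each time.
    windows, start = [], 0
    while True:
        windows.append(subtokens[start:start + max_len])
        if start + max_len >= len(subtokens):
            break
        start += stride
    return windows

def restore(window_outputs, stride, total_len):
    # Stitch per-window outputs back into one sequence. Where windows overlap,
    # keep the output from the window in which the position is most interior
    # (i.e., farthest from either window edge).
    restored = [None] * total_len
    best_margin = [-1] * total_len
    for i, outputs in enumerate(window_outputs):
        start = i * stride
        for j, out in enumerate(outputs):
            pos = start + j
            margin = min(j, len(outputs) - 1 - j)  # distance to nearest edge
            if margin > best_margin[pos]:
                restored[pos] = out
                best_margin[pos] = margin
    return restored
```

Running the encoder per window and restoring identity outputs reproduces the original sequence, which is the invariant the real routine must preserve.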
In scenarios with strict latency requirements, such as a dialogue system or a real-time machine translator, even the native MTL cannot respond timely to highly concurrent requests due to the Global Interpreter Lock (GIL) of Python.¹ To scale up the inference, we also implement a RESTful Client/Server with request batching in ELIT, as illustrated in Figure 3.

¹We also investigated serving techniques free of the GIL, such as PyTorch JIT and ONNX. However, these techniques usually convert the model into a symbolic (static) representation, hampering features such as dynamic task scheduling and length-based batching. Thus, we leave them for future work.
On the server side, HTTP requests are bucketed into batches according to their arrival time and predicted concurrently by several worker processes spawned over the CPUs and GPUs. Sentences from different requests may be put into the same batch, as long as the requested tasks are the same. When no worker is idle, a First-In-First-Out (FIFO) queue stores incoming requests. On the client side, the request-batching mechanism is transparent: each user can parse their documents in the same way as using the native MTL API exclusively. Note that both APIs share similar semantics and return the same JSON format, so users can switch APIs without breaking changes.
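The server-side batching described above can be sketched with Python's standard queue module. This is a simplified, single-threaded illustration under our own assumptions (request dicts with a 'tasks' field, a fixed timeout window); ELIT's actual server differs in its multi-worker orchestration.

```python
import queue
import time

def collect_batch(request_queue, max_batch_size, timeout):
    # Drain up to max_batch_size requests that ask for the same task set,
    # waiting at most `timeout` seconds after the first request arrives.
    first = request_queue.get()              # block until at least one request
    tasks = first['tasks']
    batch, leftovers = [first], []
    deadline = time.monotonic() + timeout
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            req = request_queue.get(timeout=remaining)
        except queue.Empty:
            break
        if req['tasks'] == tasks:            # only same-task requests share a batch
            batch.append(req)
        else:
            leftovers.append(req)
    for req in leftovers:                    # requeue requests for other tasks
        request_queue.put(req)
    return batch
```

A worker would call collect_batch in a loop, run the MTL model once on the merged sentences, and then scatter the predictions back to the individual responses.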
Within the MTL architecture, ELIT provides a set of decoders for widely used NLP tasks, each supported by a state-of-the-art decoder. In this section, we briefly introduce each of them.
For efficiency, a linear layer is used to predict the part-of-speech tags. We acknowledge the effectiveness of character-level and case features (Bohnet et al., 2018; Akbik et al., 2018) for POS; however, the improvements in accuracy brought by these features are marginal compared to the latency they introduce in our setting. Thus, we do not integrate them.
The two-stage CRF decoder (Zhang et al., 2020) is used for CON, which is optimized with a tree-structured CRF objective on unlabeled constituents. POS features are avoided to remove the dependency on the POS decoder.
The end-to-end span ranking decoder (He et al., 2018) is used for SRL. Their attention-based span representations are replaced by average pooled embeddings for simplicity.
The graph sequence transduction decoder (He and Choi, 2021) is used for AMR. Linguistic features including POS, NER and LEM embeddings are removed for efficiency.
Separated from the MTL architecture, ELIT also provides two coreference resolution components that perform the traditional document-level decoding (DCR) as well as online decoding (OCR).
The model takes the current utterance and the conversation context as input, extracting mentions in the current utterance (including singletons) and resolving their coreference with mentions predicted from the dialogue history.
3 System Usage
3.1 MTL API
The user interface of ELIT is designed to hide the underlying complexity from end users, enabling agile development of NLP applications. To fulfill this goal, ELIT packs all the sophisticated work required by other NLP toolkits, such as downloading models, instantiating the right class, loading and deploying models to a GPU where possible, and feeding the outputs from one component to another, into the following three lines of code:
As indicated by the identifier passed to the load call, LEM_POS_NER_DEP_SDP_CON_AMR_EN takes input sentences and performs LEM, POS, NER, DEP, SDP, CON and AMR jointly. The interface for coreference resolution is the same as MTL, except that the order of the input sentences is expected to be consistent with the document. The following snippet loads the coreference model with SpanBERT-Large and performs predictions for DCR:
3.2 RESTful API
In a more common setting where multiple users require their documents to be parsed by ELIT, its web server can be set up to handle concurrent requests efficiently:
Note that we design the native MTL API and the RESTful API such that they share the same semantics and can be used interchangeably. The only subtle difference is that the native MTL API accepts only tokenized sentences, whereas the RESTful API additionally accepts raw text, which will be tokenized and segmented into sentences on the server side.
3.3 Output Format
In both the MTL and RESTful APIs, a Document instance is returned to the user, which is a Python dict storing all annotation results. For each task, its annotations are associated with a key indicated by the task name; e.g., the above doc will have the following structure:
tok stores the surface form of each token.
lem stores the lemmatization of each token.
pos stores the part-of-speech tag of each token. In this component, the Penn Treebank part-of-speech tags (Santorini, 1990) are used.
ner stores the (type, start, end, form) of each entity. In this component, the OntoNotes 5 NER annotations (Weischedel et al., 2013) are used.
srl stores the (role, start, end, form) of the predicates and arguments corresponding to each flattened predicate-argument structure. In this component, the OntoNotes 5 SRL annotations (Weischedel et al., 2013) are used with an additional role PRED indicating the predicate.
dep stores the (head, relation) of each token, with the offset starting from -1 (ROOT). In this component, the primary dependency of the Deep Dependency Graph representation (Choi, 2017) is used. The full representation with secondary dependencies is provided in another component.
con stores the constituency trees, specifically (label, child-constituents) for non-terminal constituents and form for terminals. Note that we designed a nested-list representation with round brackets replaced by square brackets to avoid ambiguity and make it compatible with JSON. When not being printed out, the Document class converts our nested-list structure to the conventional bracketed tree.
amr stores the logical triples of Abstract Meaning Representation in the format of (source, relation, target). Note that the Document class converts it to Penman format (Goodman, 2020) when accessed through code.
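Putting the fields above together, a toy Document for one sentence might look like the sketch below; all values, spans, and relation labels are illustrative rather than actual model output. The helper also shows how the square-bracket constituency representation maps back to conventional round-bracket notation.

```python
# Toy Document structure for one three-token sentence; values are made up.
doc = {
    'tok': [['Emory', 'NLP', 'rocks']],
    'lem': [['emory', 'nlp', 'rock']],
    'pos': [['NNP', 'NNP', 'VBZ']],
    'ner': [[('ORG', 0, 2, 'Emory NLP')]],          # (type, start, end, form)
    'dep': [[(1, 'com'), (2, 'nsbj'), (-1, 'root')]],  # offsets start at -1 (ROOT)
    # Non-terminals are [label, child-constituents]; terminals are plain forms.
    'con': [['S', [['NP', ['Emory', 'NLP']], ['VP', ['rocks']]]]],
}

def to_bracketed(node):
    # Convert the JSON-compatible nested-list tree to a bracketed tree string.
    if isinstance(node, str):        # terminal: just the token form
        return node
    label, children = node           # non-terminal: (label, child-constituents)
    return '(%s %s)' % (label, ' '.join(to_bracketed(c) for c in children))
```

For instance, to_bracketed(doc['con'][0]) yields the familiar string form of the tree, which is essentially the conversion the Document class performs.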
For the DCR example shown before, the following output should be generated:
where dcr contains a list of clusters, each consisting of spans referring to the same entity, in the format of (sentence-index, token-start, token-end, text). The input and output formats for OCR are shown in detail on the GitHub page (https://github.com/emorynlp/elit/blob/main/docs/data_format_ocr.md).
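As a concrete illustration of this cluster format, a two-sentence document in which "John" and "He" corefer might yield the following (the example and its values are ours, not actual output):

```python
# One coreference cluster with two mentions; each mention is
# (sentence-index, token-start, token-end, text).
dcr = [
    [(0, 0, 1, 'John'), (1, 0, 1, 'He')],
]
```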
In ELIT, the APIs to train a new model are as straightforward as the inference APIs. Instead of resorting to a configuration file, we opt for native Python APIs, which offer built-in documentation of each parameter and the necessary type checking. Sticking to the Python APIs requires no extra effort to learn another language for writing config files. These benefits are illustrated in the following code snippet, which demonstrates how to train a joint NER and DEP component with easy integration of the RoBERTa encoder:
For each task, its popular training file format (e.g., TSV, CoNLL-U) is retained so that users can make use of the readily available open-access datasets on the web. Note that our SortingSamplerBuilder builds a sampler that groups sequences of similar length into the same batch, significantly accelerating training and inference. The training of coreference resolution models is performed outside ELIT, for which users can refer to our documentation.
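The effect of such length-based batching can be sketched as follows; this is our own illustration of the idea (sorting by length, then capping each batch by its padded token count), not SortingSamplerBuilder's actual interface.

```python
def length_sorted_batches(sequences, batch_max_tokens):
    # Sort indices by sequence length so neighbors in a batch are similar in
    # length, minimizing padding waste.
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    batches, current, longest = [], [], 0
    for i in order:
        longest_if_added = max(longest, len(sequences[i]))
        # Padded cost of a batch = batch size * longest sequence in the batch.
        if current and longest_if_added * (len(current) + 1) > batch_max_tokens:
            batches.append(current)
            current, longest_if_added = [], len(sequences[i])
        current.append(i)
        longest = longest_if_added
    if current:
        batches.append(current)
    return batches
```

Because every sequence in a batch is padded to the longest one, grouping by length lets more sequences fit under the same token budget than arrival-order batching would.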
4 Performance Evaluation
4.1 Datasets and Metrics
The ELIT models are trained on a mixture of OntoNotes 5 (Weischedel et al., 2013), BOLT English Treebanks (Song et al., 2019; Tracey et al., 2019), THYME/SHARP/MiPACQ Treebanks (Albright et al., 2013), English Web Treebank (Bies et al., 2012), QuestionBank (Judge et al., 2006) and the AMR 3.0 dataset (Knight et al., 2020). Batches from different tasks are mixed together so that even if a corpus offers no annotation for some tasks, it can still be exploited by ELIT. For LEM, POS, NER, DEP, CON and SRL, the mixed corpora are split into training, development and test sets with an 8:1:1 ratio. For AMR, the standard splits are used.
The following evaluation metrics are used for each task. LEM and POS: accuracy; NER: span-level labeled F1; DEP: labeled attachment score; CON: constituent-level labeled F1; SRL: micro-averaged F1 of predicate-argument-label triples; AMR: Smatch (Cai and Knight, 2013); DCR/OCR: averaged F1 of MUC, B3, and CEAF.
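For concreteness, the span-level labeled F1 used for NER scores a predicted span as correct only if its label and boundaries exactly match a gold span; the sketch below (our own helper, not ELIT's evaluator) makes the computation explicit.

```python
def span_f1(gold, pred):
    # gold/pred: iterables of (type, start, end) spans. A prediction counts as
    # a true positive only on an exact (type, start, end) match.
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The SRL metric works analogously, with role-labeled predicate-argument triples in place of typed spans.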
4.3 Accuracies and Speed
The ELIT MTL model is trained and evaluated using a single TITAN RTX GPU for 30 epochs, which takes roughly 36 hours. The scores on the test set and the decoding speed are listed in Table 2.
To make relatively fair comparisons with existing toolkits and models from published work, we also train single-task learning models on OntoNotes 5 and CoNLL-2012 with the standard splits for NER, SRL and coreference. The results are shown in Table 1. We find that ELIT achieves either higher or comparable performance relative to existing toolkits and models.
5 Conclusion and Future Work
We introduced ELIT, the Emory Language and Information Toolkit, which provides the largest number of NLP tasks within an efficient MTL framework. Our APIs demonstrate the usability of ELIT within several function calls. The interchangeable RESTful API further extends ELIT with extra features and enables it to serve large-scale concurrent requests. For future work, we plan the following:
Train multilingual models, since the design of ELIT is language agnostic.
Integrate statistical tokenizers to replace the rule-based one and to support non-tokenized languages like Korean and Chinese.
Exploit model distillation and compression techniques to reduce the size of transformer encoders.
References

- FLAIR: an easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 54–59.
- Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 1638–1649.
- Towards comprehensive syntactic and semantic annotations of the clinical narrative. Journal of the American Medical Informatics Association 20(5), pp. 922–930.
- English Web Treebank. Linguistic Data Consortium, Philadelphia, PA.
- Morphosyntactic tagging with a meta-BiLSTM model over context sensitive token encodings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2642–2652.
- Smatch: an evaluation metric for semantic feature structures. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 748–752.
- Deep dependency graph conversion in English. In TLT, pp. 35–62.
- Simple data-driven context-sensitive lemmatization.
- On the shortest arborescence of a directed graph. Scientia Sinica 14, pp. 1396–1400.
- ELECTRA: pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
- Deep biaffine attention for neural dependency parsing. In Proceedings of the 5th International Conference on Learning Representations, ICLR'17.
- Optimum branchings. Journal of Research of the National Bureau of Standards B 71(4), pp. 233–240.
- AllenNLP: a deep semantic natural language processing platform. In Proceedings of the Workshop for NLP Open Source Software (NLP-OSS), Melbourne, Australia, pp. 1–6.
- Penman: an open-source library and tool for AMR graphs. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 312–319.
- Levi graph AMR parser using heterogeneous attention. In Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies.
- Jointly predicting predicates and arguments in neural semantic role labeling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 364–369.
- Generalizing natural language analysis through span-relation representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 2120–2133.
- SpanBERT: improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics 8, pp. 64–77.
- BERT for coreference resolution: baselines and analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5803–5808.
- QuestionBank: creating a corpus of parse-annotated questions. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pp. 497–504.
- Abstract Meaning Representation (AMR) annotation release 3.0. Technical Report LDC2020T02, Linguistic Data Consortium, Philadelphia, PA.
- 75 languages, 1 model: parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2779–2795.
- Dependency or span, end-to-end uniform semantic role labeling. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6730–6737.
- RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60.
- Joint lemmatization and morphological tagging with Lemming. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2268–2274.
- Stanza: a Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Online, pp. 101–108.
- Part-of-speech tagging guidelines for the Penn Treebank Project (3rd revision). Technical Reports (CIS), pp. 570.
- BOLT English SMS/Chat.
- Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 88–99.
- BOLT English Discussion Forums.
- OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA.
- Revealing the myth of higher-order inference in coreference resolution. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 8527–8533.
- A unified generative framework for various NER subtasks. arXiv preprint arXiv:2106.01223.
- Named entity recognition as dependency parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 6470–6476.
- Fast and accurate neural CRF constituency parsing. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pp. 4046–4053.