Official Stanford NLP Python Library for Many Human Languages
We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionalities to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza.
The growing availability of open-source natural language processing (NLP) toolkits has driven rapid development of computational approaches to study human languages. While existing NLP toolkits such as CoreNLP Manning et al. (2014), Flair Akbik et al. (2019), spaCy (https://spacy.io/), and UDPipe Straka (2018) have had wide usage, they also suffer from several limitations. First, existing toolkits often support only a few major languages, which has significantly limited the community’s ability to process multilingual text. Second, widely used tools are sometimes under-optimized for accuracy, potentially misleading downstream applications and the insights obtained from them. Third, they sometimes assume that input text has been tokenized or annotated with other tools, lacking the ability to process raw text within a unified framework. This has limited their applicability to text from diverse sources.
Table 1 compares Stanza against other popular toolkits along these dimensions: number of human languages supported, programming language, raw text processing, fully neural pipeline, availability of pretrained models, and state-of-the-art performance.
We introduce Stanza (the toolkit was named StanfordNLP prior to version 0.3.0), a Python natural language processing toolkit supporting many human languages. As shown in Table 1, compared to existing widely used NLP toolkits, Stanza has the following advantages:
From raw text to annotations. Stanza features a fully neural pipeline which takes raw text as input, and produces annotations including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.
Multilinguality. Stanza’s architectural design is language-agnostic and data-driven, which allows us to release models supporting 66 languages, by training the pipeline on the Universal Dependencies (UD) treebanks and other multilingual corpora.
State-of-the-art performance. We evaluate Stanza on a total of 112 datasets, and find that its neural pipeline adapts well to text of different genres, achieving state-of-the-art or competitive performance at each step of the pipeline.
Additionally, Stanza features a Python interface to the widely used Java CoreNLP software, allowing access to richer functionalities such as coreference resolution and relation extraction.
Stanza is fully open-source and we make pretrained models for all supported languages and datasets available for public download. We hope Stanza can facilitate multilingual NLP research and applications, and drive future research that produces insights from human languages.
At a high level, Stanza consists of two components: (1) a fully neural multilingual NLP pipeline; and (2) a Python client interface to the Java Stanford CoreNLP software. In this section we introduce their designs.
Stanza’s neural pipeline consists of models that range from tokenizing raw text to performing syntactic analysis on entire sentences (see Figure 1). All components are designed with processing many human languages in mind, with high-level design choices capturing common phenomena in many languages and data-driven models that learn the differences between these languages from data. Moreover, the implementation of Stanza components is highly modular, and reuses basic model architectures when possible for compactness. We highlight the important design choices here, and refer the reader to Qi et al. (2018) for modeling details.
When presented with raw text, Stanza tokenizes it and groups tokens into sentences as the first step of processing. Unlike most existing toolkits, Stanza combines tokenization and sentence segmentation from raw text into a single module. This is modeled as a tagging problem over character sequences, where the model predicts whether a given character is the end of a token, the end of a sentence, or the end of a multi-word token (MWT; see Figure 2). Following Universal Dependencies Zeman et al. (2019), we make a distinction between tokens (contiguous spans of characters in the input text) and syntactic words; these are interchangeable except for MWTs, where one token can correspond to multiple words. We choose to predict MWTs jointly with tokenization because this task is context-sensitive in some languages.
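To make the tagging formulation concrete, a simplified decoding step might look like the following sketch (the tag inventory and the `decode` helper are illustrative, not Stanza internals):

```python
# Simplified sketch of turning per-character tags into tokens and sentences.
# Illustrative tag inventory: 0 = inside a token, 1 = end of a token,
# 2 = end of a sentence (also ends the token), 3 = end of a multi-word token.
def decode(text, tags):
    sentences, tokens, buf = [], [], ""
    for ch, tag in zip(text, tags):
        buf += ch
        if tag > 0:                                 # any nonzero tag closes the current token
            tokens.append((buf.strip(), tag == 3))  # (surface form, is_mwt)
            buf = ""
        if tag == 2:                                # sentence boundary
            sentences.append(tokens)
            tokens = []
    if tokens:                                      # flush a trailing unterminated sentence
        sentences.append(tokens)
    return sentences
```

Because both token and sentence boundaries come from one character-level tagger, no separately tokenized input is ever required.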
Figure 2: An example of multi-word tokens in French: des is an MWT (de + les) in the first sentence, but a single word in the second.
|(fr) L’Association des Hôtels||(en) The Association of Hotels|
|(fr) Il y a des hôtels en bas de la rue||(en) There are hotels down the street|
Once MWTs are identified by the tokenizer, they are expanded into the underlying syntactic words as the basis of downstream processing. This is achieved with an ensemble of a frequency lexicon and a neural sequence-to-sequence (seq2seq) model, to ensure that frequently observed expansions in the training set are always robustly expanded while maintaining flexibility to model unseen words statistically.
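The ensemble’s control flow can be sketched as follows (the `lexicon` dictionary and `seq2seq_expand` callback are illustrative placeholders, not Stanza internals):

```python
# Sketch of the ensemble strategy: trust the training-set lexicon when the
# token was observed there, and fall back to the neural model otherwise.
def expand_mwt(token, lexicon, seq2seq_expand):
    key = token.lower()
    if key in lexicon:                  # frequent, reliably observed expansion
        return lexicon[key]
    return seq2seq_expand(token)        # unseen token: predict statistically

# Toy French examples: "du" expands to "de" + "le", "des" to "de" + "les".
lexicon = {"du": ["de", "le"], "des": ["de", "les"]}
print(expand_mwt("du", lexicon, lambda t: [t]))   # ['de', 'le']
```

This gives deterministic behavior on frequent expansions while keeping the seq2seq model available for the long tail.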
For each word in a sentence, Stanza assigns it a part-of-speech (POS), and analyzes its universal morphological features (UFeats, e.g., singular/plural, 1st/2nd/3rd person, etc.). To predict POS and UFeats, we adopt a bidirectional long short-term memory network (Bi-LSTM) as the basic architecture. For consistency among universal POS (UPOS), treebank-specific POS (XPOS), and UFeats, we adopt the biaffine scoring mechanism from Dozat and Manning (2017) to condition XPOS and UFeats prediction on that of UPOS.
Stanza also lemmatizes each word in a sentence to recover its canonical form (e.g., did → do). Similar to the multi-word token expander, Stanza’s lemmatizer is implemented as an ensemble of a dictionary-based lemmatizer and a neural seq2seq lemmatizer. An additional classifier is built on the encoder output of the seq2seq model to predict shortcuts such as lowercasing and identity copy, for robustness on long input sequences such as URLs.
Stanza parses each sentence for its syntactic structure, where each word in the sentence is assigned a syntactic head that is either another word in the sentence, or in the case of the root word, an artificial root symbol. We implement a Bi-LSTM-based deep biaffine neural dependency parser Dozat and Manning (2017). We further augment this model with two linguistically motivated features: one that predicts the linearization order of two words in a given language, and the other that predicts the typical distance in linear order between them. We have previously shown that these features significantly improve parsing accuracy Qi et al. (2018).
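At its core, deep biaffine attention scores every (dependent, head) word pair with a bilinear form plus bias terms. A minimal NumPy sketch of the arc scorer follows (dimensions and names are ours; in the actual parser the two representations come from separate MLPs over Bi-LSTM states):

```python
import numpy as np

def biaffine_arc_scores(h_dep, h_head, W, u, v, b):
    """Score every (dependent, head) pair in a sentence.

    h_dep, h_head: (n, d) arrays of dependent/head representations.
    Returns an (n, n) matrix where s[i, j] scores word j as the head of word i.
    """
    bilinear = h_dep @ W @ h_head.T      # pairwise interaction term
    dep_bias = (h_dep @ u)[:, None]      # dependent-only term
    head_bias = (h_head @ v)[None, :]    # head-only term (prior "headedness")
    return bilinear + dep_bias + head_bias + b

rng = np.random.default_rng(0)
n, d = 5, 8
scores = biaffine_arc_scores(rng.normal(size=(n, d)), rng.normal(size=(n, d)),
                             rng.normal(size=(d, d)), rng.normal(size=d),
                             rng.normal(size=d), 0.0)
print(scores.shape)   # (5, 5)
```

The linearization-order and linear-distance features mentioned above enter as additional terms in this score, biasing the parser toward word orders typical of the language.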
For each input sentence, Stanza also recognizes named entities in it (e.g., person names, organizations, etc.). For NER we adopt the contextualized string representation-based sequence tagger as in Akbik et al. (2018). We first train a forward and a backward character-level LSTM language model, and at tagging time we concatenate the representations at the end of each word position from both language models with word embeddings, and feed the result into a standard one-layer Bi-LSTM sequence tagger with a conditional random field (CRF)-based decoder.
Stanford’s Java CoreNLP software provides a comprehensive set of NLP tools especially for the English language. However, these tools are not easily accessible with Python, the programming language of choice for many NLP researchers and practitioners, due to a lack of official support. To facilitate the use of CoreNLP from Python, we take advantage of the existing server interface in CoreNLP, and implement a robust client as its Python interface.
When a user instantiates the CoreNLP client, Stanza will automatically start the CoreNLP server as a local process. The client then communicates with the server through its RESTful APIs, after which annotations are transmitted in Protocol Buffers, and converted back to native Python data objects. Alternatively, users can specify JSON or XML as annotation format. To ensure robustness, while the client is being used, Stanza also periodically checks the health of the CoreNLP server, and restarts it if necessary.
Stanza’s user interface is designed to allow quick out-of-the-box processing of multilingual text. To achieve this, Stanza supports automated model download via Python code and customization of the pipeline with processors of choice. Moreover, annotation results can be accessed as native Python objects to allow for flexible post-processing.
Stanza’s neural NLP pipeline can be initialized with the Pipeline class, taking a language name as an argument. By default, all processors will be loaded and run over the input text; however, users can also specify the processors to load and run with a list of processor names as an argument. Users can additionally specify other processor-level properties, such as batch sizes used by processors, at initialization time.
Stanza is also designed to run on different hardware devices. By default, CUDA devices will be used whenever they are visible to the pipeline; otherwise CPUs will be used. However, users can force all computation to run on CPUs by setting use_gpu=False at initialization time.
The following code snippet shows a minimal usage of Stanza for downloading the Chinese model, annotating a Chinese sentence with customized processors, and printing out all annotations:
After all processors are run, a Document instance will be returned, which stores all annotation results. Within a Document, annotations are further stored in Sentences, Tokens and Words in a top-down fashion (Figure 1). The following code snippet demonstrates how to access the text and POS tag of each word in a document and all named entities in the document:
The CoreNLP client interface is designed in a way that the actual communication with the backend CoreNLP server is transparent to the user. To annotate an input text with the CoreNLP client, a CoreNLPClient instance needs to be initialized, with an optional list of CoreNLP annotators. After the annotation is complete, results will be accessible as native Python objects.
The following code snippet demonstrates how to establish a CoreNLP client and obtain the NER and coreference resolution annotations of an English sentence:
With the client interface, users can annotate text in the six languages supported by CoreNLP.
To help visualize documents and their annotations generated by Stanza, we build an interactive web demo of the pipeline. For all languages and all annotations Stanza provides in those languages, we generate predictions from the models trained on the largest treebank/NER dataset, and visualize the results with the Brat rapid annotation tool (https://brat.nlplab.org/). This demo runs in a server/client architecture, with annotation performed purely on the server side. We make one instance of this demo publicly available at http://stanza.run/; it can also be run locally with the proper Python libraries installed. An example of running Stanza on a German sentence can be found in Figure 3.
For all neural processors, Stanza provides command-line interfaces for users to train their own customized models. To do this, users need to prepare the training and development data in compatible formats (i.e., CoNLL-U format for the Universal Dependencies pipeline and BIO format column files for the NER model). The following command trains a neural dependency parser with user-specified training and development data:
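For instance, a command along the following lines trains the dependency parser (the flag names are illustrative of the command-line interface, and the data paths and treebank shorthand are placeholders):

```shell
python -m stanza.models.parser \
    --train_file data/en_ewt.train.conllu \
    --eval_file data/en_ewt.dev.conllu \
    --gold_file data/en_ewt.dev.conllu \
    --shorthand en_ewt \
    --mode train
```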
Table 2 (excerpt): macro-averaged results of Stanza (“Ours”) over 100 Universal Dependencies treebanks, per pipeline step.
|System||Tokens||Sents.||Words||UPOS||XPOS||UFeats||Lemmas||UAS||LAS|
|Overall (100 treebanks)||Ours||99.09||86.05||98.63||92.49||91.80||89.93||92.78||80.45||75.68|
To establish benchmark results and compare with other popular toolkits, we trained and evaluated Stanza on a total of 112 datasets. All pretrained models will be made publicly downloadable.
We train and evaluate Stanza’s tokenizer/sentence splitter, MWT expander, POS/UFeats tagger, lemmatizer, and dependency parser with the Universal Dependencies v2.5 treebanks Zeman et al. (2019). For training we use 100 treebanks from this release that have non-copyrighted training data, and for treebanks that do not include development data, we randomly split out 20% of the training data as development data. These treebanks represent 66 languages spanning a diversity of language families, including Indo-European, Afro-Asiatic, Uralic, Turkic, Sino-Tibetan, etc. For NER, we train and evaluate Stanza with 12 publicly available datasets covering 8 major languages as shown in Table 3 Nothman et al. (2013); Sang and De Meulder (2003); Tjong Kim Sang (2002); Benikova et al. (2014); Mohit et al. (2012); Taulé et al. (2008); Weischedel et al. (2013). For the WikiNER corpora, as canonical splits are not available, we randomly split them into 70% training, 15% dev and 15% test splits. For all other corpora we used their canonical splits.
On the Universal Dependencies treebanks, we tuned all hyper-parameters on several large treebanks and applied them to all other treebanks. We used the word2vec embeddings released as part of the 2018 UD Shared Task Zeman et al. (2018), or the fastText embeddings Bojanowski et al. (2017) whenever word2vec is not available. For the character-level language models in the NER component, we pretrained them on a mix of the Common Crawl and Wikipedia dumps and the news corpora released by the WMT19 Shared Task Barrault et al. (2019), with the exceptions of English and Chinese, for which we pretrained on the Google One Billion Word Chelba et al. (2013) and the Chinese Gigaword corpora (https://catalog.ldc.upenn.edu/LDC2011T13), respectively. We again applied the same hyper-parameters to models of all languages.
For performance on UD treebanks, we compared our system against UDPipe and spaCy on treebanks of 5 major languages whenever a pretrained model is available. As shown in Table 2, Stanza achieved the best performance on most scores reported. Notably, we find that Stanza’s language-agnostic pipeline architecture is able to adapt to datasets of different languages and genres. This is also shown by Stanza’s high macro-averaged scores over 100 treebanks covering 66 languages.
For performance of the NER component, we compared our system against Flair and spaCy. For spaCy we reported results from its publicly available pretrained model whenever one trained on the same dataset can be found, otherwise we retrained its model on our datasets with default hyper-parameters. For Flair, since their downloadable models were pretrained on dataset versions different from canonical ones, we retrained all models on our own dataset splits with their best reported hyper-parameters. All test results are shown in Table 3. We find that on all datasets Stanza achieved either higher or close scores when compared against Flair, which is heavily tuned for sequence tagging tasks. When compared to spaCy, Stanza’s NER performance is much better. It is worth noting that Stanza’s high performance is achieved with much smaller NER models compared with Flair (up to 75% smaller), as we intentionally compressed the models for memory efficiency and ease of distribution.
In this paper we introduced Stanza, a Python natural language processing toolkit supporting many human languages. We have demonstrated that Stanza’s neural pipeline not only offers wide coverage of human languages, but is also accurate on all tasks, thanks to its language-agnostic, fully neural architectural design. Meanwhile, Stanza’s CoreNLP client extends its functionality with comprehensive NLP tools that were previously unavailable in Python.
For future work, we consider the following areas of improvement in the near term:
Models downloadable in Stanza right now are largely trained on a single dataset. To make available models that are robust to many different genres of text, we would like to investigate the possibility of pooling various sources of compatible data to train “default” models for each language;
The amount of computation and resources available to us is limited. We would therefore like to build an open “model zoo” for Stanza, so that researchers from outside our group can also contribute their models (built potentially on their own data) and benefit from models released by others;
Stanza has been designed to optimize for accuracy of its predictions, but this sometimes comes at the cost of computational efficiency and limits the toolkit’s use. We would like to further investigate reducing model sizes and speeding up computation in the toolkit, while still maintaining the same level of accuracy on various tasks.
We would also like to expand Stanza’s coverage of functionalities by implementing other processors such as neural coreference resolution or relation extraction for richer text analytics.