A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing

10/14/2022
by Amir Zeldes, et al.

Foundational Hebrew NLP tasks such as segmentation, tagging and parsing have relied to date on various versions of the Hebrew Treebank (HTB, Sima'an et al. 2001). However, the data in HTB, a single-source newswire corpus, is now over 30 years old and does not cover many aspects of contemporary Hebrew on the web. This paper presents a new, freely available UD treebank of Hebrew, stratified across a range of topics selected from Hebrew Wikipedia. In addition to introducing the corpus and evaluating the quality of its annotations, we deploy automatic validation tools based on grew (Guillaume, 2021) and conduct the first cross-domain parsing experiments in Hebrew. We obtain new state-of-the-art (SOTA) results on UD NLP tasks, using a combination of the latest language modelling and some incremental improvements to existing transformer-based approaches. We also release a new version of the UD HTB that matches the annotation scheme updates from our new corpus.
