PAWLS: PDF Annotation With Labels and Structure

01/25/2021
by Mark Neumann, et al.

Adobe's Portable Document Format (PDF) is a popular way of distributing view-only documents with rich visual markup. This presents a challenge to NLP practitioners who wish to use the information contained within PDF documents for training models or data analysis, because annotating these documents is difficult. In this paper, we present PDF Annotation with Labels and Structure (PAWLS), a new annotation tool designed specifically for the PDF document format. PAWLS is particularly suited to mixed-mode annotation and to scenarios in which annotators require extended context to annotate accurately. PAWLS supports span-based textual annotation, N-ary relations, and freeform, non-textual bounding boxes, all of which can be exported in convenient formats for training multi-modal machine learning models. A read-only PAWLS server is available at https://pawls.apps.allenai.org/ and the source code is available at https://github.com/allenai/pawls.
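
To make the three annotation types named in the abstract concrete, the following is a minimal sketch of how span-based annotations, freeform boxes, and N-ary relations might be represented together. Every name here (`Box`, `Token`, `Annotation`, `Relation`, `annotate_region`) is a hypothetical illustration and not the PAWLS schema or export format; the sketch assumes that each PDF token carries a bounding box in page coordinates and that a textual span can be derived from the tokens a drawn region fully covers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in page coordinates (hypothetical layout)."""
    page: int
    left: float
    top: float
    width: float
    height: float

    def contains(self, other: "Box") -> bool:
        # True when `other` lies entirely inside this box on the same page.
        return (
            self.page == other.page
            and self.left <= other.left
            and self.top <= other.top
            and other.left + other.width <= self.left + self.width
            and other.top + other.height <= self.top + self.height
        )


@dataclass
class Token:
    text: str
    box: Box


@dataclass
class Annotation:
    id: str
    label: str
    box: Box
    # Empty for freeform, non-textual regions such as figures.
    tokens: List[Token] = field(default_factory=list)


@dataclass
class Relation:
    label: str
    member_ids: List[str]  # any number of annotation ids, hence N-ary


def annotate_region(ann_id: str, label: str, region: Box,
                    page_tokens: List[Token]) -> Annotation:
    """Build an annotation from a drawn region, keeping the tokens it fully covers."""
    covered = [t for t in page_tokens if region.contains(t.box)]
    return Annotation(ann_id, label, region, covered)


# Usage: one textual span, one freeform box, and a relation linking them.
page_tokens = [
    Token("Deep", Box(0, 10, 10, 30, 10)),
    Token("Learning", Box(0, 45, 10, 50, 10)),
    Token("Abstract", Box(0, 10, 40, 50, 10)),
]
title = annotate_region("a1", "Title", Box(0, 5, 5, 100, 20), page_tokens)
figure = annotate_region("a2", "Figure", Box(0, 200, 200, 50, 50), page_tokens)
relation = Relation("describes", [title.id, figure.id])
```

Keeping the box on every annotation, whether or not it covers tokens, is one way to let the same record feed both text-only and multi-modal training pipelines; other designs are equally plausible.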
