Inscriptis – A Python-based HTML to text conversion library optimized for knowledge extraction from the Web

by Albert Weichselbraun, et al.

Inscriptis provides a library, a command line client and a Web service for converting HTML to plain text. Its development was triggered by the need to obtain accurate text representations for knowledge extraction tasks that preserve the spatial alignment of text without drawing upon heavyweight, browser-based solutions such as Selenium. In contrast to existing software packages such as HTML2text, jusText and Lynx, Inscriptis (i) provides a layout-aware conversion of HTML that more closely resembles the rendering obtained from standard Web browsers and, therefore, better preserves the spatial arrangement of text elements. Its conversion quality stems from correctly converting complex HTML constructs such as nested tables and from interpreting a subset of the HTML attributes that determine text alignment. In addition, it (ii) supports annotation rules, i.e., user-provided mappings that annotate the extracted text based on the structural and semantic information encoded in the HTML tags and attributes that control structure and layout in the original HTML document. Together, these features ensure that downstream knowledge extraction components can operate on accurate text representations and, if annotation support has been enabled, can also draw on the semantics and structure of the original HTML document.
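The idea behind annotation rules, a user-provided mapping from HTML tags to labels that are attached to the extracted text, can be illustrated with a minimal sketch. The code below is not Inscriptis and does not use its API. It is a toy built on Python's standard `html.parser`, and the names `ANNOTATION_RULES`, `AnnotatingTextExtractor` and `annotate` are hypothetical, chosen for illustration only.

```python
from html.parser import HTMLParser

# Hypothetical rule set (an assumption for this sketch):
# HTML tag -> label attached to the text enclosed by that tag.
ANNOTATION_RULES = {"h1": "heading", "b": "emphasis"}


class AnnotatingTextExtractor(HTMLParser):
    """Toy converter: collects text and records (start, end, label) spans."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.chunks = []       # text fragments emitted so far
        self.length = 0        # number of characters emitted so far
        self.open_spans = []   # stack of (tag, start_offset) for tags with a rule
        self.annotations = []  # finished (start, end, label) triples

    def handle_starttag(self, tag, attrs):
        # Remember where the text of an annotated element begins.
        if tag in self.rules:
            self.open_spans.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the innermost open span if it matches this end tag.
        if self.open_spans and self.open_spans[-1][0] == tag:
            _, start = self.open_spans.pop()
            self.annotations.append((start, self.length, self.rules[tag]))

    def handle_data(self, data):
        self.chunks.append(data)
        self.length += len(data)


def annotate(html, rules=ANNOTATION_RULES):
    """Return the plain text and the list of annotation spans for `html`."""
    parser = AnnotatingTextExtractor(rules)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks), parser.annotations


text, spans = annotate("<h1>Title</h1><p>Some <b>bold</b> text.</p>")
for start, end, label in spans:
    print(label, repr(text[start:end]))
```

The sketch only tracks character offsets and ignores layout entirely: block elements such as `<h1>` and `<p>` produce no line breaks, and tables are not handled. Preserving that spatial arrangement is exactly the part of the conversion that the abstract attributes to Inscriptis.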


