Handling tree-structured text: parsing directory pages

11/24/2021
by Sarang Shrivastava, et al.

Determining the reading sequence of text is fundamental to document understanding. The problem is easily solved on pages where the text is organized into a sequence of lines and vertical alignment runs the height of the page, producing multiple columns that can be read from left to right. We present a situation, the directory page parsing problem, where information is laid out on the page in an irregular, visually organized, two-dimensional format. Directory pages are fairly common in financial prospectuses and carry information about organizations, their addresses, and their relationships that is key to business tasks in client onboarding. Directory pages sometimes have hierarchical structure, which motivates generalizing the reading sequence to a reading tree. We present solutions to two problems, identifying directory pages and constructing the reading tree, using learnt classifiers for text segments and a bottom-up (right-to-left, bottom-to-top) traversal of segments. The solution is a key part of a production service that supports automatic extraction of organization, address, and relationship information from client onboarding documents.
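To make the idea of a reading tree concrete, here is a minimal sketch of one way a bottom-up traversal could assemble a hierarchy from labelled segments. It is not the authors' algorithm, which the abstract does not spell out. The `Segment` class, the three labels (`role`, `organization`, `address`) and their depths, and the plain sort by position are all assumptions made for illustration; the labels are supplied by hand here, standing in for the output of the learnt classifiers.

```python
from dataclasses import dataclass, field

# Hypothetical label set and nesting depths; the source does not specify them.
LEVEL = {"role": 0, "organization": 1, "address": 2}


@dataclass
class Segment:
    text: str
    left: float   # x of the segment's left edge
    top: float    # y of the segment's top edge (y grows downward)
    label: str    # assumed classifier output, one of LEVEL's keys
    children: list = field(default_factory=list)


def build_reading_tree(segments):
    """Visit segments bottom-to-top, right-to-left, attaching deeper
    segments to the first shallower segment encountered afterwards."""
    order = sorted(segments, key=lambda s: (s.top, s.left), reverse=True)
    pending = []  # stack of subtrees still waiting for a parent
    for seg in order:
        # Adopt every pending subtree that is deeper than this segment.
        while pending and LEVEL[pending[-1].label] > LEVEL[seg.label]:
            seg.children.append(pending.pop())
        pending.append(seg)
    # Whatever is left has no parent; reverse to restore reading order.
    return pending[::-1]


def outline(nodes, depth=0):
    """Flatten the tree into indented lines for inspection."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + node.text)
        lines.extend(outline(node.children, depth + 1))
    return lines


# Invented single-column example; names and coordinates are illustrative.
demo = [
    Segment("Investment Manager", 0, 0, "role"),
    Segment("Acme Capital LLP", 0, 10, "organization"),
    Segment("1 Main Street, London", 0, 20, "address"),
    Segment("Custodian", 0, 30, "role"),
    Segment("Bank X", 0, 40, "organization"),
    Segment("2 High Street, London", 0, 50, "address"),
]
roots = build_reading_tree(demo)
print("\n".join(outline(roots)))
```

Walking the segments in reverse means each address is seen before its organization and each organization before its role heading, so a parent can adopt its children the moment it is reached. A real directory page with several visual blocks side by side would need a grouping step before this ordering is meaningful; the sketch leaves that out.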


