Unfolding the Structure of a Document using Deep Learning

09/29/2019
by Muhammad Mahbubur Rahman, et al.

Understanding and extracting information from large documents, such as business opportunities, academic articles, medical documents and technical reports, poses challenges not present in short documents. Such large documents may be multi-themed, complex and noisy, and may cover diverse topics. We describe a framework that can analyze large documents and help people and computer systems locate desired information in them. We aim to automatically identify and classify the different sections of a document and understand their purpose within it. A key contribution of our research is modeling and extracting the logical and semantic structure of electronic documents using deep learning techniques. We evaluate the effectiveness and robustness of our framework through extensive experiments on two collections: more than one million scholarly articles from arXiv and a collection of request-for-proposal documents from government sources.
