Beyond Document Page Classification: Design, Datasets, and Challenges

08/24/2023
by Jordy Van Landeghem, et al.

This paper highlights the need to bring document classification benchmarking closer to real-world applications, both in the nature of data tested (X: multi-channel, multi-paged, multi-industry; Y: class distributions and label set variety) and in classification tasks considered (f: multi-page document, page stream, and document bundle classification, ...). We identify the lack of public multi-page document classification datasets, formalize different classification tasks arising in application scenarios, and motivate the value of targeting efficient multi-page document representations. An experimental study on proposed multi-page document classification datasets demonstrates that current benchmarks have become irrelevant and need to be updated to evaluate complete documents, as they naturally occur in practice. This reality check also calls for more mature evaluation methodologies, covering calibration evaluation, inference complexity (time-memory), and a range of realistic distribution shifts (e.g., born-digital vs. scanning noise, shifting page order). Our study ends on a hopeful note by recommending concrete avenues for future improvements.
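
To make the call for calibration evaluation concrete, here is a minimal sketch of one widely used calibration metric, the expected calibration error (ECE). The abstract does not say which calibration metrics the paper uses, so the choice of ECE, the function name `expected_calibration_error`, and the equal-width binning with `n_bins=10` are all assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Illustrative ECE with equal-width confidence bins (assumed setup).

    confidences: the model's top-class probability per document, in [0, 1].
    correct: whether the top-class prediction matched the label.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    conf_sum = [0.0] * n_bins  # summed confidence per bin
    hit_sum = [0] * n_bins     # number of correct predictions per bin
    count = [0] * n_bins       # number of predictions per bin

    for c, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += int(ok)
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue
        accuracy = hit_sum[b] / count[b]
        avg_conf = conf_sum[b] / count[b]
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (count[b] / n) * abs(accuracy - avg_conf)
    return ece
```

A classifier that predicts with 0.9 confidence but is right only three times out of four has a gap of about 0.15 between confidence and accuracy, which is the kind of overconfidence that accuracy alone does not reveal.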

research
12/09/2019

Modular Multimodal Architecture for Document Classification

Page classification is a crucial component to any document analysis syst...
research
05/15/2023

Document Understanding Dataset and Evaluation (DUDE)

We call on the Document AI (DocAI) community to reevaluate current metho...
research
05/26/2022

Semantic Parsing of Interpage Relations

Page-level analysis of documents has been a topic of interest in digitiz...
research
10/09/2017

Page Stream Segmentation with Convolutional Neural Nets Combining Textual and Visual Features

For digitization of paper files via OCR, preservation of document contex...
research
07/15/2020

Evaluation of Neural Network Classification Systems on Document Stream

One major drawback of state of the art Neural Networks (NN)-based approa...
research
03/21/2022

Efficient Classification of Long Documents Using Transformers

Several methods have been proposed for classifying long textual document...
research
03/15/2020

Multistage Curvilinear Coordinate Transform Based Document Image Dewarping using a Novel Quality Estimator

The present work demonstrates a fast and improved technique for dewarpin...
