Evaluation of Neural Network Classification Systems on Document Stream

07/15/2020
by Joris Voerman, et al.

One major drawback of state-of-the-art neural network (NN)-based approaches to document classification is the large number of training samples required to obtain efficient classification. The minimum required number is around one thousand annotated documents for each class. In many cases it is very difficult, if not impossible, to gather this number of samples in real industrial processes. In this paper, we analyse the efficiency of NN-based document classification systems in a sub-optimal training case, based on the situation of a company document stream. We evaluated three different approaches, one based on image content and two on textual content. The evaluation was divided into four parts: a reference case, to assess the performance of the system in the lab; two cases that each simulate a specific difficulty linked to document stream processing; and a realistic case that combines all of these difficulties. The realistic case highlighted a significant drop in the efficiency of NN-based document classification systems. Although they remain efficient for well-represented classes (with the system over-fitting those classes), they cannot appropriately handle less well-represented classes. NN-based document classification systems need to be adapted to resolve these two problems before they can be considered for use in a company document stream.
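The contrast drawn above between well-represented and poorly represented classes can be illustrated with a minimal sketch. It is not taken from the paper: the class names, the class counts, and the always-predict-the-majority classifier are hypothetical, chosen only to show how an overall score can hide a failure on a rare class. The abstract does not say which metrics the authors used.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred):
    """Return {class: recall} for every class that appears in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

# Hypothetical, illustrative stream: 95 "invoice" documents, 5 "contract" documents.
y_true = ["invoice"] * 95 + ["contract"] * 5

# Hypothetical classifier that has over-fitted the majority class
# and predicts "invoice" for every document.
y_pred = ["invoice"] * 100

overall = accuracy(y_true, y_pred)
recalls = per_class_recall(y_true, y_pred)
macro_recall = sum(recalls.values()) / len(recalls)

print(f"overall accuracy: {overall:.2f}")
for name, value in recalls.items():
    print(f"recall[{name}]: {value:.2f}")
print(f"macro-averaged recall: {macro_recall:.2f}")
```

In this toy setting the overall accuracy looks high while the rare class is never recognised, which is why per-class figures are worth reporting alongside any aggregate score when classes are imbalanced.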
