EAML: Ensemble Self-Attention-based Mutual Learning Network for Document Image Classification

05/11/2023
by Souhail Bakkali, et al.

In recent years, complex deep neural networks have attracted considerable interest for document understanding tasks such as document image classification and document retrieval. As many document types have a distinct visual style, approaches that learn only visual features with deep CNNs to classify document images have encountered two problems: low inter-class discrimination and high intra-class structural variation between categories. In parallel, text-level understanding learned jointly with the corresponding visual properties of a given document image has considerably improved classification accuracy. In this paper, we design a self-attention-based fusion module that serves as a block in our ensemble trainable network. It allows the network to learn the discriminant features of the image and text modalities simultaneously throughout the training stage. In addition, we encourage mutual learning by transferring positive knowledge between the image and text modalities during training. This constraint is realized by adding a truncated Kullback-Leibler divergence loss (Tr-KLD-Reg) as a new regularization term to the conventional supervised setting. To the best of our knowledge, this is the first work to combine a mutual learning approach with a self-attention-based fusion module for document image classification. The experimental results illustrate the effectiveness of our approach in terms of accuracy in both the single-modal and multi-modal settings. The proposed ensemble self-attention-based mutual learning model outperforms state-of-the-art classification results on the benchmark RVL-CDIP and Tobacco-3482 datasets.
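The mutual learning idea described above can be illustrated with a minimal sketch, assuming a plain (untruncated) Kullback-Leibler term between the class distributions of the image and text branches. The abstract does not specify how Tr-KLD-Reg truncates the divergence, so that step is omitted here; the function names and the `reg_weight` parameter are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Supervised loss for one sample given its true class index."""
    return -math.log(probs[label] + 1e-12)


def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions of equal length."""
    eps = 1e-12
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mutual_learning_losses(image_logits, text_logits, label, reg_weight=1.0):
    """Return (image_loss, text_loss) for one sample.

    Each branch gets its usual supervised loss plus a KL term that pulls
    its prediction toward the other branch's prediction. This is ordinary
    mutual learning; the truncation used by Tr-KLD-Reg is NOT modelled.
    `reg_weight` is a hypothetical weighting factor.
    """
    p_img = softmax(image_logits)
    p_txt = softmax(text_logits)
    loss_img = cross_entropy(p_img, label) + reg_weight * kl_divergence(p_txt, p_img)
    loss_txt = cross_entropy(p_txt, label) + reg_weight * kl_divergence(p_img, p_txt)
    return loss_img, loss_txt
```

When both branches produce the same distribution, the KL term vanishes and each loss reduces to the supervised cross-entropy; the regularizer only contributes when the two modalities disagree.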


research
06/22/2021

DocFormer: End-to-End Transformer for Document Understanding

We present DocFormer – a multi-modal transformer based architecture for ...
research
12/25/2019

Multi-Modal Attention-based Fusion Model for Semantic Segmentation of RGB-Depth Images

The 3D scene understanding is mainly considered as a crucial requirement...
research
11/05/2019

Self-Attention and Ingredient-Attention Based Model for Recipe Retrieval from Image Queries

Direct computer vision based-nutrient content estimation is a demanding ...
research
02/28/2023

VQA with Cascade of Self- and Co-Attention Blocks

The use of complex attention modules has improved the performance of the...
research
09/18/2023

CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval

Text-based Person Retrieval aims to retrieve the target person images gi...
research
04/02/2019

The Verbal and Non Verbal Signals of Depression -- Combining Acoustics, Text and Visuals for Estimating Depression Level

Depression is a serious medical condition that is suffered by a large nu...
research
01/05/2022

Synthesizing Tensor Transformations for Visual Self-attention

Self-attention shows outstanding competence in capturing long-range rela...
