Multimodal deep networks for text and image-based document classification

07/15/2019 ∙ by Nicolas Audebert, et al. ∙ QUICKSIGN 16

Classification of document images is a critical step for archival of old manuscripts, online subscription and administrative procedures. Computer vision and deep learning have been suggested as a first solution to classify documents based on their visual appearance. However, achieving the fine-grained classification that is required in real-world setting cannot be achieved by visual analysis alone. Often, the relevant information is in the actual text content of the document. We design a multimodal neural network that is able to learn from word embeddings, computed on text extracted by OCR, and from the image. We show that this approach boosts pure image accuracy by 3 Tobacco3482 and RVL-CDIP augmented by our new QS-OCR text dataset (, even without clean text information.



There are no comments yet.


page 1

page 2

page 3

page 4

Code Repositories


Document image classification on the Tobacco-3482 dataset using multi-modal CNNs

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Figure 1: Multimodal classifier for hybrid text/image classification. Training is performed end-to-end on both textual and visual features.

[xleftmargin=1mm,fontsize=] PHILIP MORRIS U.S.A. koe INTER - OFFICE CORRESPONDENCE 100 Park Avenua, New York, N.Y. 10017 Distribution Date: May 2, 1979 R, Atlas Competitive Activity

Section Sales Manager Artie Glacberman reports the following: …

(b) Memo

[xleftmargin=1mm,fontsize=] Orginal Message-— From: McCarthy, Josnne Sent. ‘Thursday, June 24, 1990 10:47 AM Tor Met, Eten; Daragan, Karon M; McCormick, Brendan; Femandez, Henry L. (Ge: Fisher, Scot: Lon Jock: Gamavalo, Mary ; Pol, Michael. Ley, Ceo J Subject: RE: MA™-AG out and Dr. Connolys remarks ‘Any follow up on this with Connolly and his remarks relative to our Lawsuit? Joanne Oztrlseezloz

(c) Email


Creating the Future Good Organization

Lower Improve Periodic Employees e Cost Quality New Products Do Their Job


(d) Presentation
(a) Questionnaire
(e) Document samples from the RVL-CDIP [1] dataset with corresponding text extracted by Tesseract OCR.
(a) Questionnaire