VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach
We introduce a novel approach to scanned document representation for the field extraction task. It simultaneously encodes the textual, visual, and layout information of a document in a 3D matrix used as input to a segmentation model. We improve on the recent Chargrid and Wordgrid models in several directions: first by taking the visual modality into account, then by boosting robustness on small datasets, all while keeping inference time low. Our approach is tested on public and private document image datasets, showing higher performance than recent state-of-the-art methods.
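To make the core idea concrete, here is a minimal sketch of how a page might be encoded as a 3D grid that fuses the three modalities: image pixels supply the visual channels, word bounding boxes supply the layout, and token embeddings painted over those boxes supply the text. The function name, channel layout, and `embed` callable are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def build_grid(image, words, boxes, embed, emb_dim=8):
    """Hypothetical sketch: encode a page as an H x W x (3 + emb_dim) grid.

    image : H x W x 3 uint8 array (visual modality)
    words : list of token strings (textual modality)
    boxes : list of (x0, y0, x1, y1) pixel boxes per word (layout modality)
    embed : callable mapping a token to an emb_dim vector (assumed given)
    """
    h, w, _ = image.shape
    grid = np.zeros((h, w, 3 + emb_dim), dtype=np.float32)
    # First 3 channels: normalized RGB image (visual modality).
    grid[..., :3] = image / 255.0
    # Remaining channels: word embedding rasterized over each word's box,
    # so spatial position (layout) and text content are encoded jointly.
    for word, (x0, y0, x1, y1) in zip(words, boxes):
        grid[y0:y1, x0:x1, 3:] = embed(word)
    return grid
```

A grid built this way can be fed to any pixel-wise segmentation network, which then labels each pixel with a field class; field values are recovered by grouping the words whose boxes fall in each predicted region.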