Learning Multimodal Affinities for Textual Editing in Images

03/18/2021
by Or Perel, et al.

As cameras are rapidly adopted in our daily routine, images of documents are becoming abundant. Unlike natural images that capture physical objects, document-images contain a significant amount of text with critical semantics and complicated layouts. In this work, we devise a generic unsupervised technique to learn multimodal affinities between textual entities in a document-image, considering their visual style, the content of their underlying text, and their geometric context within the image. We then use these learned affinities to automatically cluster the textual entities in the image into different semantic groups. The core of our approach is a deep optimization scheme dedicated to a single image provided by the user; it detects and leverages reliable pairwise connections in the multimodal representation of the textual elements in order to learn the affinities. We show that our technique can operate on highly varying images spanning a wide range of documents, and we demonstrate its applicability to various editing operations that manipulate the content, appearance, and geometry of the image.
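To make the idea of clustering textual entities by multimodal affinity concrete, here is a minimal sketch. It is not the deep optimization scheme the abstract describes: it assumes each textual entity already has three hand-made feature vectors (visual, textual, geometric), averages their pairwise cosine similarities with fixed weights, and groups entities by thresholding. The function names, the weights, and the threshold are all illustrative assumptions.

```python
import numpy as np

def cosine_affinity(features):
    """Pairwise cosine similarity between the rows of `features`."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)  # guard against zero rows
    return unit @ unit.T

def multimodal_affinity(visual, textual, geometric, weights=(1/3, 1/3, 1/3)):
    """Weighted average of per-modality affinities (weights are hand-set
    here, whereas the paper learns the affinities)."""
    mats = [cosine_affinity(m) for m in (visual, textual, geometric)]
    return sum(w * m for w, m in zip(weights, mats))

def cluster_by_threshold(affinity, threshold=0.8):
    """Connected components of the graph that links entities i and j
    whenever affinity[i, j] >= threshold. Returns one label per entity."""
    n = affinity.shape[0]
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and affinity[i, j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Toy example: entities 0 and 1 look like headings, 2 and 3 like body text.
visual = [[1, 0], [1, 0], [0, 1], [0, 1]]
textual = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
geometric = [[1, 0], [1, 0.1], [0, 1], [0.1, 1]]
groups = cluster_by_threshold(multimodal_affinity(visual, textual, geometric))
```

In this toy setting the two heading-like entities end up in one group and the two body-like entities in another. A learned affinity would replace the fixed weighted average with one optimized for the specific image.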

research
04/16/2019

Unsupervised Discovery of Multimodal Links in Multi-Image, Multi-Sentence Documents

Images and text co-occur everywhere on the web, but explicit links betwe...
research
07/15/2019

Multimodal deep networks for text and image-based document classification

Classification of document images is a critical step for archival of old...
research
02/01/2021

Deep Learning-based Forgery Attack on Document Images

With the ongoing popularization of online services, the digital document...
research
02/14/2020

Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers

The massive amounts of digitized historical documents acquired over the ...
research
08/29/2019

KBSET -- Knowledge-Based Support for Scholarly Editing and Text Processing

KBSET supports a practical workflow for scholarly editing, based on usin...
research
02/28/2023

Audio Retrieval for Multimodal Design Documents: A New Dataset and Algorithms

We consider and propose a new problem of retrieving audio files relevant...
research
06/21/2018

Don't only Feel Read: Using Scene text to understand advertisements

We propose a framework for automated classification of Advertisement Ima...
