Towards Flexible Multi-modal Document Models

03/31/2023
by Naoto Inoue, et al.

Creative workflows for generating graphical documents involve complex, inter-related tasks, such as aligning elements, choosing appropriate fonts, or employing aesthetically harmonious colors. In this work, we attempt to build a holistic model that can jointly solve many different design tasks. Our model, which we denote by FlexDM, treats vector graphic documents as a set of multi-modal elements and learns to predict masked fields, such as element type, position, styling attributes, image, or text, using a unified architecture. Through explicit multi-task learning and in-domain pre-training, our model can better capture the multi-modal relationships among the different document fields. Experimental results corroborate that a single FlexDM model can successfully solve a multitude of different design tasks while achieving performance that is competitive with task-specific and costly baselines.
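To make the masked-field formulation concrete, the following is a minimal sketch of how a document might be represented as a set of elements with some fields hidden for prediction. It is not the authors' implementation: the `MASK` sentinel, the `mask_fields` helper, the field names, and the toy document are all assumptions introduced purely for illustration.

```python
import random
from typing import Any, Dict, List, Tuple

MASK = "[MASK]"  # hypothetical sentinel marking a hidden field

Element = Dict[str, Any]


def mask_fields(
    document: List[Element],
    fields: List[str],
    ratio: float = 0.5,
    seed: int = 0,
) -> Tuple[List[Element], Dict[Tuple[int, str], Any]]:
    """Hide a random subset of the named fields in a copy of the document.

    Returns the masked copy and a mapping from (element index, field name)
    to the original value, which a model would be trained to predict.
    """
    rng = random.Random(seed)
    masked = [dict(el) for el in document]  # copy so the input is untouched
    targets: Dict[Tuple[int, str], Any] = {}
    for i, el in enumerate(masked):
        for name in fields:
            if name in el and rng.random() < ratio:
                targets[(i, name)] = el[name]
                el[name] = MASK
    return masked, targets


# Toy document: each element carries heterogeneous (multi-modal) fields.
document = [
    {"type": "text", "position": (0.1, 0.1, 0.8, 0.2),
     "font": "Roboto", "text": "Summer Sale"},
    {"type": "image", "position": (0.0, 0.3, 1.0, 0.7),
     "image": "photo.png"},
]

# Choosing which fields to mask selects the design task in this sketch:
# here, font and position are hidden while type, text, and image stay visible.
masked, targets = mask_fields(document, fields=["font", "position"], ratio=1.0)
```

Under this framing, different design tasks correspond to different choices of `fields`: masking only `position` resembles a layout task, while masking only `font` resembles a styling task. A unified model would then be trained to recover `targets` from `masked`.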


