Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment for Markup-to-Image Generation

08/02/2023
by Guojin Zhong, et al.

Markup-to-image generation, a recently emerging task, poses greater challenges than natural image generation because of its low tolerance for errors and the complex sequence and context correlations between markup and rendered image. This paper proposes a novel model named "Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment" (FSA-CDM), which introduces contrastive positive/negative samples into the diffusion model to boost performance for markup-to-image generation. Technically, we design a fine-grained cross-modal alignment module that thoroughly explores the sequence similarity between the two modalities to learn robust feature representations. To improve generalization, we propose a contrast-augmented diffusion model that explicitly explores positive and negative samples by maximizing a novel contrastive variational objective, which is mathematically derived to provide a tighter bound for the model's optimization. Moreover, a context-aware cross-attention module is developed to capture the contextual information within the markup language during the denoising process, yielding better noise prediction. Extensive experiments on four benchmark datasets from different domains demonstrate the effectiveness of the proposed components in FSA-CDM, which significantly exceeds state-of-the-art performance by about 2…
https://github.com/zgj77/FSACDM
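The abstract does not spell out the alignment objective, so the following is only a rough, generic illustration of how positive and negative markup/image pairs can be contrasted with a fine-grained, token-to-patch score. Each markup-token embedding is matched to its most similar image-patch embedding, the per-token maxima are averaged into a sequence-level score, and an InfoNCE loss treats the paired image as the positive and the other images in the batch as negatives. The function names, the max-then-average scoring, the temperature `tau`, and the toy embeddings are all assumptions for illustration; this is not FSA-CDM's actual alignment module or its contrastive variational objective.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fine_grained_score(tokens: List[Vector], patches: List[Vector]) -> float:
    """Hypothetical sequence-level score: match every markup token to its most
    similar image patch, then average those per-token maxima."""
    return sum(max(cosine(t, p) for p in patches) for t in tokens) / len(tokens)


def contrastive_loss(markups: List[List[Vector]],
                     images: List[List[Vector]],
                     tau: float = 0.1) -> float:
    """InfoNCE over a batch: pair (i, i) is the positive, pairs (i, j != i)
    are negatives. Uses a numerically stable log-sum-exp."""
    n = len(markups)
    total = 0.0
    for i in range(n):
        logits = [fine_grained_score(markups[i], images[j]) / tau for j in range(n)]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive pair
    return total / n


if __name__ == "__main__":
    # Toy 2-D "embeddings": markup i is meant to describe image i.
    markups = [[[1.0, 0.0], [1.0, 0.1]], [[0.0, 1.0]]]
    images = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0]]]
    print("aligned pairs:", contrastive_loss(markups, images))
    print("swapped pairs:", contrastive_loss(markups, images[::-1]))
```

With correctly paired inputs the loss is small, and swapping the images so that each markup is paired with the wrong rendering makes it larger. Max-over-patches is used here because it lets each markup token find its own rendered region instead of comparing only pooled global vectors; other fine-grained scoring rules would fit the same loss.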


research
11/11/2022

HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for Controllable Text-Driven Person Image Generation

Text-driven person image generation is an emerging and challenging task ...
research
07/29/2022

Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval

This paper investigates an open research problem of generating text-imag...
research
05/08/2018

Reasoning with Sarcasm by Reading In-between

Sarcasm is a sophisticated speech act which commonly manifests on social...
research
06/17/2022

CDNet: Contrastive Disentangled Network for Fine-Grained Image Categorization of Ocular B-Scan Ultrasound

Precise and rapid categorization of images in the B-scan ultrasound moda...
research
02/28/2023

Dissolving Is Amplifying: Towards Fine-Grained Anomaly Detection

Medical anomalous data normally contains fine-grained instance-wise addi...
research
12/16/2022

SADM: Sequence-Aware Diffusion Model for Longitudinal Medical Image Generation

Human organs constantly undergo anatomical changes due to a complex mix ...
research
04/10/2023

DDRF: Denoising Diffusion Model for Remote Sensing Image Fusion

Denoising diffusion model, as a generative model, has received a lot of a...