All are Worth Words: a ViT Backbone for Score-based Diffusion Models

09/25/2022
by   Fan Bao, et al.

Vision transformers (ViT) have shown promise in various vision tasks, including low-level ones, while the U-Net remains dominant in score-based diffusion models. In this paper, we perform a systematic empirical study of ViT-based architectures in diffusion models. Our results suggest that adding extra long skip connections (as in the U-Net) to ViT is crucial for diffusion models. The new ViT architecture, together with other improvements, is referred to as U-ViT. On several popular visual datasets, U-ViT achieves generation results competitive with SOTA U-Net-based models while requiring a comparable, if not smaller, amount of parameters and computation.
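The key architectural change the abstract describes, U-Net-style long skip connections between shallow and deep transformer blocks, can be illustrated with a minimal sketch. This is not the authors' implementation: the block here is a toy residual stand-in for a transformer layer, and all names (`block`, `u_vit_forward`, the weight lists) are hypothetical. The point is only the wiring: the first half of the network stores its activations, and the second half concatenates each stored activation with the current features and projects back to the model width before the next block.

```python
import numpy as np

def block(x, w):
    # Hypothetical stand-in for a transformer block: a simple
    # nonlinear map with a residual connection.
    return x + np.tanh(x @ w)

def u_vit_forward(tokens, weights_down, weights_up, weights_merge):
    # First half: run blocks and store each output for a long skip.
    skips = []
    h = tokens
    for w in weights_down:
        h = block(h, w)
        skips.append(h)
    # Second half: concatenate the matching shallow activation with
    # the current features, project back to width d, then apply the block.
    for w, wm in zip(weights_up, weights_merge):
        h = np.concatenate([h, skips.pop()], axis=-1) @ wm  # (n, 2d) -> (n, d)
        h = block(h, w)
    return h

rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(16, d))                        # 16 tokens of width 8
wd = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]   # first-half blocks
wu = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]   # second-half blocks
wm = [rng.normal(size=(2 * d, d)) * 0.1 for _ in range(3)]  # merge projections
out = u_vit_forward(tokens, wd, wu, wm)
print(out.shape)  # (16, 8)
```

The concatenate-then-project merge mirrors how the U-Net fuses encoder and decoder features; in an actual U-ViT the tokens would also include the timestep and condition embeddings, which are treated as words alongside image patches.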

Related research

- 12/28/2022, Exploring Vision Transformers as Diffusion Learners: Score-based diffusion models have captured widespread attention and fund...
- 09/20/2023, FreeU: Free Lunch in Diffusion U-Net: In this paper, we uncover the untapped potential of diffusion U-Net, whi...
- 11/14/2021, HAD-Net: Hybrid Attention-based Diffusion Network for Glucose Level Forecast: Data-driven models for glucose level forecast often do not provide meani...
- 05/22/2023, U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech: Deep learning has led to considerable advances in text-to-speech synthes...
- 12/19/2022, Scalable Diffusion Models with Transformers: We explore a new class of diffusion models based on the transformer arch...
- 01/09/2022, MAXIM: Multi-Axis MLP for Image Processing: Recent progress on Transformers and multi-layer perceptron (MLP) models ...
- 04/18/2019, An Efficient Approximate kNN Graph Method for Diffusion on Image Retrieval: The application of the diffusion in many computer vision and artificial ...
