We present InstructDiffusion, a unifying and generic framework for align...
Recent vision transformers, large-kernel CNNs and MLPs have attained rem...
While generative modeling has been ubiquitous in natural language proces...
Denoising diffusion models have been a mainstream approach for image gen...
Masked image modeling (MIM) learns representations with remarkably good...
Image classification, which classifies images by pre-defined categories,...
In this work we propose Identity Consistency Transformer, a novel face f...
Despite the tantalizing success in a broad range of vision tasks, transformers...
We present the vector quantized diffusion (VQ-Diffusion) model for text-...
We study joint video and language (VL) pre-training to enable cross-moda...
We present techniques for scaling Swin Transformer up to 3 billion param...
We present CSWin Transformer, an efficient and effective Transformer-bas...
State-of-the-art image inpainting approaches can suffer from generating...
This paper presents a new vision Transformer, called Swin Transformer, t...
DeepFake detection has so far been dominated by “artifact-driven” method...
We study image super-resolution (SR), which aims to recover realistic...
In this paper we propose a novel image representation called face X-ray ...
High-quality image inpainting requires filling missing regions in a dama...
With the growing popularity of short-form video sharing platforms such a...
In this paper, we propose a new data structure for approximate nearest n...
Uncertainty plays a central role in spoken dialogue systems. Some stocha...