Zhe Gan

research

∙ 09/18/2023

Multimodal Foundation Models: From Specialists to General-Purpose Assistants

This paper presents a comprehensive survey of the taxonomy and evolution...

0 Chunyuan Li, et al. ∙

research

∙ 04/28/2023

An Empirical Study of Multimodal Model Merging

Model merging (e.g., via interpolation or task arithmetic) fuses multipl...

3 Yi-Lin Sung, et al. ∙

research

∙ 04/13/2023

Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation

Spatial control is a core capability in controllable image generation. A...

4 Jaemin Cho, et al. ∙

research

∙ 12/21/2022

Generalized Decoding for Pixel, Image, and Language

We present X-Decoder, a generalized decoding model that can predict pixe...

10 Xueyan Zou, et al. ∙

research

∙ 12/01/2022

GRiT: A Generative Region-to-text Transformer for Object Understanding

This paper presents a Generative RegIon-to-Text transformer, GRiT, for o...

0 Jialian Wu, et al. ∙

research

∙ 11/21/2022

Exploring Discrete Diffusion Models for Image Captioning

The image captioning task is typically realized by an auto-regressive me...

0 Zixin Zhu, et al. ∙

research

∙ 10/17/2022

Non-Contrastive Learning Meets Language-Image Pre-Training

Contrastive language-image pre-training (CLIP) serves as a de-facto stan...

1 Jinghao Zhou, et al. ∙

research

∙ 10/17/2022

Vision-Language Pre-training: Basics, Recent Advances, and Future Trends

This paper surveys vision-language pre-training (VLP) methods for multim...

0 Zhe Gan, et al. ∙

research

∙ 10/17/2022

Prompting GPT-3 To Be Reliable

Large language models (LLMs) show impressive abilities via few-shot prom...

0 Chenglei Si, et al. ∙

research

∙ 09/04/2022

An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling

Masked visual modeling (MVM) has been recently proven effective for visu...

8 Tsu-Jui Fu, et al. ∙

research

∙ 07/20/2022

NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

In this paper, we present NUWA-Infinity, a generative model for infinite...

4 Chenfei Wu, et al. ∙

research

∙ 06/15/2022

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Vision-language (VL) pre-training has recently received considerable att...

13 Zi-Yi Dou, et al. ∙

research

∙ 06/14/2022

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

Unified vision-language frameworks have greatly advanced in recent years...

9 Linjie Li, et al. ∙

research

∙ 05/27/2022

GIT: A Generative Image-to-text Transformer for Vision and Language

In this paper, we design and train a Generative Image-to-text Transforme...

14 Jianfeng Wang, et al. ∙

research

∙ 04/20/2022

K-LITE: Learning Transferable Visual Models with External Knowledge

Recent state-of-the-art computer vision systems are trained from natural...

3 Sheng Shen, et al. ∙

research

∙ 12/09/2021

Injecting Semantic Concepts into End-to-End Image Captioning

Tremendous progress has been made in recent years in developing better i...

0 Zhiyuan Fang, et al. ∙

research

∙ 12/08/2021

MLP Architectures for Vision-and-Language Modeling: An Empirical Study

We initiate the first empirical study on the use of MLP architectures fo...

2 Yixin Nie, et al. ∙

research

∙ 11/25/2021

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

The canonical approach to video captioning dictates a caption generation...

29 Kevin Lin, et al. ∙

research

∙ 11/24/2021

VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling

A great challenge in video-language (VidL) modeling lies in the disconne...

19 Tsu-Jui Fu, et al. ∙

research

∙ 11/24/2021

Scaling Up Vision-Language Pre-training for Image Captioning

In recent years, we have witnessed significant performance boost in the ...

0 Xiaowei Hu, et al. ∙

research

∙ 11/23/2021

Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling

In this paper, we propose UNICORN, a vision-language (VL) model that uni...

7 Zhengyuan Yang, et al. ∙

research

∙ 11/19/2021

UFO: A UniFied TransfOrmer for Vision-Language Representation Learning

In this paper, we propose a single UniFied transfOrmer (UFO), which is c...

0 Jianfeng Wang, et al. ∙

research

∙ 11/04/2021

Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models

Large-scale pre-trained language models have achieved tremendous success...

0 Boxin Wang, et al. ∙

research

∙ 11/03/2021

An Empirical Study of Training End-to-End Vision-and-Language Transformers

Vision-and-language (VL) pre-training has proven to be highly effective ...

0 Zi-Yi Dou, et al. ∙

research

∙ 09/10/2021

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

Knowledge-based visual question answering (VQA) involves answering quest...

0 Zhengyuan Yang, et al. ∙

research

∙ 06/08/2021

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

Most existing video-and-language (VidL) research focuses on a single dat...

3 Linjie Li, et al. ∙

research

∙ 06/08/2021

Chasing Sparsity in Vision Transformers: An End-to-End Exploration

Vision transformers (ViTs) have recently received explosive popularity, ...

0 Tianlong Chen, et al. ∙

research

∙ 06/01/2021

Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models

With large-scale pre-training, the past two years have witnessed signifi...

22 Linjie Li, et al. ∙

research

∙ 04/23/2021

Playing Lottery Tickets with Vision and Language

Large-scale transformer-based pre-training has recently revolutionized v...

11 Zhe Gan, et al. ∙

research

∙ 04/01/2021

CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning

This work concerns video-language pre-training and representation learni...

0 Luowei Zhou, et al. ∙

research

∙ 03/30/2021

The Elastic Lottery Ticket Hypothesis

Lottery Ticket Hypothesis raises keen attention to identifying sparse tr...

0 Xiaohan Chen, et al. ∙

research

∙ 03/22/2021

Adversarial Feature Augmentation and Normalization for Visual Recognition

Recent advances in computer vision take advantage of adversarial data au...

14 Tianlong Chen, et al. ∙

research

∙ 03/17/2021

Improving Zero-shot Voice Style Transfer via Disentangled Representation Learning

Voice style transfer, also called voice conversion, seeks to modify one ...

0 Siyang Yuan, et al. ∙

research

∙ 02/28/2021

Ultra-Data-Efficient GAN Training: Drawing A Lottery Ticket First, Then Training It Toughly

Training generative adversarial networks (GANs) with limited data genera...

0 Tianlong Chen, et al. ∙

research

∙ 02/11/2021

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

The canonical approach to video-and-language learning (e.g., video quest...

2 Jie Lei, et al. ∙

research

∙ 12/31/2020

EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets

Deep, heavily overparameterized language models such as BERT, XLNet and ...

0 Xiaohan Chen, et al. ∙

research

∙ 12/15/2020

Wasserstein Contrastive Representation Distillation

The primary goal of knowledge distillation (KD) is to encapsulate the in...

0 Liqun Chen, et al. ∙

research

∙ 12/15/2020

A Closer Look at the Robustness of Vision-and-Language Pre-trained Models

Large-scale pre-trained multimodal transformers, such as ViLBERT and UNI...

1 Linjie Li, et al. ∙

research

∙ 10/07/2020

Cross-Thought for Sentence Encoder Pre-training

In this paper, we propose Cross-Thought, a novel approach to pre-trainin...

0 Shuohang Wang, et al. ∙

research

∙ 10/06/2020

Multi-Fact Correction in Abstractive Text Summarization

Pre-trained neural abstractive summarization systems have dominated extr...

0 Yue Dong, et al. ∙

research

∙ 10/05/2020

InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective

Large-scale language models such as BERT have achieved state-of-the-art ...

0 Boxin Wang, et al. ∙

research

∙ 10/03/2020

Efficient Robust Training via Backward Smoothing

Adversarial training is so far the most effective strategy in defending ...

1 Jinghui Chen, et al. ∙

research

∙ 09/29/2020

Contrastive Distillation on Intermediate Representations for Language Model Compression

Existing language model compression methods mostly use a simple L2 loss ...

0 Siqi Sun, et al. ∙

research

∙ 09/13/2020

Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding

Transformer has become ubiquitous in the deep learning field. One of the...

0 Shuohang Wang, et al. ∙

research

∙ 09/10/2020

Accelerating Real-Time Question Answering via Question Generation

Existing approaches to real-time question answering (RTQA) rely on learn...

0 Yuwei Fang, et al. ∙

research

∙ 09/10/2020

FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding

Large-scale cross-lingual language models (LM), such as mBERT, Unicoder ...

0 Yuwei Fang, et al. ∙

research

∙ 06/26/2020

Graph Optimal Transport for Cross-Domain Alignment

Cross-domain alignment between two sets of entities (e.g., objects in an...

9 Liqun Chen, et al. ∙

research

∙ 06/22/2020

CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information

Mutual information (MI) minimization has gained considerable interests i...

38 Pengyu Cheng, et al. ∙

research

∙ 06/21/2020

Adaptive Learning Rates with Maximum Variation Averaging

Adaptive gradient methods such as RMSProp and Adam use exponential movin...

19 Chen Zhu, et al. ∙

research

∙ 06/11/2020

Large-Scale Adversarial Training for Vision-and-Language Representation Learning

We present VILLA, the first known effort on large-scale adversarial trai...

6 Zhe Gan, et al. ∙

