
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
  The canonical approach to video-and-language learning (e.g., video quest...
- A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
  Large-scale pre-trained multimodal transformers, such as ViLBERT and UNI...
- Graph Optimal Transport for Cross-Domain Alignment
  Cross-domain alignment between two sets of entities (e.g., objects in an...
- Large-Scale Adversarial Training for Vision-and-Language Representation Learning
  We present VILLA, the first known effort on large-scale adversarial trai...
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
  We present HERO, a Hierarchical EncodeR for Omni-representation learning...
- Meta Module Network for Compositional Visual Reasoning
  There are two main lines of research on visual reasoning: neural module ...
- UNITER: Learning UNiversal Image-TExt Representations
  Joint image-text embedding is the bedrock for most Vision-and-Language (...
- Relation-aware Graph Attention Network for Visual Question Answering
  In order to answer semantically-complicated questions about an image, a ...
- Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog
  This paper presents Recurrent Dual Attention Network (ReDAN) for visual ...
- Learning to see people like people
  Humans make complex inferences on faces, ranging from objective properti...