Luowei Zhou

research

∙ 06/14/2023

AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

Recent research on Large Language Models (LLMs) has led to remarkable ad...

0 Difei Gao, et al. ∙

research

∙ 03/29/2023

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

The development of language models has moved from encoder-decoder to de...

0 Weicheng Kuo, et al. ∙

research

∙ 12/19/2022

MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

To build Video Question Answering (VideoQA) systems capable of assisting...

0 Difei Gao, et al. ∙

research

∙ 09/15/2022

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

This paper presents OmniVL, a new foundation model to support both image...

27 Junke Wang, et al. ∙

research

∙ 07/26/2022

Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training

Large-scale multi-modal contrastive pre-training has demonstrated great ...

7 Haoxuan You, et al. ∙

research

∙ 06/03/2022

Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning

People say, "A picture is worth a thousand words". Then how can we get t...

0 Yujia Xie, et al. ∙

research

∙ 05/22/2022

Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

The goal of this work is to build flexible video-language models that ca...

11 Zhenhailong Wang, et al. ∙

research

∙ 04/22/2022

Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks

Cross-modal encoders for vision-language (VL) tasks are often pretrained...

3 Zhecan Wang, et al. ∙

research

∙ 01/15/2022

CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks

Contrastive language-image pretraining (CLIP) links vision and language ...

15 Zhecan Wang, et al. ∙

research

∙ 01/13/2022

CLIP-Event: Connecting Text and Images with Event Structures

Vision-language (V+L) pretraining models have achieved great success in ...

6 Manling Li, et al. ∙

research

∙ 12/16/2021

RegionCLIP: Region-based Language-Image Pretraining

Contrastive language-image pretraining (CLIP) using image-text pairs has...

2 Yiwu Zhong, et al. ∙

research

∙ 11/22/2021

Florence: A New Foundation Model for Computer Vision

Automated visual understanding of our diverse and open world demands com...

4 Lu Yuan, et al. ∙

research

∙ 06/08/2021

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

Most existing video-and-language (VidL) research focuses on a single dat...

3 Linjie Li, et al. ∙

research

∙ 04/01/2021

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

Vision-and-language pre-training has achieved impressive success in lear...

2 Mingyang Zhou, et al. ∙

research

∙ 04/01/2021

CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning

This work concerns video-language pre-training and representation learni...

0 Luowei Zhou, et al. ∙

research

∙ 02/11/2021

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

The canonical approach to video-and-language learning (e.g., video quest...

2 Jie Lei, et al. ∙

research

∙ 01/12/2021

Temporally Guided Articulated Hand Pose Tracking in Surgical Videos

Articulated hand pose tracking is an underexplored problem that carries ...

1 Nathan Louis, et al. ∙

research

∙ 09/13/2020

Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding

Transformer has become ubiquitous in the deep learning field. One of the...

0 Shuohang Wang, et al. ∙

research

∙ 09/24/2019

Unified Vision-Language Pre-Training for Image Captioning and VQA

This paper presents a unified Vision-Language Pre-training (VLP) model. ...

17 Luowei Zhou, et al. ∙

research

∙ 12/17/2018

Grounded Video Description

Video description is one of the most challenging problems in vision and ...

8 Luowei Zhou, et al. ∙

research

∙ 12/13/2018

Dynamic Graph Modules for Modeling Higher-Order Interactions in Activity Recognition

Video action recognition, as a critical problem towards video understand...

4 Hao Huang, et al. ∙

research

∙ 05/08/2018

Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction

We study weakly-supervised video object grounding: given a video segment...

0 Luowei Zhou, et al. ∙

research

∙ 04/03/2018

End-to-End Dense Video Captioning with Masked Transformer

Dense video captioning aims to generate text descriptions for all events...

0 Luowei Zhou, et al. ∙

research

∙ 03/28/2017

Towards Automatic Learning of Procedures from Web Instructional Videos

The potential for agents, whether embodied or software, to learn by obse...

0 Luowei Zhou, et al. ∙

research

∙ 06/15/2016

Watch What You Just Said: Image Captioning with Text-Conditional Attention

Attention mechanisms have attracted considerable interest in image capti...

0 Luowei Zhou, et al. ∙

research

∙ 08/21/2015

Multi-agent Reinforcement Learning with Sparse Interactions by Negotiation and Knowledge Transfer

Reinforcement learning has significant applications for multi-agent syst...

0 Luowei Zhou, et al. ∙
