Shih-Fu Chang

research

∙ 07/03/2023

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

Vision-language tasks, such as VQA, SNLI-VE, and VCR are challenging bec...

0 Rui Sun, et al. ∙

research

∙ 05/27/2023

Non-Sequential Graph Script Induction via Multimedia Grounding

Online resources such as WikiHow compile a wide range of scripts for per...

4 Yu Zhou, et al. ∙

research

∙ 04/07/2023

Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering

Causal Video Question Answering (CVidQA) queries not only association or...

0 Hung-Ting Su, et al. ∙

research

∙ 03/29/2023

What, when, and where? – Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

Spatio-temporal grounding describes the task of localizing events in spa...

0 Brian Chen, et al. ∙

research

∙ 03/25/2023

Supervised Masked Knowledge Distillation for Few-Shot Transformers

Vision Transformers (ViTs) emerge to achieve impressive performance on m...

0 Han Lin, et al. ∙

research

∙ 03/16/2023

DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection

Generalized few-shot object detection aims to achieve precise detection ...

0 Jiawei Ma, et al. ∙

research

∙ 01/06/2023

In Defense of Structural Symbolic Representation for Video Event-Relation Prediction

Understanding event relationships in videos requires a model to understa...

0 Andrew Lu, et al. ∙

research

∙ 12/14/2022

Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding

From a visual scene containing multiple people, human is able to disting...

0 Haoxuan You, et al. ∙

research

∙ 11/10/2022

Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense

Visual commonsense understanding requires Vision Language (VL) models to...

0 Zhecan Wang, et al. ∙

research

∙ 11/03/2022

Video Event Extraction via Tracking Visual States of Arguments

Video event extraction aims to detect salient events from a video and id...

0 Guang Yang, et al. ∙

research

∙ 10/22/2022

Weakly-Supervised Temporal Article Grounding

Given a long untrimmed video and natural language queries, video groundi...

0 Long Chen, et al. ∙

research

∙ 10/15/2022

Video in 10 Bits: Few-Bit VideoQA for Efficiency and Privacy

In Video Question Answering (VideoQA), answering general questions about...

0 Shiyuan Huang, et al. ∙

research

∙ 07/26/2022

Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training

Large-scale multi-modal contrastive pre-training has demonstrated great ...

7 Haoxuan You, et al. ∙

research

∙ 06/14/2022

Multimodal Event Graphs: Towards Event Centric Understanding of Multimodal World

Understanding how events described or shown in multimedia content relate...

0 Hammad A. Ayyubi, et al. ∙

research

∙ 06/05/2022

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval

Multi-channel video-language retrieval require models to understand info...

0 Xudong Lin, et al. ∙

research

∙ 05/22/2022

Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

The goal of this work is to build flexible video-language models that ca...

11 Zhenhailong Wang, et al. ∙

research

∙ 04/22/2022

Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks

Cross-modal encoders for vision-language (VL) tasks are often pretrained...

3 Zhecan Wang, et al. ∙

research

∙ 04/16/2022

Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting

We study multimodal few-shot object detection (FSOD) in this paper, usin...

0 Guangxing Han, et al. ∙

research

∙ 03/29/2022

Fine-Grained Visual Entailment

Visual entailment is a recently proposed multimodal reasoning task where...

0 Christopher Thomas, et al. ∙

research

∙ 03/28/2022

Few-Shot Object Detection with Fully Cross-Transformer

Few-shot object detection (FSOD), with the aim to detect novel objects u...

0 Guangxing Han, et al. ∙

research

∙ 01/26/2022

Learning To Recognize Procedural Activities with Distant Supervision

In this paper we consider the problem of classifying fine-grained, multi...

1 Xudong Lin, et al. ∙

research

∙ 01/15/2022

CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks

Contrastive language-image pretraining (CLIP) links vision and language ...

15 Zhecan Wang, et al. ∙

research

∙ 01/13/2022

CLIP-Event: Connecting Text and Images with Event Structures

Vision-language (V+L) pretraining models have achieved great success in ...

6 Manling Li, et al. ∙

research

∙ 12/20/2021

MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding

Recently, there has been an increasing interest in building question ans...

14 Revanth Gangi Reddy, et al. ∙

research

∙ 12/17/2021

Query Adaptive Few-Shot Object Detection with Heterogeneous Graph Convolutional Networks

Few-shot object detection (FSOD) aims to detect never-seen objects using...

0 Guangxing Han, et al. ∙

research

∙ 12/16/2021

SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Answering complex questions about images is an ambitious goal for machin...

4 Zhecan Wang, et al. ∙

research

∙ 12/01/2021

PreViTS: Contrastive Pretraining with Video Tracking Supervision

Videos are a rich source for self-supervised learning (SSL) of visual re...

12 Brian Chen, et al. ∙

research

∙ 09/27/2021

Joint Multimedia Event Extraction from Video and Article

Visual and textual modalities contribute complementary information about...

0 Brian Chen, et al. ∙

research

∙ 04/26/2021

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

Multimodal self-supervised learning is getting more and more attention a...

0 Brian Chen, et al. ∙

research

∙ 04/22/2021

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

We present a framework for learning multimodal representations from unla...

21 Hassan Akbari, et al. ∙

research

∙ 04/15/2021

Meta Faster R-CNN: Towards Accurate Few-Shot Object Detection with Attentive Feature Alignment

Few-shot object detection (FSOD) aims to detect objects using only few e...

0 Guangxing Han, et al. ∙

research

∙ 03/23/2021

Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos

In this paper, we address the problem of referring expression comprehens...

0 Sijie Song, et al. ∙

research

∙ 01/28/2021

VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs

We present Vx2Text, a framework for text generation from multimodal inpu...

6 Xudong Lin, et al. ∙

research

∙ 12/24/2020

Task-Adaptive Negative Class Envision for Few-Shot Open-Set Recognition

Recent works seek to endow recognition systems with the ability to handl...

0 Shiyuan Huang, et al. ∙

research

∙ 11/20/2020

Open-Vocabulary Object Detection Using Captions

Despite the remarkable accuracy of deep neural networks in object detect...

8 Alireza Zareian, et al. ∙

research

∙ 11/18/2020

Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language

Neuro-symbolic representations have proved effective in learning structu...

18 Hassan Akbari, et al. ∙

research

∙ 10/24/2020

Weakly-supervised VisualBERT: Pre-training without Parallel Images and Captions

Pre-trained contextual vision-and-language (V&L) models have brought i...

0 Liunian Harold Li, et al. ∙

research

∙ 10/09/2020

Uncertainty-Aware Few-Shot Image Classification

Few-shot image classification aims to learn to recognize new categories ...

0 Zhizheng Zhang, et al. ∙

research

∙ 09/03/2020

Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding

The prevailing framework for solving referring expression grounding is b...

2 Long Chen, et al. ∙

research

∙ 07/01/2020

COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation

To combat COVID-19, both clinicians and scientists need to digest the va...

18 Qingyun Wang, et al. ∙

research

∙ 06/17/2020

Learning Visual Commonsense for Robust Scene Graph Generation

Scene graph generation models understand the scene through object and pr...

0 Alireza Zareian, et al. ∙

research

∙ 05/19/2020

Deep Learning Guided Building Reconstruction from Satellite Imagery-derived Point Clouds

3D urban reconstruction of buildings from remotely sensed imagery has dr...

8 Bo Xu, et al. ∙

research

∙ 05/05/2020

Cross-media Structured Common Space for Multimedia Event Extraction

We introduce a new task, MultiMedia Event Extraction (M2E2), which aims ...

0 Manling Li, et al. ∙

research

∙ 03/08/2020

Unifying Specialist Image Embedding into Universal Image Embedding

Deep image embedding provides a way to measure the semantic similarity o...

0 Yang Feng, et al. ∙

research

∙ 02/11/2020

Training with Streaming Annotation

In this paper, we address a practical scenario where training data is re...

0 Tongtao Zhang, et al. ∙

research

∙ 01/08/2020

Weakly Supervised Visual Semantic Parsing

Scene Graph Generation (SGG) aims to extract entities, predicates and th...

0 Alireza Zareian, et al. ∙

research

∙ 01/07/2020

Bridging Knowledge Graphs to Generate Scene Graphs

Scene graphs are powerful representations that encode images into their ...

0 Alireza Zareian, et al. ∙

research

∙ 01/05/2020

General Partial Label Learning via Dual Bipartite Graph Autoencoder

We formulate a practical yet challenging problem: General Partial Label ...

11 Brian Chen, et al. ∙

research

∙ 12/10/2019

Flow-Distilled IP Two-Stream Networks for Compressed Video Action Recognition

Two-stream networks have achieved great success in video recognition. A ...

0 Shiyuan Huang, et al. ∙
