Injecting Image Details into CLIP's Feature Space

08/31/2022
by Zilun Zhang, et al.

Although CLIP-like Visual Language Models provide a functional joint feature space for images and text, their image input size is limited (e.g., 224), so subtle details are lost from the feature representation when the input is a high-resolution image (e.g., 2240). In this work, we introduce an efficient framework that produces a single feature representation for a high-resolution image, one that injects image details and shares the same semantic space as the original CLIP. In the framework, we train a feature-fusing model on CLIP features extracted with a carefully designed image patching method that can cover objects of any scale, weakly supervised by image-agnostic class-prompted queries. We validate the framework by retrieving images with class-prompted queries on real-world and synthetic datasets, showing significant performance improvement on these tasks. Furthermore, to fully demonstrate the framework's detail-retrieval ability, we construct a CLEVR-like synthetic dataset called CLVER-DS, which is fully annotated and has a controllable object scale.
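
To make the idea of patches that cover objects at several scales concrete, here is a minimal Python sketch. It is not the authors' patching method, which the abstract does not specify. It assumes a simple pyramid in which level k splits the image into 2^k x 2^k tiles, and the function name multiscale_patch_boxes and its levels parameter are hypothetical. In a full pipeline, each box would presumably be cropped, resized to the encoder's input size (e.g., 224), and encoded before the per-patch features are fused.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixel coordinates
Box = Tuple[int, int, int, int]


def multiscale_patch_boxes(width: int, height: int, levels: int = 3) -> List[Box]:
    """Return crop boxes for a simple tile pyramid (an illustrative assumption).

    Level 0 is the whole image; level k splits it into 2**k x 2**k tiles,
    so deeper levels cover smaller objects at a higher effective resolution.
    """
    boxes: List[Box] = []
    for level in range(levels):
        n = 2 ** level  # tiles per side at this level
        for row in range(n):
            for col in range(n):
                left = col * width // n
                top = row * height // n
                right = (col + 1) * width // n
                bottom = (row + 1) * height // n
                boxes.append((left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    # A 2240 x 2240 image with three levels yields 1 + 4 + 16 boxes.
    boxes = multiscale_patch_boxes(2240, 2240, levels=3)
    print(len(boxes), boxes[0], boxes[-1])
```

With a 2240-pixel image and three levels, the finest tiles are 560 pixels wide, so resizing them to 224 loses far less detail than resizing the whole image at once.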


