CLIP Brings Better Features to Visual Aesthetics Learners

by   Liwu Xu, et al.

The success of pre-training approaches on a variety of downstream tasks has revitalized the field of computer vision. Image aesthetics assessment (IAA) is one of the ideal application scenarios for such methods due to subjective and expensive labeling procedure. In this work, an unified and flexible two-phase CLIP-based Semi-supervised Knowledge Distillation paradigm is proposed, namely CSKD. Specifically, we first integrate and leverage a multi-source unlabeled dataset to align rich features between a given visual encoder and an off-the-shelf CLIP image encoder via feature alignment loss. Notably, the given visual encoder is not limited by size or structure and, once well-trained, it can seamlessly serve as a better visual aesthetic learner for both student and teacher. In the second phase, the unlabeled data is also utilized in semi-supervised IAA learning to further boost student model performance when applied in latency-sensitive production scenarios. By analyzing the attention distance and entropy before and after feature alignment, we notice an alleviation of feature collapse issue, which in turn showcase the necessity of feature alignment instead of training directly based on CLIP image encoder. Extensive experiments indicate the superiority of CSKD, which achieves state-of-the-art performance on multiple widely used IAA benchmarks.


page 1

page 2

page 3

page 4


SLaM: Student-Label Mixing for Semi-Supervised Knowledge Distillation

Semi-supervised knowledge distillation is a powerful training paradigm f...

A Studious Approach to Semi-Supervised Learning

The problem of learning from few labeled examples while using large amou...

Improving Knowledge Distillation via Regularizing Feature Norm and Direction

Knowledge distillation (KD) exploits a large well-trained model (i.e., t...

SelfHAR: Improving Human Activity Recognition through Self-training with Unlabeled Data

Machine learning and deep learning have shown great promise in mobile se...

ATST: Audio Representation Learning with Teacher-Student Transformer

Self-supervised learning (SSL) learns knowledge from a large amount of u...

Semi-UFormer: Semi-supervised Uncertainty-aware Transformer for Image Dehazing

Image dehazing is fundamental yet not well-solved in computer vision. Mo...

Please sign up or login with your details

Forgot password? Click here to reset