M&M Mix: A Multimodal Multiview Transformer Ensemble

06/20/2022
by   Xuehan Xiong, et al.

This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge. Our approach builds upon our recent work, Multiview Transformer for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission consists of an ensemble of Multimodal MTV (M&M) models with varying backbone sizes and input modalities. Our approach achieved 52.8% Top-1 accuracy on the test set in action classes, which is 4.1% higher than last year's winning entry.
