We present YORO - a multi-modal transformer encoder-only architecture fo...
We present DocFormer – a multi-modal transformer based architecture for ...
How does one represent an action? How does one describe an action that w...
Joint vision and language tasks like visual question answering are fasci...
We address the problem of semi-supervised domain adaptation of classific...