In a joint vision-language space, a text feature (e.g., from "a photo of...
Transformer encoder architectures have recently achieved state-of-the-ar...
Grounded situation recognition is the task of predicting the main activi...
Grounded Situation Recognition (GSR) is the task that not only classifie...