Abstract: Video question answering has become a cornerstone task for evaluating vision language models. However, existing models often fail to ground their answers in relevant visual evidence or ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results