Abstract: We present an architecture and a training recipe that adapt pretrained open-world image models to localization in videos. Understanding the open visual world (without being constrained by ...