Abstract: Contrastive learning-based vision-language pretraining approaches, such as CLIP, have demonstrated great success in many vision-language tasks. These methods achieve cross-modal alignment by ...