LiteParse pairs fast text parsing with a two-stage agent pattern, falling back to multimodal models when tables or charts ...
Abstract: Document Visual Question Answering (DocVQA) necessitates comprehension of both the spatial layout and the textual content. Multimodal pretraining is a foundational component of existing ...
Abstract: Accurate object detection depends on the precise refinement of bounding box regression. Recent advancements in bounding box regression have introduced a variety of methodologies aimed at ...
YOLOv4 Face Detection on Yale Dataset A comprehensive implementation of YOLOv4 object detection for face recognition tasks, specifically designed for COOP training and educational purposes. This ...