Research

주식회사 엑스텔리전스

Toward Interactive Regional Understanding in Vision-Large Language Mod…

Page information

Author: Administrator
Views: 20, Posted: 26-03-19 15:25

Body

Abstract

Recent Vision-Language Pre-training (VLP) models have demonstrated significant advancements. Nevertheless, these models heavily rely on image-text pairs that capture only coarse and global information of an image, leading to a limitation in their regional understanding ability. In this work, we introduce RegionVLM, equipped with explicit regional modeling capabilities, allowing it to understand user-indicated image regions. To achieve this, we design a simple yet innovative architecture, requiring no modifications to the model architecture or objective function. Additionally, we leverage a dataset that contains a novel source of information, namely Localized Narratives, which has been overlooked in previous VLP research. Our experiments demonstrate that our single generalist model not only achieves an interactive dialogue system but also exhibits superior performance on various zero-shot region understanding tasks, without compromising its ability for global image understanding.