Research

주식회사 엑스텔리전스

Weakly Supervised Referring Image Segmentation with Intra-Chunk and In…

Page information

Author: Administrator
Views: 22 | Posted: 26-03-19 15:27

Content

Abstract

Referring image segmentation aims to localize the object in an image referred to by a natural language expression. Most previous studies rely on large-scale datasets with segmentation labels, which are costly to obtain. In this work, we present a weakly supervised learning method that utilizes only readily available image-text pairs. We first train a vision-language model for image-text matching and extract visual saliency maps using Grad-CAM to identify regions corresponding to each word. However, Grad-CAM presents two major limitations. First, it does not sufficiently capture semantic relationships between words. To address this, we model these relationships through intra-chunk and inter-chunk consistency. Second, it tends to highlight only small regions of the target object, resulting in low recall. To overcome this, we refine localization maps using Transformer-based self-attention and unsupervised object shape priors. Experiments on benchmark datasets (RefCOCO, RefCOCO+, G-Ref) demonstrate that our method significantly outperforms existing approaches. Furthermore, the proposed method is applicable across various levels of supervision and consistently achieves superior performance.
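
For readers unfamiliar with the saliency step mentioned in the abstract, the following is a minimal sketch of the generic Grad-CAM computation only, not the authors' pipeline. It assumes the feature maps of some layer and the gradient of a scalar matching score with respect to those maps are already available as arrays. The function name `grad_cam` and the toy inputs are hypothetical and do not come from the paper.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Generic Grad-CAM on precomputed arrays (hypothetical helper).

    activations: (K, H, W) feature maps from a chosen layer.
    gradients:   (K, H, W) gradient of a scalar score w.r.t. those maps.
    Returns an (H, W) saliency map scaled to [0, 1].
    """
    activations = np.asarray(activations, dtype=float)
    gradients = np.asarray(gradients, dtype=float)
    # Channel weights: spatial average of the gradients.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum over channels, then ReLU to keep positive evidence only.
    cam = np.maximum(np.tensordot(weights, activations, axes=1), 0.0)
    # Scale by the peak value unless the map is all zeros.
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy input (assumed values): channel 0 fires top-left, channel 1 bottom-right.
acts = np.zeros((2, 2, 2))
acts[0, 0, 0] = 1.0
acts[1, 1, 1] = 1.0
# Channel 0 receives positive gradient, channel 1 negative gradient.
grads = np.stack([np.full((2, 2), 2.0), np.full((2, 2), -1.0)])
saliency = grad_cam(acts, grads)
```

In this toy case only the top-left location keeps a nonzero response, because the channel with negative gradient is suppressed by the ReLU. Maps produced this way can be sparse and cover only part of an object, which is consistent with the low-recall limitation the abstract describes.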
