

SLiMe: Segment Like Me

Aliasghar Khani · Saeid Asgari · Aditya Sanghi · Ali Mahdavi Amiri · Ghassan Hamarneh

Halle B #12
[ Project Page ]
Thu 9 May 1:45 a.m. PDT — 3:45 a.m. PDT


Significant strides have been made using large vision-language models, like Stable Diffusion (SD), for a variety of downstream tasks, including image generation, image editing, and 3D shape generation. Inspired by these advancements, we explore leveraging these vision-language models for segmenting images at any desired granularity using as few as one annotated sample. We propose SLiMe, which frames this problem as an optimization task. Specifically, given a single image and its segmentation mask, we first extract our novel “weighted accumulated self-attention map” along with the cross-attention map from the SD prior. Then, using these extracted maps, the text embeddings of SD are optimized to highlight the segmented region in these attention maps, which in turn can be used to derive new segmentation results. Moreover, leveraging additional training data when available, i.e., the few-shot setting, further improves SLiMe's performance. We performed comprehensive experiments examining various design factors and showed that SLiMe outperforms other existing one-shot and few-shot segmentation methods.
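
To make the optimization concrete, here is a minimal PyTorch sketch of the loop the abstract describes: tuning SD text embeddings so that attention maps extracted from the frozen SD prior highlight the annotated region. This is not the authors' implementation; `extract_attention_maps` is a hypothetical hook standing in for the paper's attention-extraction plumbing, the scheduler is assumed to expose a diffusers-style `add_noise`, and a plain BCE loss stands in for the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def optimize_text_embeddings(
    unet,                    # frozen SD UNet (assumed)
    scheduler,               # diffusers-style scheduler with .add_noise(...)
    latents,                 # VAE-encoded latents of the single annotated image
    mask,                    # (1, 1, H, W) float mask for the target region
    text_embeddings,         # initial SD text embeddings to be tuned
    extract_attention_maps,  # hypothetical hook -> (cross_attn, was_attn)
    steps=100,
    lr=0.1,
):
    """Tune SD text embeddings so extracted attention maps highlight `mask`.

    Illustrative sketch only: helper names and the loss are assumptions,
    not the SLiMe authors' API.
    """
    text_embeddings = text_embeddings.clone().requires_grad_(True)
    opt = torch.optim.Adam([text_embeddings], lr=lr)

    for _ in range(steps):
        # Sample a random diffusion timestep and noise the latents with it,
        # so the optimized embeddings work across noise levels.
        t = torch.randint(0, scheduler.config.num_train_timesteps, (1,),
                          device=latents.device)
        noise = torch.randn_like(latents)
        noisy = scheduler.add_noise(latents, noise, t)

        # Pull the cross-attention and weighted accumulated self-attention
        # maps for the segment token out of the UNet forward pass
        # (hypothetical hook over the UNet's attention layers).
        cross_attn, was_attn = extract_attention_maps(
            unet, noisy, text_embeddings, t)

        # Upsample the low-resolution attention maps to mask resolution and
        # push them toward the annotated region.
        cross_attn = F.interpolate(cross_attn, size=mask.shape[-2:],
                                   mode="bilinear")
        was_attn = F.interpolate(was_attn, size=mask.shape[-2:],
                                 mode="bilinear")
        loss = (F.binary_cross_entropy_with_logits(cross_attn, mask)
                + F.binary_cross_entropy_with_logits(was_attn, mask))

        opt.zero_grad()
        loss.backward()
        opt.step()

    return text_embeddings.detach()
```

At inference, per the abstract, the optimized embeddings would be paired with a new image, the same attention maps extracted, and the result thresholded (or argmaxed across segment tokens) to derive the segmentation.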
