RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation based on Visual Foundation Model



Keyan Chen1,2,3,4
Chenyang Liu1,2,3,4
Hao Chen1,2,3,4
Haotian Zhang1,2,3,4
Wenyuan Li1,2,3,4
Zhengxia Zou1,4
Zhenwei Shi ✉ 1,2,3,4

Beihang University1
Beijing Key Laboratory of Digital Media2
State Key Laboratory of Virtual Reality Technology and Systems3
Shanghai Artificial Intelligence Laboratory4

Code [GitHub]
Demo [HuggingFace]
Paper [arXiv]
Cite [BibTeX]


Teaser

(a) Depicts the instance segmentation results from point-based prompt, box-based prompt, SAM's "everything" mode (which segments all objects in the image), and RSPrompter. SAM performs category-agnostic instance segmentation, relying on manually provided prior prompts. (b) Illustrates the segmentation results of point-based prompts from different locations, a two-point based prompt, and a box-based prompt. The type, location, and quantity of prompts heavily influence SAM's results.



Abstract

Leveraging vast training data (SA-1B), the foundation Segment Anything Model (SAM) proposed by Meta AI Research exhibits remarkable generalization and zero-shot capabilities. Nonetheless, as a category-agnostic instance segmentation method, SAM heavily depends on prior manual guidance involving points, boxes, and coarse-grained masks. Additionally, its performance on remote sensing image segmentation tasks has yet to be fully explored and demonstrated. In this paper, we consider designing an automated instance segmentation approach for remote sensing images based on the SAM foundation model, incorporating semantic category information. Inspired by prompt learning, we propose a method to learn the generation of appropriate prompts for SAM input. This enables SAM to produce semantically discernible segmentation results for remote sensing images, which we refer to as RSPrompter. We also suggest several ongoing derivatives for instance segmentation tasks, based on recent developments in the SAM community, and compare their performance with RSPrompter. Extensive experimental results on the WHU building, NWPU VHR-10, and SSDD datasets validate the efficacy of our proposed method.



Architecture


From left to right, the figure illustrates SAM-seg, SAM-cls, SAM-det, and RSPrompter as alternative solutions for applying SAM to remote sensing image instance segmentation tasks. (a) An instance segmentation head is added after SAM's image encoder. (b) SAM's "everything" mode generates masks for all objects in an image, which are subsequently classified into specific categories by a classifier. (c) Object bounding boxes are first produced by an object detector and then used as prior prompts input to SAM to obtain the corresponding masks. (d) The proposed RSPrompter generates category-relevant prompt embeddings that SAM uses to produce instance segmentation masks. The snowflake symbol in the figure signifies that the model parameters in this part are kept frozen.
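The detect-then-prompt flow of variant (c) can be sketched in a few lines. This is only an illustration of the data flow, not the authors' implementation: `toy_detector` and `toy_box_prompted_segmenter` are hypothetical stand-ins for a trained object detector and for frozen SAM queried with a box prompt, and the 5x5 image is made up.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), with x1 and y1 exclusive
Image = List[List[int]]           # toy single-channel image
Mask = List[List[bool]]


def toy_detector(image: Image) -> List[Tuple[Box, str]]:
    """Hypothetical stand-in for a trained detector: returns (box, label) pairs."""
    return [((1, 1, 4, 4), "building")]


def toy_box_prompted_segmenter(image: Image, box: Box) -> Mask:
    """Hypothetical stand-in for frozen SAM with a box prompt.

    It simply marks non-zero pixels inside the box as foreground.
    """
    x0, y0, x1, y1 = box
    height, width = len(image), len(image[0])
    return [
        [(y0 <= y < y1) and (x0 <= x < x1) and image[y][x] > 0 for x in range(width)]
        for y in range(height)
    ]


def sam_det(
    image: Image,
    detector: Callable[[Image], List[Tuple[Box, str]]],
    segmenter: Callable[[Image, Box], Mask],
) -> List[Dict]:
    """Detect first, then use each box as a prompt for a category-agnostic segmenter."""
    instances = []
    for box, label in detector(image):
        mask = segmenter(image, box)  # the category label comes from the detector
        instances.append({"label": label, "box": box, "mask": mask})
    return instances


image = [
    [0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0],
    [0, 9, 9, 0, 0],
    [0, 0, 0, 0, 9],
    [0, 0, 0, 0, 0],
]
results = sam_det(image, toy_detector, toy_box_prompted_segmenter)
area = sum(sum(row) for row in results[0]["mask"])
print(results[0]["label"], area)
```

In this arrangement the segmenter never sees a category; the semantic label is attached only by the detector, which is the dependency on external prompts that RSPrompter aims to replace with learned prompt embeddings.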



Quantitative Results

R1: Benchmark on WHU

The results of RSPrompter in comparison to other methods on the WHU dataset are presented in the results table, with the best performance highlighted in bold. The task is single-class instance segmentation of buildings in optical RGB remote sensing images. RSPrompter-query attains the best performance for both box and mask predictions, achieving APbox/APmask values of 70.36/69.21. Among the SAM-based baselines, SAM-seg (Mask2Former) surpasses the original Mask2Former (60.40/62.77) with 67.84/66.66 on APbox/APmask, while SAM-seg (Mask R-CNN) exceeds the original Mask R-CNN (56.11/60.75) with 67.15/64.86. Furthermore, RSPrompter-query and RSPrompter-anchor improve the performance to 70.36/69.21 and 68.06/66.89, respectively, outperforming SAM-det, which carries out detection before segmentation. These observations suggest that the learning-to-prompt approach effectively adapts SAM for instance segmentation tasks in optical remote sensing images. Moreover, they demonstrate that the SAM backbone, trained on an extensive dataset, can provide valuable instance segmentation guidance even when it is fully frozen (as seen in SAM-seg).
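As background for reading the APbox and APmask columns: average precision for detection and instance segmentation is built on the intersection-over-union (IoU) between predicted and ground-truth boxes or masks. The sketch below shows only that overlap computation, not the full AP protocol (the exact evaluation settings are not stated on this page); the function names and the toy inputs are illustrative assumptions.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def mask_iou(a, b):
    """IoU of two binary masks given as equally sized nested lists."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1)
iou_box = box_iou((0, 0, 2, 2), (1, 1, 3, 3))

# Two masks sharing one of three foreground pixels: IoU = 1 / 3
pred = [[1, 1], [0, 0]]
gt = [[1, 0], [1, 0]]
iou_mask = mask_iou(pred, gt)

print(iou_box, iou_mask)
```

A prediction is counted as a match when its IoU with a ground-truth instance exceeds a chosen threshold, so APbox and APmask can differ for the same method when boxes are well localized but mask boundaries are not, or vice versa.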


R2: Benchmark on NWPU

We conduct comparison experiments on the NWPU dataset to further validate RSPrompter's effectiveness. Unlike the WHU dataset, this one is smaller in size but encompasses more instance categories, amounting to 10 classes of remote sensing objects. The experiment remains focused on optical RGB remote sensing image instance segmentation. The results table shows the overall results of various methods on this dataset. RSPrompter-anchor achieves the best box and mask predictions among all compared approaches (68.54/67.64). In comparison to Mask R-CNN-based methods, single-stage methods display a substantial decline in performance on this dataset, particularly the Transformer-based Mask2Former. This may be because the dataset is relatively small, making it challenging for single-stage methods to achieve adequate generalization across the full data domain, especially for Transformer-based methods that require a large amount of training data. Nonetheless, it is worth noting that the performance of the SAM-based SAM-seg (Mask2Former) and RSPrompter-query remains impressive. The performance improves from 29.60/35.02 for Mask2Former to 40.56/45.11 for SAM-seg (Mask2Former) and further to 58.79/65.29 for RSPrompter-query. These findings imply that SAM, having been trained on a large amount of data, can exhibit significant generalization ability on a small dataset. Even when there are differences in the image domain, SAM's performance can be enhanced through the learning-to-prompt approach.
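To make the size of the Mask2Former-family progression explicit, the short script below computes the absolute APbox/APmask gains over the original Mask2Former, using only the three pairs of values quoted in the paragraph above. The variable and function names are illustrative.

```python
# (method, APbox, APmask) on NWPU, as quoted in the text above
nwpu_rows = [
    ("Mask2Former", 29.60, 35.02),
    ("SAM-seg (Mask2Former)", 40.56, 45.11),
    ("RSPrompter-query", 58.79, 65.29),
]


def gains_over_baseline(rows):
    """Absolute AP gains of each later row over the first row (the baseline)."""
    _, base_box, base_mask = rows[0]
    return {
        name: (round(ap_box - base_box, 2), round(ap_mask - base_mask, 2))
        for name, ap_box, ap_mask in rows[1:]
    }


gains = gains_over_baseline(nwpu_rows)
for name, (d_box, d_mask) in gains.items():
    print(f"{name}: +{d_box:.2f} APbox, +{d_mask:.2f} APmask")
```

On these numbers, swapping in the SAM backbone accounts for roughly 10 AP points, and learning the prompts adds a further 18 to 20 points on top of that.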


R3: Benchmark on SSDD

We also evaluate on the SSDD dataset to assess the capability of RSPrompter more thoroughly on remote sensing image instance segmentation tasks. The SSDD dataset is a single-category SAR ship instance segmentation dataset, representing a distinctly different modality from the previously mentioned datasets and differing substantially from SAM's training data. The results table displays the AP values obtained by the different methods on this dataset. The SAM-seg (Mask2Former) (49.08/54.03) and SAM-seg (Mask R-CNN) (62.41/59.46) approaches, which are based on the SAM backbone, perform worse than the original Mask2Former (53.40/56.52) and Mask R-CNN (63.40/60.92). This suggests a considerable disparity between the SAM training image domain and the SAR data domain. Nonetheless, SAM-det achieves a significant performance enhancement (69.42/64.09), indicating that a detection-first approach followed by SAM segmentation can generalize well across domains. This is because SAM can provide accurate, category-agnostic, and highly generalized segmentation results when given fine-grained prompts. By unlocking the constrained space, RSPrompter-anchor further enhances the performance to 73.09/72.61, confirming the effectiveness of RSPrompter on this modality as well.



Visualizations

To facilitate a more effective visual comparison with other methods, we present a qualitative analysis of the segmentation results obtained from SAM-based techniques and other state-of-the-art instance segmentation approaches. The following figures depict sample segmentation instances from the WHU dataset, NWPU dataset, and SSDD dataset, respectively. It can be observed that the proposed RSPrompter yields notable visual improvements in instance segmentation. Compared to alternative methods, the RSPrompter generates superior results, exhibiting sharper edges, more distinct contours, enhanced completeness, and a closer resemblance to the ground-truth references.

Visualizations of the segmentation results obtained from the WHU dataset

Visualizations of the segmentation results obtained from the NWPU dataset

Visualizations of the segmentation results obtained from the SSDD dataset



Acknowledgements

Based on a template by Phillip Isola and Richard Zhang.