Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

Robotics Institute, Carnegie Mellon University
WACV 2025

Abstract & Highlights

Multi-modal semantic segmentation significantly enhances AI agents' perception and scene understanding, especially under adverse conditions such as low-light or overexposed environments. Leveraging an additional modality (X-modality) such as thermal or depth alongside conventional RGB provides complementary information, enabling more robust and reliable predictions. In this work, we introduce Sigma, a Siamese Mamba network for multi-modal semantic segmentation built on the recently proposed Mamba architecture. Unlike conventional methods that rely on CNNs, with their limited local receptive fields, or on Vision Transformers (ViTs), which offer global receptive fields at the cost of quadratic complexity, our model achieves global receptive fields with linear complexity. By employing a Siamese encoder and introducing a Mamba-based fusion mechanism, we effectively select essential information from the different modalities. A decoder is then developed to enhance the channel-wise modeling ability of the model. Our proposed method is rigorously evaluated on both RGB-Thermal and RGB-Depth semantic segmentation tasks, demonstrating its superiority and marking the first successful application of State Space Models (SSMs) in multi-modal perception tasks.
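
The "global receptive fields with linear complexity" claim comes from the state space model (SSM) recurrence at the core of Mamba. As a rough, hedged illustration (not the paper's optimized selective-scan kernel), the sketch below shows how a discretized SSM processes a length-L sequence in a single O(L) pass while each output still depends on every earlier input through the hidden state; the tensor shapes and the function name `ssm_scan` are illustrative assumptions, not code from the paper.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal discretized SSM recurrence (illustrative; not the paper's CUDA selective scan).

    x: (L, D)     input sequence of L tokens with D channels
    A: (D, N)     per-channel state transition (already discretized)
    B: (L, D, N)  input projection, input-dependent as in Mamba's selective scan
    C: (L, D, N)  output projection, input-dependent as in Mamba's selective scan
    Returns y: (L, D)
    """
    L, D = x.shape
    N = A.shape[-1]
    h = np.zeros((D, N))              # hidden state accumulates information from all past tokens
    y = np.empty((L, D))
    for t in range(L):                # one pass over the sequence -> O(L) time
        h = A * h + B[t] * x[t][:, None]   # update state with the current token
        y[t] = (C[t] * h).sum(-1)          # readout; y[t] depends on x[0..t] through h
    return y

# Toy usage: 8 tokens, 4 channels, state size 2
L, D, N = 8, 4, 2
rng = np.random.default_rng(0)
out = ssm_scan(rng.standard_normal((L, D)),
               np.full((D, N), 0.9),
               rng.standard_normal((L, D, N)) * 0.1,
               rng.standard_normal((L, D, N)))
print(out.shape)  # (8, 4)
```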

  • To the best of our knowledge, our work marks the first successful application of state space models, specifically Mamba, in multi-modal semantic segmentation.
  • We introduce a Mamba-based fusion mechanism alongside a channel-aware decoder to efficiently extract information from different modalities and integrate it seamlessly (a simplified structural sketch follows this list).
  • Comprehensive evaluations in RGB-Thermal and RGB-Depth domains showcase our method's superior accuracy and efficiency, setting a new benchmark for future investigations into Mamba's potential in multi-modal learning.
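
The sketch below shows only the overall data flow implied above: a weight-shared (Siamese) encoder applied to both RGB and the X-modality, a fusion step, and a decoder with a channel re-weighting stage. The module internals are placeholders and assumptions for illustration; Sigma's actual encoder, fusion, and decoder are built from Mamba blocks as described in the paper.

```python
import torch
import torch.nn as nn

class SigmaLikeSketch(nn.Module):
    """Structural sketch of a Siamese two-branch segmentation network.

    Sigma uses Mamba blocks in the encoder, a Mamba-based cross-modal fusion
    module, and a channel-aware Mamba decoder; here those are stood in for by
    plain convolutional placeholders purely to show the data flow
    (shared encoder -> per-modality features -> fusion -> decoder).
    """
    def __init__(self, in_ch=3, feat_ch=64, num_classes=9):
        super().__init__()
        # Siamese encoder: the SAME module (shared weights) encodes both modalities.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Placeholder fusion: concatenate and project. Sigma's fusion instead
        # selects and exchanges cross-modal information with Mamba blocks.
        self.fuse = nn.Conv2d(2 * feat_ch, feat_ch, 1)
        # Placeholder decoder: a channel-attention gate (rough analogue of
        # "channel-aware" modeling) followed by a prediction head.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(feat_ch, feat_ch, 1), nn.Sigmoid()
        )
        self.head = nn.Conv2d(feat_ch, num_classes, 1)

    def forward(self, rgb, x_modality):
        f_rgb = self.encoder(rgb)                 # features from the RGB branch
        f_x = self.encoder(x_modality)            # same weights applied to thermal/depth
        fused = self.fuse(torch.cat([f_rgb, f_x], dim=1))
        fused = fused * self.channel_gate(fused)  # re-weight channels
        logits = self.head(fused)
        return nn.functional.interpolate(logits, size=rgb.shape[-2:],
                                         mode="bilinear", align_corners=False)

# Toy usage with a 3-channel RGB image and a 3-channel thermal/HHA image.
model = SigmaLikeSketch()
rgb = torch.randn(1, 3, 64, 64)
thermal = torch.randn(1, 3, 64, 64)
print(model(rgb, thermal).shape)  # torch.Size([1, 9, 64, 64])
```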

Experimental Results

Quantitative Results on RGB-Thermal Datasets
Table 1. Quantitative comparisons for semantic segmentation of RGB-T images on MFNet and PST900 datasets. The best and second best results in each block are highlighted in bold and underline, respectively.

Quantitative Results on RGB-Depth Datasets
Table 2. Quantitative comparisons of RGB-D semantic segmentation on NYU Depth V2 and SUN RGB-D. The best and second best results in each block are highlighted in bold and underline, respectively.

Qualitative Comparisons
Figure 3. Qualitative comparison on MFNet dataset. More results can be found in the supplementary material.
Figure 4. Qualitative comparison on NYU Depth V2 dataset. We use HHA images for better visualization of depth modality. More results can be found in the supplementary material.
Figure 5. Comparative analysis of semantic segmentation results: single-modal vs. multi-modal approaches.

BibTeX

@article{wan2024sigma,
  title={Sigma: Siamese mamba network for multi-modal semantic segmentation},
  author={Wan, Zifu and Wang, Yuhao and Yong, Silong and Zhang, Pingping and Stepputtis, Simon and Sycara, Katia and Xie, Yaqi},
  journal={arXiv preprint arXiv:2404.04256},
  year={2024}
}