Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

❀ Robotics Institute, Carnegie Mellon University, USA

❦ School of Future Technology, Dalian University of Technology, China

WACV 2025 (Oral Presentation)

**Figure 1. Overall architecture of Sigma.** Our proposed method comprises a Siamese Mamba encoder, a fusion module, and a channel-aware Mamba decoder. During the encoding phase, four Visual State Space Blocks (VSSB) with downsampling operations are sequentially cascaded to extract multi-level image features. Subsequently, features from each level, derived from the two branches, are processed through a fusion module. In the decoding phase, the fused features at each level are further enhanced by a Channel-Aware Visual State Space Block (CAVSSB) with an upsampling operation. Ultimately, the final feature is forwarded to a classifier to generate the prediction. More details can be found in the paper.
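The data flow in the caption can be traced with a small shape-only sketch. It is not the Sigma implementation: features are represented only by their `(channels, height, width)` shapes, the function names (`encode`, `fuse`, `decode`, `sigma_shapes`) are hypothetical, and the strides and channel widths are assumed values that this page does not state. How the decoder combines levels is also an assumption here.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (channels, height, width)

# Assumed values for illustration only; the page does not state them.
STRIDES = (4, 2, 2, 2)
CHANNELS = (96, 192, 384, 768)


def encode(image: Shape) -> List[Shape]:
    """Trace one encoder branch: four cascaded VSSB stages, each downsampling."""
    _, h, w = image
    levels = []
    for stride, channels in zip(STRIDES, CHANNELS):
        h, w = h // stride, w // stride
        levels.append((channels, h, w))
    return levels


def fuse(rgb: Shape, x: Shape) -> Shape:
    """Stand-in for the fusion module: two same-level features in, one out."""
    if rgb != x:
        raise ValueError("branches must agree in shape at each level")
    return rgb


def decode(fused: List[Shape], num_classes: int) -> Shape:
    """Stand-in for the decoder: CAVSSB plus upsampling per level, then a classifier."""
    _, h, w = fused[-1]  # start from the deepest level
    for _, skip_h, skip_w in reversed(fused[:-1]):
        h, w = skip_h, skip_w  # upsample to the next shallower resolution
    return (num_classes, h, w)  # classifier maps channels to class scores


def sigma_shapes(rgb: Shape, x: Shape, num_classes: int) -> Shape:
    """Two encoder branches, per-level fusion, then decoding."""
    rgb_levels = encode(rgb)
    x_levels = encode(x)
    fused = [fuse(r, m) for r, m in zip(rgb_levels, x_levels)]
    return decode(fused, num_classes)


# Arbitrary example: a 3-channel RGB image, a 1-channel X-modality image, 5 classes.
print(sigma_shapes((3, 480, 640), (1, 480, 640), num_classes=5))
```

Both branches use the same `encode` function because only shapes are traced; whether the two branches share weights is not specified on this page.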

Abstract & Highlights

Multi-modal semantic segmentation significantly enhances AI agents' perception and scene understanding, especially under adverse conditions such as low-light or overexposed environments. Leveraging additional modalities (X-modality) such as thermal and depth alongside traditional RGB provides complementary information, enabling more robust and reliable prediction. In this work, we introduce Sigma, a Siamese Mamba network for multi-modal semantic segmentation built on the Mamba state space model. Unlike conventional methods that rely on CNNs, with their limited local receptive fields, or Vision Transformers (ViTs), which offer global receptive fields at the cost of quadratic complexity, our model achieves global receptive fields with linear complexity. By employing a Siamese encoder and introducing a Mamba-based fusion mechanism, we effectively select essential information from different modalities. A decoder is then developed to enhance the channel-wise modeling ability of the model. Our proposed method is rigorously evaluated on both RGB-Thermal and RGB-Depth semantic segmentation tasks, demonstrating its superiority and marking the first successful application of State Space Models (SSMs) in multi-modal perception tasks.
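The contrast between linear and quadratic complexity can be illustrated with a toy example. The sketch below is a simplified scalar state space recurrence, not Mamba's selective scan: in Mamba the parameters depend on the input, whereas here `a`, `b`, and `c` are fixed scalars chosen for illustration. The function names are hypothetical.

```python
from typing import List


def ssm_scan(x: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * h_t in a single pass."""
    h = 0.0
    ys = []
    for x_t in x:  # one state update per token, so cost grows linearly
        h = a * h + b * x_t
        ys.append(c * h)
    return ys


def attention_pair_count(length: int) -> int:
    """Query-key pairs scored by full self-attention: length squared."""
    return length * length


# An impulse at the first position still influences the last output,
# so every earlier token is visible through the running state.
print(ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5))  # [1.0, 0.5, 0.25, 0.125]
print(attention_pair_count(1024))  # 1048576
```

The scan touches each token once while carrying a running state, which is how a recurrence of this kind reaches the whole preceding sequence without comparing every pair of positions.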

To the best of our knowledge, our work marks the first successful application of state space models, specifically Mamba, in multi-modal semantic segmentation.
We introduce a Mamba-based fusion mechanism alongside a channel-aware decoder to efficiently extract information across different modalities and integrate it seamlessly.
Comprehensive evaluations in RGB-Thermal and RGB-Depth domains showcase our method's superior accuracy and efficiency, setting a new benchmark for future investigations into Mamba's potential in multi-modal learning.

Experimental Results

Quantitative Results on RGB-Thermal Datasets

Table 1. Quantitative comparisons for semantic segmentation of RGB-T images on the MFNet and PST900 datasets. The best and second-best results in each block are highlighted in bold and underlined, respectively.

Quantitative Results on RGB-Depth Datasets

Table 2. Quantitative comparisons of RGB-D semantic segmentation on NYU Depth V2 and SUN RGB-D. The best and second-best results in each block are highlighted in bold and underlined, respectively.

Qualitative Comparisons

Figure 3. Qualitative comparison on the MFNet dataset. More results can be found in the supplementary material.

Figure 4. Qualitative comparison on the NYU Depth V2 dataset. We use HHA images for better visualization of the depth modality. More results can be found in the supplementary material.

Figure 5. Comparative analysis of semantic segmentation results: single-modal vs. multi-modal approach.

BibTeX

@article{wan2024sigma,
  title={Sigma: Siamese mamba network for multi-modal semantic segmentation},
  author={Wan, Zifu and Wang, Yuhao and Yong, Silong and Zhang, Pingping and Stepputtis, Simon and Sycara, Katia and Xie, Yaqi},
  journal={arXiv preprint arXiv:2404.04256},
  year={2024}
}
