MMSFormer: Multimodal Transformer for Material and Semantic Segmentation

Performing multimodal material and semantic segmentation with transformers


Md Kaykobad Reza 1, Ashley Prater-Bennette 2, and M. Salman Asif 1

1 University of California Riverside, CA, USA
2 Air Force Research Laboratory, NY, USA

Paper (IEEE OJSP) Paper (arXiv) Code (GitHub) Webpage

Abstract


Leveraging information across diverse modalities is known to enhance performance on multimodal segmentation tasks. However, effectively fusing information from different modalities remains challenging due to the unique characteristics of each modality. In this paper, we propose a novel fusion strategy that can effectively fuse information from different modality combinations. We also propose a new model named Multi-Modal Segmentation TransFormer (MMSFormer) that incorporates the proposed fusion strategy to perform multimodal material and semantic segmentation tasks. MMSFormer outperforms current state-of-the-art models on three different datasets. Starting with only one input modality, performance improves progressively as additional modalities are incorporated, showcasing the effectiveness of the fusion block in combining useful information from diverse input modalities. Ablation studies show that different modules in the fusion block are crucial for overall model performance. Furthermore, our ablation studies also highlight the capacity of different input modalities to improve performance in identifying different types of materials.



Figure 1: (a) Overall architecture of the MMSFormer model. Each image passes through a modality-specific encoder where we extract hierarchical features. We then fuse the extracted features using the proposed fusion block and pass the fused features to the decoder for predicting the segmentation map. (b) Illustration of the mix transformer block. Each block applies a spatial reduction before multi-head attention to reduce computational cost. (c) Proposed multimodal fusion block. We first concatenate all the features along the channel dimension and pass them through a linear fusion layer to fuse them. The fused feature tensor is then fed to a linear projection and parallel convolution layers to capture multi-scale features. We use a Squeeze-and-Excitation block [28] as channel attention in the residual connection to dynamically re-calibrate the features along the channel dimension.
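To make the fusion block in Figure 1(c) concrete, below is a minimal PyTorch sketch. The class names (`FusionBlock`, `SqueezeExcite`), the depth-wise kernel sizes (3, 5, 7), the GELU activation, and the SE reduction ratio are illustrative assumptions and may differ from the released implementation; please refer to the GitHub code for the authors' exact design.

```python
# Minimal sketch of the multimodal fusion block in Figure 1(c).
# Kernel sizes, reduction ratio, and normalization choices are assumptions
# for illustration and may differ from the released MMSFormer code.
import torch
import torch.nn as nn


class SqueezeExcite(nn.Module):
    """Channel attention (Squeeze-and-Excitation) used on the residual path."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                   # squeeze: global spatial average
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                                              # per-channel gating weights
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)                                        # re-calibrate channels


class FusionBlock(nn.Module):
    """Fuse per-modality features of shape (B, C, H, W) into one (B, C, H, W) tensor."""
    def __init__(self, channels: int, num_modalities: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # Linear fusion layer: mixes the concatenated channels (1x1 convolution).
        self.linear_fuse = nn.Conv2d(num_modalities * channels, channels, kernel_size=1)
        # Linear projection followed by parallel depth-wise convolutions for multi-scale features.
        self.proj_in = nn.Conv2d(channels, channels, kernel_size=1)
        self.multi_scale = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        ])
        self.proj_out = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.GELU()
        # Channel attention (SE block) in the residual connection.
        self.channel_attn = SqueezeExcite(channels)

    def forward(self, feats):
        x = self.linear_fuse(torch.cat(feats, dim=1))      # concatenate along channels, then fuse
        y = self.act(self.proj_in(x))
        y = sum(conv(y) for conv in self.multi_scale)      # aggregate multi-scale responses
        y = self.proj_out(y)
        return y + self.channel_attn(x)                    # residual path re-weighted by SE attention


if __name__ == "__main__":
    fuse = FusionBlock(channels=64, num_modalities=4)
    feats = [torch.randn(1, 64, 32, 32) for _ in range(4)]  # e.g. features from RGB, A, D, N
    print(fuse(feats).shape)                                 # torch.Size([1, 64, 32, 32])
```

In this sketch, the parallel depth-wise convolutions capture context at multiple receptive-field sizes, while the SE block on the residual connection re-weights channels so that more informative modality features dominate the fused representation; this mirrors the description in the caption above, though exact hyperparameters are assumed.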

Comparison with Current State-of-the-Art Models



Table 1: Performance comparison on the FMB (left) and MCubeS (right) datasets. Here A, D, and N represent angle of linear polarization (AoLP), degree of linear polarization (DoLP), and near-infrared (NIR), respectively.


Figure 2: Visualization of predictions on the MCubeS and PST900 datasets. Figure 2(a) shows predictions from CMNeXt and our model on the MCubeS dataset for RGB-only and all-modality (RGB-A-D-N) inputs. For brevity, we only show the RGB image and the ground truth material segmentation maps along with the predictions. Figure 2(b) shows predictions from RTFNet, FDCNet, and our model for RGB-thermal input modalities on the PST900 dataset. Our model produces better predictions on both datasets.


Table 2: Per-class % IoU comparison on MCubeS dataset. Our proposed MMSFormer model shows better performance in detecting most of the classes compared to the current state-of-the-art models. ∗ indicates that the code and pretrained model from the authors were used to generate the results.


Table 3: Per-class % IoU comparison on the FMB dataset for both RGB-only and RGB-Infrared inputs. We show the comparison for the 8 classes (out of 14) with published results. T-Lamp and T-Sign stand for Traffic Lamp and Traffic Sign, respectively. Our model outperforms all other methods on every class except the Truck class.


Table 4: Performance comparison on PST900 dataset. We show per-class % IoU as well as % mIoU for all the classes.

Effect of Adding Different Modalities



Table 5: Per-class % IoU comparison on the Multimodal Material Segmentation (MCubeS) dataset for different modality combinations. As we add modalities incrementally, overall performance increases gradually. This table also shows that specific modality combinations help identify specific types of materials better.


Figure 3: Visualization of predicted segmentation maps for different modality combinations on MCubeS and FMB datasets. Both figures show that prediction accuracy increases as we incrementally add new modalities. They also illustrate the fusion block’s ability to effectively combine information from different modality combinations.

Paper


BibTeX


@ARTICLE{Reza2024MMSFormer,
  author={Reza, Md Kaykobad and Prater-Bennette, Ashley and Asif, M. Salman},
  journal={IEEE Open Journal of Signal Processing},
  title={MMSFormer: Multimodal Transformer for Material and Semantic Segmentation},
  year={2024},
  volume={},
  number={},
  pages={1-12},
  keywords={Image segmentation;Feature extraction;Transformers;Task analysis;Fuses;Semantic segmentation;Decoding;multimodal image segmentation;material segmentation;semantic segmentation;multimodal fusion;transformer},
  doi={10.1109/OJSP.2024.3389812}
}