Adapting multimodal models for different missing modality scenarios
1 University of California, Riverside, CA, USA  2 Air Force Research Laboratory, NY, USA
Figure 1: a) Overview of our model adaptation approach for robust MML. A model pretrained on all the modalities is adapted using a small number of learnable parameters to handle different modality combinations. We insert adaptable layers after each layer of the encoders and the fusion block to learn the modulation as a function of the available input modalities and compensate for the missing ones. The grayed-out branch (missing modality) is inactive and does not contribute to the output. b) Low-rank model adaptation computes features using frozen weights and low-rank weight updates and combines them. c) Scale-and-shift feature adaptation transforms the input by element-wise multiplication and addition.
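As a rough illustration of the two adaptation schemes in Figure 1(b) and 1(c), the PyTorch sketch below shows a low-rank adapter wrapped around a frozen linear layer and a scale-and-shift module. Module names, the rank, and the initialization choices are our assumptions for exposition, not the authors' released code.

```python
# Minimal sketch of the two adaptation schemes in Figure 1(b, c).
# Names, rank, and initialization are illustrative assumptions.
import torch
import torch.nn as nn


class LowRankAdapter(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (Figure 1b)."""

    def __init__(self, linear: nn.Linear, rank: int = 4):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():  # keep pretrained weights frozen
            p.requires_grad = False
        self.down = nn.Linear(linear.in_features, rank, bias=False)
        self.up = nn.Linear(rank, linear.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a zero (identity) update

    def forward(self, x):
        # Combine features from frozen weights with the low-rank update.
        return self.linear(x) + self.up(self.down(x))


class ScaleShiftAdapter(nn.Module):
    """Element-wise scale-and-shift of intermediate features (Figure 1c)."""

    def __init__(self, dim: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # Modulate features by element-wise multiplication and addition.
        return x * self.scale + self.shift
```

In either case only the adapter parameters are trained for a given input modality combination, which keeps the number of learnable parameters a small fraction of the full network.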
Multimodal learning seeks to utilize data from multiple sources to improve the overall performance of downstream tasks. Ideally, redundancy across correlated modalities should make multimodal systems robust to missing or corrupted observations. However, we observe that the performance of several existing multimodal networks deteriorates significantly if one or more modalities are absent at test time. To enable robustness to missing modalities, we propose a simple and parameter-efficient adaptation procedure for pretrained multimodal networks. In particular, we exploit modulation of intermediate features to compensate for the missing modalities. We demonstrate that such adaptation can partially bridge the performance drop due to missing modalities and, in some cases, outperform independent, dedicated networks trained for the available modality combinations. The proposed adaptation requires an extremely small number of parameters (e.g., fewer than 1% of the total parameters) and is applicable to a wide range of modality combinations and tasks. We conduct a series of experiments to highlight the missing-modality robustness of our proposed method on five different multimodal tasks across seven datasets. Our method demonstrates versatility across various tasks and datasets, and outperforms existing methods for robust multimodal learning with missing modalities.
Table 1: Performance comparison with different baseline methods for multimodal semantic segmentation on the MFNet and NYUDv2 datasets and multimodal material segmentation on the MCubeS dataset. We use CMNeXt as the base model. Bold indicates the best results.
Figure 2: Examples of predicted segmentation maps for the Pretrained and Adapted models. The title above each subimage shows the method name (available modalities). The CMNeXt column shows predictions with all modalities available. Segmentation quality improves significantly after model adaptation for all input modality combinations. Green boxes highlight areas with salient differences in the results (e.g., cars and humans missing in the Pretrained model with missing modalities but visible in the Adapted model). For the MCubeS dataset, we show only the RGB input images for brevity. A, D, and N denote angle of linear polarization, degree of linear polarization, and near-infrared, respectively.
Table 2: Performance comparison with existing robust methods on the MFNet dataset. The RGB and Thermal columns report performance when only RGB or only thermal is available, respectively. The Average column reports the average performance when one of the two modalities is missing. ‘-’ indicates that results for those cells are not published. ∗ indicates that available code and pretrained models from the authors were used to generate the results.
Table 3: Performance comparison with existing robust methods on the NYUDv2 dataset. The RGB and Depth columns report performance when only RGB or only depth is available, respectively. The Average column reports the average performance when one of the two modalities is missing. ∗ indicates that available code and pretrained models from the authors were used to generate the results. Other results are taken from the corresponding papers.
Table 4: Performance comparison (% mIoU) of different parameter-efficient adaptation techniques on the MFNet, NYUDv2, and MCubeS datasets. Each column reports the mIoU of the Adapted model for the corresponding available modalities, and Avg reports the average performance. A and D denote angle and degree of linear polarization, respectively.
Table 5: Comparison of our adaptation technique with existing methods for multimodal sentiment analysis on the CMU-MOSI and CMU-MOSEI datasets.
Figure 3: Cosine similarity between complete-modality and missing-modality features for the pretrained model (Pretrained) and for the adapted model (Adapted) on the MCubeS and NTU RGB+D datasets. Missing-modality features of the adapted models show higher similarity to the complete-modality features than those of the pretrained models, indicating smaller deviation and better handling of missing modalities.
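The analysis in Figure 3 can be reproduced with a simple per-sample cosine similarity between feature maps extracted with all modalities present and with one modality dropped. The sketch below is a hypothetical illustration of this measurement; the function name and the flattening of feature maps are our assumptions.

```python
# Hypothetical sketch of the feature-similarity analysis in Figure 3:
# compare intermediate features extracted with all modalities present
# against features extracted with one modality missing.
import torch
import torch.nn.functional as F


def feature_cosine_similarity(feats_complete: torch.Tensor,
                              feats_missing: torch.Tensor) -> float:
    """Mean cosine similarity between two batches of feature maps of shape (N, C, ...)."""
    a = feats_complete.flatten(start_dim=1)  # (N, C*H*W)
    b = feats_missing.flatten(start_dim=1)
    return F.cosine_similarity(a, b, dim=1).mean().item()
```

Similarity is computed per sample and averaged over the dataset; higher values indicate that the missing-modality features deviate less from the complete-modality features.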