Adapting multimodal models for different missing modality scenarios
1 University of California, Riverside, CA, USA  2 Air Force Research Laboratory, NY, USA
Figure 1: a) Overview of our model adaptation approach for robust MML. A model pretrained on all the modalities is adapted using a small number of learnable parameters to handle different modality combinations. We insert adaptable layers after each layer of the encoders and the fusion block to learn the modulation as a function of the available input modalities and compensate for the missing ones. The grayed-out branch (missing modality) is inactive and does not contribute to the output. b) Low-rank model adaptation computes features using frozen weights and low-rank weight updates and combines them. c) Scale-and-shift feature adaptation transforms the input by element-wise multiplication and addition.
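As a rough illustration of the two adaptation schemes in Figure 1(b) and 1(c), the PyTorch sketch below shows a low-rank adapter wrapped around a frozen linear layer and a scale-and-shift module. Module names, the rank, and the initialization choices are our assumptions for exposition, not the authors' released code.

```python
# Minimal sketch of the two adaptation schemes in Figure 1(b, c).
# Names, rank, and initialization are illustrative assumptions.
import torch
import torch.nn as nn


class LowRankAdapter(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (Figure 1b)."""

    def __init__(self, linear: nn.Linear, rank: int = 4):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():  # keep pretrained weights frozen
            p.requires_grad = False
        self.down = nn.Linear(linear.in_features, rank, bias=False)
        self.up = nn.Linear(rank, linear.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a zero (identity) update

    def forward(self, x):
        # Combine features from frozen weights with the low-rank update.
        return self.linear(x) + self.up(self.down(x))


class ScaleShiftAdapter(nn.Module):
    """Element-wise scale-and-shift of intermediate features (Figure 1c)."""

    def __init__(self, dim: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # Modulate features by element-wise multiplication and addition.
        return x * self.scale + self.shift
```

In either case only the adapter parameters are trained for a given input modality combination, which keeps the number of learnable parameters a small fraction of the full network.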
Multimodal learning seeks to utilize data from multiple sources to improve the overall performance of downstream tasks. Ideally, redundancy across correlated modalities should make multimodal systems robust to missing or corrupted observations. However, we observe that the performance of several existing multimodal networks deteriorates significantly if one or more modalities are absent at test time. To enable robustness to missing modalities, we propose a simple and parameter-efficient adaptation procedure for pretrained multimodal networks. In particular, we exploit modulation of intermediate features to compensate for the missing modalities. We demonstrate that such adaptation can partially bridge the performance drop due to missing modalities and, in some cases, outperform independent, dedicated networks trained for the available modality combinations. The proposed adaptation requires an extremely small number of parameters (e.g., fewer than 1% of the total parameters) and is applicable to a wide range of modality combinations and tasks. We conduct a series of experiments to highlight the missing-modality robustness of our proposed method on five different multimodal tasks across seven datasets. Our method demonstrates versatility across various tasks and datasets, and outperforms existing methods for robust multimodal learning with missing modalities.
Table 1: Performance comparison with different baseline methods for multimodal semantic segmentation on the MFNet and NYUDv2 datasets and multimodal material segmentation on the MCubeS dataset. We use CMNeXt as the base model. Bold indicates the best results.
Figure 2: Examples of predicted segmentation maps for the Pretrained and Adapted models. The title above each subimage shows the method name (available modalities). The CMNeXt column shows predictions with all modalities available. Segmentation quality improves significantly after model adaptation for all input modality combinations. Green boxes highlight areas with salient differences in the results (e.g., cars and humans missing in the Pretrained model with missing modalities but visible in the Adapted model). For the MCubeS dataset, we show only the RGB input images for brevity. A, D, and N denote angle of linear polarization, degree of linear polarization, and near-infrared, respectively.
Table 2: Performance comparison with existing robust methods on the MFNet dataset. The RGB and Thermal columns report performance when only RGB or only thermal is available, respectively. The Average column reports the average performance when one of the two modalities is missing. ‘-’ indicates that results for those cells are not published. ∗ indicates that available code and pretrained models from the authors were used to generate the results.
Table 3: Performance comparison with existing robust methods on the NYUDv2 dataset. The RGB and Depth columns report performance when only RGB or only depth is available, respectively. The Average column reports the average performance when one of the two modalities is missing. ∗ indicates that available code and pretrained models from the authors were used to generate the results. Other results are taken from the corresponding papers.
Table 4: Performance comparison (% mIoU) of different parameter-efficient adaptation techniques on the MFNet, NYUDv2, and MCubeS datasets. Each column reports the mIoU of the Adapted model for the corresponding available modalities, and Avg reports the average performance. A and D denote angle and degree of linear polarization, respectively.
Table 5: Comparison of our adaptation technique with existing methods for multimodal sentiment analysis on the CMU-MOSI and CMU-MOSEI datasets.
Figure 3: Cosine similarity between complete-modality and missing-modality features for the pretrained model (Pretrained) and for the adapted model (Adapted) on the MCubeS and NTU RGB+D datasets. Missing-modality features of the adapted models show higher similarity to the complete-modality features than those of the pretrained models, indicating smaller deviation and better handling of missing modalities.
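The analysis in Figure 3 can be reproduced with a simple per-sample cosine similarity between feature maps extracted with all modalities present and with one modality dropped. The sketch below is a hypothetical illustration of this measurement; the function name and the flattening of feature maps are our assumptions.

```python
# Hypothetical sketch of the feature-similarity analysis in Figure 3:
# compare intermediate features extracted with all modalities present
# against features extracted with one modality missing.
import torch
import torch.nn.functional as F


def feature_cosine_similarity(feats_complete: torch.Tensor,
                              feats_missing: torch.Tensor) -> float:
    """Mean cosine similarity between two batches of feature maps of shape (N, C, ...)."""
    a = feats_complete.flatten(start_dim=1)  # (N, C*H*W)
    b = feats_missing.flatten(start_dim=1)
    return F.cosine_similarity(a, b, dim=1).mean().item()
```

Similarity is computed per sample and averaged over the dataset; higher values indicate that the missing-modality features deviate less from the complete-modality features.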