Learning to approximate missing modality features efficiently
1 University of California, Riverside, CA, USA  2 Amazon, CA, USA
Multimodal models often experience a significant performance drop when one or more modalities are missing during inference. To address this challenge, we propose a simple yet effective approach that enhances robustness to missing modalities while maintaining strong performance when all modalities are available. Our method introduces cross-modal proxy tokens (CMPTs), which approximate the class token of a missing modality by attending only to the tokens of the available modality, without requiring explicit modality generation or auxiliary networks. To learn these approximations efficiently and with minimal computational overhead, we employ low-rank adapters in frozen unimodal encoders and jointly optimize an alignment loss with a task-specific loss. Extensive experiments on five multimodal datasets show that our method outperforms state-of-the-art baselines across various missing-modality rates while achieving competitive results in complete-modality settings. Overall, our method offers a flexible and efficient solution for robust multimodal learning.
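In code terms, a CMPT can be pictured as a single learnable query token that cross-attends over the available modality's token sequence and outputs an approximation of the class token the missing modality's encoder would have produced. The PyTorch sketch below illustrates this reading only; the module name `CrossModalProxyToken`, the multi-head attention layout, and the layer norm are our assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CrossModalProxyToken(nn.Module):
    """Minimal sketch (an assumption, not the authors' code): a learnable
    proxy token attends over the available modality's tokens to approximate
    the class token of the missing modality."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Learnable query acting as the proxy for the missing modality's [CLS].
        self.proxy = nn.Parameter(0.02 * torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, available_tokens: torch.Tensor) -> torch.Tensor:
        # available_tokens: (B, N, dim) token sequence from the present modality.
        q = self.proxy.expand(available_tokens.size(0), -1, -1)
        approx, _ = self.attn(q, available_tokens, available_tokens)
        return self.norm(approx.squeeze(1))  # (B, dim), stand-in for the missing class token
```

When both modalities are present during training, the proxy output can be aligned with the real class token of the other modality via an alignment loss, and substituted for it whenever that modality is missing at inference.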
Figure 1: (a) We introduce Cross-Modal Proxy Tokens (CMPTs), a novel approach to address missing modality challenges. CMPTs effectively learn to approximate missing modality class tokens by adapting pretrained encoders through a joint optimization of alignment and task-specific objectives. Our approach accommodates both complete and missing modalities during training and inference, thereby enhancing robustness across varying missing modality scenarios. (b) CMPTs achieve state-of-the-art performance, consistently outperforming recent baseline methods in both complete and missing modality scenarios. The radar plot illustrates F1-macro scores on the MM-IMDb dataset across varying modality availability.
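The joint optimization of alignment and task-specific objectives mentioned in (a) can be read as a weighted sum of a task loss on the fused prediction and an alignment term that pulls each CMPT output toward the true class token of the other modality when it is available. The sketch below assumes a cosine-based alignment term and an illustrative weight `lambda_align`; the exact loss forms and weighting are not specified here.

```python
import torch.nn.functional as F

def joint_loss(logits, labels, proxy_cls, true_cls, lambda_align=1.0):
    # Task loss on the fused prediction (binary cross-entropy would be the
    # natural choice for multi-label MM-IMDb; cross-entropy is shown for brevity).
    task = F.cross_entropy(logits, labels)
    # Alignment loss: pull the CMPT approximation toward the real class token
    # of the other modality (cosine distance is our assumption, not the paper's
    # stated choice).
    align = 1.0 - F.cosine_similarity(proxy_cls, true_cls, dim=-1).mean()
    return task + lambda_align * align
```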
Table 1: CMPTs outperform existing baselines across various missing-modality configurations on the MM-IMDb and UPMC Food-101 datasets.
Table 2: CMPTs outperform existing baselines even when an entire modality is missing during inference.
Figure 1: Models are trained with 100% image + 100% text and evaluated with 100% image + x% text. CMPTs generalize effectively across varying missing-modality rates, retaining strong performance even with only 10% text available.
Figure 2: Per-class improvement on MM-IMDb. CMPTs enhance performance across most classes, especially modality-sensitive ones (e.g., Horror, Animation).
Figure 3: t-SNE visualization of fused embeddings. CMPTs (blue) align missing-modality features closely with full-modality embeddings (green).
Figure 4: Attention maps show CMPTs focus on semantically relevant regions, correcting mispredictions caused by missing modalities.