Learning to approximate missing modality features efficiently
1 University of California, Riverside, CA, USA  2 Amazon, CA, USA
Multimodal models often experience a significant performance drop when one or more modalities are missing during inference. To address this challenge, we propose a simple yet effective approach that enhances robustness to missing modalities while maintaining strong performance when all modalities are available. Our method introduces cross-modal proxy tokens (CMPTs), which approximate the class token of a missing modality by attending only to the tokens of the available modality, without requiring explicit modality generation or auxiliary networks. To learn these approximations efficiently and with minimal computational overhead, we employ low-rank adapters in frozen unimodal encoders and jointly optimize an alignment loss with a task-specific loss. Extensive experiments on five multimodal datasets show that our method outperforms state-of-the-art baselines across various missing-modality rates while achieving competitive results in complete-modality settings. Overall, our method offers a flexible and efficient solution for robust multimodal learning.
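In code terms, a CMPT can be pictured as a single learnable query token that cross-attends over the available modality's token sequence and outputs an approximation of the class token the missing modality's encoder would have produced. The PyTorch sketch below illustrates this reading only; the module name `CrossModalProxyToken`, the multi-head attention layout, and the layer norm are our assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CrossModalProxyToken(nn.Module):
    """Minimal sketch (an assumption, not the authors' code): a learnable
    proxy token attends over the available modality's tokens to approximate
    the class token of the missing modality."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Learnable query acting as the proxy for the missing modality's [CLS].
        self.proxy = nn.Parameter(0.02 * torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, available_tokens: torch.Tensor) -> torch.Tensor:
        # available_tokens: (B, N, dim) token sequence from the present modality.
        q = self.proxy.expand(available_tokens.size(0), -1, -1)
        approx, _ = self.attn(q, available_tokens, available_tokens)
        return self.norm(approx.squeeze(1))  # (B, dim), stand-in for the missing class token
```

When both modalities are present during training, the proxy output can be aligned with the real class token of the other modality via an alignment loss, and substituted for it whenever that modality is missing at inference.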
Figure 1: (a) We introduce Cross-Modal Proxy Tokens (CMPTs), a novel approach to address missing modality challenges. CMPTs effectively learn to approximate missing modality class tokens by adapting pretrained encoders through a joint optimization of alignment and task-specific objectives. Our approach accommodates both complete and missing modalities during training and inference, thereby enhancing robustness across varying missing modality scenarios. (b) CMPTs achieve state-of-the-art performance, consistently outperforming recent baseline methods in both complete and missing modality scenarios. The radar plot illustrates F1-macro scores on the MM-IMDb dataset across varying modality availability.
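The joint optimization of alignment and task-specific objectives mentioned in (a) can be read as a weighted sum of a task loss on the fused prediction and an alignment term that pulls each CMPT output toward the true class token of the other modality when it is available. The sketch below assumes a cosine-based alignment term and an illustrative weight `lambda_align`; the exact loss forms and weighting are not specified here.

```python
import torch.nn.functional as F

def joint_loss(logits, labels, proxy_cls, true_cls, lambda_align=1.0):
    # Task loss on the fused prediction (binary cross-entropy would be the
    # natural choice for multi-label MM-IMDb; cross-entropy is shown for brevity).
    task = F.cross_entropy(logits, labels)
    # Alignment loss: pull the CMPT approximation toward the real class token
    # of the other modality (cosine distance is our assumption, not the paper's
    # stated choice).
    align = 1.0 - F.cosine_similarity(proxy_cls, true_cls, dim=-1).mean()
    return task + lambda_align * align
```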
Table 1: CMPTs outperform existing baselines across various missing-modality configurations on the MM-IMDb and UPMC Food-101 datasets.
Table 2: CMPTs outperform existing baselines even when an entire modality is missing during inference.
Figure 1: Models are trained with 100% image + 100% text and evaluated with 100% image + x% text. CMPTs generalize effectively across varying missing-modality rates, retaining strong performance even with only 10% text available.
Figure 2: Per-class improvement on MM-IMDb. CMPTs enhance performance across most classes, especially modality-sensitive ones (e.g., Horror, Animation).
Figure 3: t-SNE visualization of fused embeddings. CMPTs (blue) align missing-modality features closely with full-modality embeddings (green).
Figure 4: Attention maps show CMPTs focus on semantically relevant regions, correcting mispredictions caused by missing modalities.