Foundations and Recent Trends in Robust Multimodal Learning

A Tutorial at the IEEE International Conference on Image Processing (ICIP) 2025


M. Salman Asif, Md Kaykobad Reza

University of California, Riverside, CA, USA

Summary


This tutorial provides a comprehensive overview of robust multimodal learning, covering both foundational concepts and recent advances. It reflects the presenters' perspective on this broad field, with particular emphasis on the underlying architectures and models rather than solely on performance analysis. The material is drawn from a wide range of academic papers, blogs, and other resources. The tutorial begins with the fundamentals of multimodal learning, including data fusion, alignment, and representation learning. It then examines the challenges of ensuring robustness, especially in scenarios with missing, unaligned, or noisy modalities. Recent trends in multimodal learning and their applications are discussed next, and the session concludes with open questions and promising directions for future research.
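As a toy illustration of the fusion and missing-modality issues mentioned above, the sketch below shows a masked late-fusion module: each modality is encoded separately, and the fused representation averages only the modalities that are actually observed. This is the presenters-independent, simplified example written for this summary, not code from the tutorial or the cited papers; all module names, dimensions, and the masking scheme are illustrative assumptions.

    # Minimal sketch (illustrative assumption, not the tutorial's method):
    # late fusion of per-modality embeddings with a presence mask so that
    # missing modalities are ignored at inference time.
    import torch
    import torch.nn as nn

    class MaskedLateFusion(nn.Module):
        def __init__(self, input_dims, embed_dim, num_classes):
            super().__init__()
            # One small encoder per modality (e.g., RGB, depth, audio features).
            self.encoders = nn.ModuleList(
                [nn.Linear(d, embed_dim) for d in input_dims]
            )
            self.classifier = nn.Linear(embed_dim, num_classes)

        def forward(self, inputs, present_mask):
            # inputs: list of tensors, one per modality, shape (batch, input_dim_i)
            # present_mask: (batch, num_modalities), 1 where a modality is observed
            feats = torch.stack(
                [enc(x) for enc, x in zip(self.encoders, inputs)], dim=1
            )  # (batch, num_modalities, embed_dim)
            mask = present_mask.unsqueeze(-1).to(feats.dtype)
            # Average over available modalities only; clamp avoids division by
            # zero when every modality is missing for a sample.
            fused = (feats * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
            return self.classifier(fused)

    # Usage: two modalities, the second one missing for the second sample.
    model = MaskedLateFusion(input_dims=[16, 32], embed_dim=64, num_classes=5)
    x_rgb, x_depth = torch.randn(2, 16), torch.randn(2, 32)
    mask = torch.tensor([[1.0, 1.0], [1.0, 0.0]])
    logits = model([x_rgb, x_depth], mask)
    print(logits.shape)  # torch.Size([2, 5])

Averaging over observed modalities is only one simple strategy; the tutorial discusses alternatives such as parameter-efficient adaptation and cross-modal proxy tokens for handling missing inputs (see references 1 and 3 below).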

Index of Topics


Part 1: Introduction to Multimodal Learning

Part 2: Fundamentals of Multimodal Learning

Part 3: Challenges in Robust Multimodal Learning

Part 4: Recent Advances & Applications

Part 5: Open Problems & Future Directions

Tutorial Slides


References


  1. Reza, Md Kaykobad, Ashley Prater-Bennette, and M. Salman Asif. "Robust multimodal learning with missing modalities via parameter-efficient adaptation." IEEE TPAMI (2024).
  2. Reza, Md Kaykobad, Ashley Prater-Bennette, and M. Salman Asif. "MMSFormer: Multimodal transformer for material and semantic segmentation." IEEE Open Journal of Signal Processing 5 (2024): 599-610.
  3. Reza, Md Kaykobad, et al. "Robust Multimodal Learning via Cross-Modal Proxy Tokens." arXiv:2501.17823.
  4. Zhang, Yiyuan, et al. "Meta-transformer: A unified framework for multimodal learning." arXiv:2307.10802.
  5. Vaswani, Ashish, et al. "Attention is all you need." NeurIPS 2017.
  6. Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv:2010.11929.
  7. Kim, Wonjae, et al. "ViLT: Vision-and-language transformer without convolution or region supervision." ICML 2021.
  8. Radford, Alec, et al. "Learning transferable visual models from natural language supervision." ICML 2021.
  9. Girdhar, Rohit, et al. "ImageBind: One embedding space to bind them all." CVPR 2023.
  10. Wang, Hu, et al. "Multi-modal learning with missing modality via shared-specific feature modelling." CVPR 2023.
  11. Cai, Zikui, Yaoteng Tan, and M. Salman Asif. "Targeted Unlearning with Single Layer Unlearning Gradient." ICML 2025.
  12. Chakraborty, T., et al. "Can Textual Unlearning Solve Cross-Modality Safety Alignment?" EMNLP Findings 2024.
  13. Alayrac, Jean-Baptiste, et al. "Flamingo: a visual language model for few-shot learning." NeurIPS 2022.
  14. Liu, Haotian, et al. "Visual instruction tuning." NeurIPS 2023.
  15. Radford, Alec, et al. "Robust speech recognition via large-scale weak supervision." ICML 2023.
  16. Han, Jiaming, et al. "OneLLM: One framework to align all modalities with language." CVPR 2024.
  17. Wu, Shengqiong, et al. "NExT-GPT: Any-to-any multimodal LLM." ICML 2024.
  18. Imran, Sheikh Asif, et al. "LLaSA: A multimodal LLM for human activity analysis through wearable and smartphone sensors." arXiv:2406.14498.
  19. Zhang, Yuwei, et al. "SensorLM: Learning the Language of Wearable Sensors." arXiv:2506.09108.
  20. Lee, Yi-Lun, et al. "Multimodal prompting with missing modalities for visual recognition." CVPR 2023.
  21. Li, Junnan, et al. "Align before fuse: Vision and language representation learning with momentum distillation." NeurIPS 2021.
  22. MMML Tutorials: https://cmu-multicomp-lab.github.io/mmml-tutorial/icml2023/
  23. Based on insights from IBM Research, CMU MultiComp Lab, and Wikipedia.