Foundations and Recent Trends in Robust Multimodal Learning

A Tutorial at the IEEE International Conference on Image Processing (ICIP) 2025


M. Salman Asif, Md Kaykobad Reza

University of California, Riverside, CA, USA

Summary


This tutorial provides a comprehensive overview of robust multimodal learning, covering both foundational concepts and recent advances. It reflects the presenters' perspective on this broad field, with particular emphasis on the underlying architectures and models rather than solely on performance analysis. The material is drawn from a wide range of academic papers, blogs, and other resources. The tutorial begins with the fundamentals of multimodal learning, including data fusion, alignment, and representation learning. It then examines the challenges of ensuring robustness, especially in scenarios with missing, unaligned, or noisy modalities. Recent trends in multimodal learning and their applications are discussed next, and the session concludes with open questions and promising directions for future research.
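As a toy illustration of the fusion and missing-modality issues mentioned above, the sketch below shows a masked late-fusion module: each modality is encoded separately, and the fused representation averages only the modalities that are actually observed. This is the presenters-independent, simplified example written for this summary, not code from the tutorial or the cited papers; all module names, dimensions, and the masking scheme are illustrative assumptions.

    # Minimal sketch (illustrative assumption, not the tutorial's method):
    # late fusion of per-modality embeddings with a presence mask so that
    # missing modalities are ignored at inference time.
    import torch
    import torch.nn as nn

    class MaskedLateFusion(nn.Module):
        def __init__(self, input_dims, embed_dim, num_classes):
            super().__init__()
            # One small encoder per modality (e.g., RGB, depth, audio features).
            self.encoders = nn.ModuleList(
                [nn.Linear(d, embed_dim) for d in input_dims]
            )
            self.classifier = nn.Linear(embed_dim, num_classes)

        def forward(self, inputs, present_mask):
            # inputs: list of tensors, one per modality, shape (batch, input_dim_i)
            # present_mask: (batch, num_modalities), 1 where a modality is observed
            feats = torch.stack(
                [enc(x) for enc, x in zip(self.encoders, inputs)], dim=1
            )  # (batch, num_modalities, embed_dim)
            mask = present_mask.unsqueeze(-1).to(feats.dtype)
            # Average over available modalities only; clamp avoids division by
            # zero when every modality is missing for a sample.
            fused = (feats * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
            return self.classifier(fused)

    # Usage: two modalities, the second one missing for the second sample.
    model = MaskedLateFusion(input_dims=[16, 32], embed_dim=64, num_classes=5)
    x_rgb, x_depth = torch.randn(2, 16), torch.randn(2, 32)
    mask = torch.tensor([[1.0, 1.0], [1.0, 0.0]])
    logits = model([x_rgb, x_depth], mask)
    print(logits.shape)  # torch.Size([2, 5])

Averaging over observed modalities is only one simple strategy; the tutorial discusses alternatives such as parameter-efficient adaptation and cross-modal proxy tokens for handling missing inputs (see references 1 and 3 below).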

Index of Topics


Part 1: Introduction to Multimodal Learning

Part 2: Fundamentals of Multimodal Learning

Part 3: Challenges in Robust Multimodal Learning

Part 4: Recent Advances & Applications

Part 5: Open Problems & Future Directions

Tutorial Slides


References


  1. Reza, Md Kaykobad, Ashley Prater-Bennette, and M. Salman Asif. "Robust multimodal learning with missing modalities via parameter-efficient adaptation." IEEE TPAMI (2024).
  2. Reza, Md Kaykobad, Ashley Prater-Bennette, and M. Salman Asif. "MMSFormer: Multimodal transformer for material and semantic segmentation." IEEE Open Journal of Signal Processing 5 (2024): 599-610.
  3. Reza, Md Kaykobad, et al. "Robust Multimodal Learning via Cross-Modal Proxy Tokens." arXiv:2501.17823.
  4. Zhang, Yiyuan, et al. "Meta-transformer: A unified framework for multimodal learning." arXiv:2307.10802.
  5. Vaswani, Ashish, et al. "Attention is all you need." NeurIPS 2017.
  6. Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv:2010.11929.
  7. Kim, Wonjae, et al. "ViLT: Vision-and-language transformer without convolution or region supervision." ICML 2021.
  8. Radford, Alec, et al. "Learning transferable visual models from natural language supervision." ICML 2021.
  9. Girdhar, Rohit, et al. "ImageBind: One embedding space to bind them all." CVPR 2023.
  10. Wang, Hu, et al. "Multi-modal learning with missing modality via shared-specific feature modelling." CVPR 2023.
  11. Cai, Zikui, Yaoteng Tan, and M. Salman Asif. "Targeted Unlearning with Single Layer Unlearning Gradient." ICML 2025.
  12. Chakraborty, T., et al. "Can Textual Unlearning Solve Cross-Modality Safety Alignment?" EMNLP Findings 2024.
  13. Alayrac, Jean-Baptiste, et al. "Flamingo: a visual language model for few-shot learning." NeurIPS 2022.
  14. Liu, Haotian, et al. "Visual instruction tuning." NeurIPS 2023.
  15. Radford, Alec, et al. "Robust speech recognition via large-scale weak supervision." ICML 2023.
  16. Han, Jiaming, et al. "OneLLM: One framework to align all modalities with language." CVPR 2024.
  17. Wu, Shengqiong, et al. "NExT-GPT: Any-to-any multimodal LLM." ICML 2024.
  18. Imran, Sheikh Asif, et al. "LLaSA: A multimodal LLM for human activity analysis through wearable and smartphone sensors." arXiv:2406.14498.
  19. Zhang, Yuwei, et al. "SensorLM: Learning the Language of Wearable Sensors." arXiv:2506.09108.
  20. Lee, Yi-Lun, et al. "Multimodal prompting with missing modalities for visual recognition." CVPR 2023.
  21. Li, Junnan, et al. "Align before fuse: Vision and language representation learning with momentum distillation." NeurIPS 2021.
  22. MMML Tutorials: https://cmu-multicomp-lab.github.io/mmml-tutorial/icml2023/
  23. Based on insights from IBM Research, CMU MultiComp Lab, and Wikipedia.