Multimodal AI
AI systems that process and generate multiple data types — text, images, audio, and video — simultaneously.
Definition
Multimodal AI integrates multiple sensory modalities within a single model, enabling cross-modal reasoning and generation. Early AI systems were modality-specific (image classifiers, language models); modern multimodal systems handle text, images, audio, video, and structured data jointly.
Architecturally, multimodal models typically encode each modality with a modality-specific encoder (e.g., a ViT for images, Whisper for audio), project the resulting embeddings into a shared space, and process them with a shared transformer. CLIP demonstrated that contrastive training on image-text pairs learns powerful joint representations.
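The CLIP-style contrastive objective can be sketched in plain NumPy. This is a minimal illustration, not CLIP's actual implementation: the embeddings below are random stand-ins for encoder outputs already projected into the shared space, and the function name and temperature value are illustrative.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    diag = np.arange(len(logits))        # matching pairs sit on the diagonal

    def xent(l):
        # cross-entropy of the softmax over each row, target = diagonal entry
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[diag, diag].mean()

    # average the two directions: image -> text and text -> image
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
batch, dim = 4, 8
# stand-ins for projected outputs of an image encoder and a text encoder
image_embeddings = rng.normal(size=(batch, dim))
text_embeddings = rng.normal(size=(batch, dim))
loss = clip_contrastive_loss(image_embeddings, text_embeddings)
```

Training pushes each image embedding toward its paired caption and away from the other captions in the batch, which is what makes the shared space useful for zero-shot retrieval and classification.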
In practice, GPT-4V can answer questions about uploaded images; Gemini Ultra handles text, images, audio, and video natively; and medical AI models jointly interpret clinical notes and radiological images. Multimodality is becoming the norm for frontier AI systems.
Examples
- GPT-4V
- Gemini
- Claude 3 (vision)
- DALL-E (text→image)
- CLIP
Related Terms
Large Language Model (LLM)
A transformer-based AI system trained on billions of tokens of text, capable of generating, reasoning about, and transforming language.
Foundation Model
A large model trained on broad data that can be adapted to many downstream tasks via fine-tuning or prompting.
Generative AI
AI systems that create new content — text, images, audio, video, code — by learning patterns from training data.