Techniques

Self-Supervised Learning

Learning representations from unlabelled data by creating supervisory signals from the data itself.

Definition

Self-supervised learning (SSL) creates labels automatically from the structure of the data itself, eliminating the need for expensive human annotation. The model is trained on a pretext task — such as predicting masked tokens, reconstructing masked image patches, or matching augmented views of the same input (contrastive learning) — that requires no human labels but forces the model to learn useful representations.
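A minimal sketch of how a pretext task manufactures its own labels, using masked-token prediction as the example. The function name, masking rate, and mask symbol here are illustrative choices, not taken from any particular library:

```python
import random

def make_mlm_example(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Build a masked-language-modelling training pair from raw tokens.

    The labels come from the data itself: each masked position keeps its
    original token as the prediction target, so no human annotation is
    needed. (Illustrative sketch; real implementations add refinements
    such as random-token replacement.)
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)        # target = the original token
        else:
            inputs.append(tok)
            labels.append(None)       # no loss computed at this position
    return inputs, labels

inputs, labels = make_mlm_example("the cat sat on the mat".split(), seed=1)
```

The model is then trained to recover the original token wherever the input shows `[MASK]` — the supervisory signal is entirely self-generated.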

Language model pre-training (next-token prediction, masked language modelling) is a form of SSL. For vision, methods include MAE (Masked Autoencoder), DINO, and SimCLR. CLIP uses contrastive SSL on paired image-text data from the internet.
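The contrastive objective behind CLIP-style methods can be sketched as a symmetric InfoNCE loss: paired image and text embeddings in a batch attract each other, while all other pairings act as negatives. This is a simplified illustration, not CLIP's exact implementation; the function name and default temperature are assumptions:

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (N, D) arrays; row i of each is a positive pair,
    and every other row in the batch serves as a negative.
    """
    # L2-normalise so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (N, N) similarity matrix

    def xent(l):
        # cross-entropy where the "label" for row i is column i (its pair)
        l = l - l.max(axis=1, keepdims=True)      # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Note that the "labels" are just the batch indices of the pairs — supervision again comes for free from the data's structure.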

SSL has shown that representations learned without labels can be remarkably general, transferring to downstream tasks with minimal fine-tuning. It underpins the foundation model paradigm: it has largely displaced supervised pre-training in language and is increasingly competitive with it in vision.

Examples

  • GPT pre-training (predict next token)
  • BERT (mask and predict)
  • CLIP (contrastive image-text)
  • DINO (self-distillation)
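The first example's pretext task is the simplest of all: the target sequence is just the input shifted by one position. A minimal sketch (the helper name is illustrative):

```python
def next_token_pairs(tokens):
    """Pretext labels for GPT-style pre-training: the target at each
    position is simply the next token in the sequence."""
    return tokens[:-1], tokens[1:]

# x[i] is the context token, y[i] the token the model must predict
x, y = next_token_pairs(["the", "cat", "sat", "on", "the", "mat"])
```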