Data Labeling / Annotation
The process of tagging raw data with labels, bounding boxes, or other metadata to create training datasets for supervised learning.
Definition
Data labeling (or annotation) is the human-intensive process of adding ground truth labels to raw data so that supervised ML models can learn from it. Image labeling involves drawing bounding boxes, segmentation masks, or classification tags; text labeling involves sentiment labels, entity tags, or intent classifications; speech labeling involves transcriptions.
High-quality labels are essential: "garbage in, garbage out" applies strongly to ML. Label quality depends on clear labeling guidelines, quality control checks, and agreement metrics between annotators, such as inter-annotator agreement scores or Cohen's kappa, which corrects raw agreement for agreement expected by chance. Data labeling platforms (Scale AI, Labelbox, Snorkel AI) use combinations of human annotators, automated labeling, and active learning to reduce costs.
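As a concrete illustration of one such agreement metric, here is a minimal sketch of Cohen's kappa for two annotators, computed as kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is agreement expected by chance (the function name and sample labels are illustrative, not from any specific platform):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Two annotators labeling the same six sentiment examples.
a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(a, b), 3))  # → 0.333
```

Raw agreement here is 4/6, but kappa is only 0.333 because with two balanced classes the annotators would agree half the time by chance; this is why kappa is preferred over simple percent agreement when auditing label quality.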
The shift to LLMs has changed the labeling landscape: RLHF and RLAIF require preference labeling (comparing model outputs) rather than absolute labels, and instruction-tuning datasets require diverse task demonstrations. AI-assisted labeling is reducing human bottlenecks.
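To make the contrast with absolute labels concrete, a preference-labeling record might be structured along these lines (this schema and its field names are a hypothetical sketch, not the format of any particular RLHF pipeline):

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One preference-labeling item: the annotator ranks two model outputs
    for the same prompt instead of assigning an absolute label."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # annotator's judgment: "a", "b", or "tie"

record = PreferencePair(
    prompt="Summarize the report in one sentence.",
    response_a="The report covers Q3 revenue.",
    response_b="Revenue grew in Q3, driven by new enterprise contracts.",
    preferred="b",
)
print(record.preferred)  # → b
```

A reward model is then trained to score the preferred response higher than the rejected one, which is what lets human preference judgments guide fine-tuning without per-example "correct answers".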
Examples
- Scale AI (enterprise data labeling)
- Labelbox
- Amazon Mechanical Turk
- Snorkel AI (programmatic labeling)
Related Terms
Synthetic Data
Artificially generated data that mimics real data properties, used when real data is scarce, sensitive, or biased.
Machine Learning (ML)
A subset of AI where systems learn patterns from data rather than following explicitly programmed rules.
RLHF (Reinforcement Learning from Human Feedback)
A training technique where human preference ratings guide language model fine-tuning to produce more helpful, harmless outputs.