Infrastructure

Data Labeling / Annotation

The process of tagging raw data with labels, bounding boxes, or other metadata to create training datasets for supervised learning.

Definition

Data labeling (or annotation) is the human-intensive process of adding ground-truth labels to raw data so that supervised ML models can learn from it. Image labeling involves drawing bounding boxes or segmentation masks, or assigning classification tags; text labeling involves sentiment labels, entity tags, or intent classifications; speech labeling involves transcriptions.
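A labeled image record and a basic quality-control check can be sketched as follows. The schema is illustrative (loosely modeled on COCO-style bounding boxes), not a specific platform's format:

```python
# Hypothetical annotation record for one image (schema is illustrative,
# loosely modeled on COCO-style bounding-box labels).
annotation = {
    "image_id": "img_000123",
    "labels": [
        {"category": "car",        "bbox": [34, 50, 120, 80]},   # [x, y, width, height]
        {"category": "pedestrian", "bbox": [200, 60, 40, 110]},
    ],
}

def validate(annotation, image_width, image_height):
    """Basic QC check: every box must lie inside the image and have positive size."""
    for label in annotation["labels"]:
        x, y, w, h = label["bbox"]
        if w <= 0 or h <= 0 or x < 0 or y < 0:
            return False
        if x + w > image_width or y + h > image_height:
            return False
    return True
```

Checks like this are typically run automatically before labels ever reach human review, catching malformed boxes early.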

High-quality labels are essential: "garbage in, garbage out" applies strongly to ML. Label quality depends on clear labeling guidelines, quality-control checks, and inter-annotator agreement metrics such as Cohen's kappa. Data labeling platforms (Scale AI, Labelbox, Snorkel AI) use combinations of human annotators, automated labeling, and active learning to reduce costs.
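Cohen's kappa corrects raw agreement for the agreement two annotators would reach by chance given their label distributions. A minimal sketch for two annotators:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' labels on the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

A kappa near 1.0 indicates strong agreement; values near 0 mean the annotators agree no more often than chance, which usually signals ambiguous guidelines rather than careless annotators.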

The shift to LLMs has changed the labeling landscape: RLHF and RLAIF require preference labeling (comparing model outputs) rather than absolute labels, and instruction-tuning datasets require diverse task demonstrations. AI-assisted labeling is reducing human bottlenecks.
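A preference label records which of two outputs an annotator prefers, rather than an absolute score; reward-model training then consumes these as (chosen, rejected) pairs. A minimal sketch with a hypothetical record schema (field names are illustrative):

```python
# Hypothetical RLHF preference record: the annotator compares two model
# outputs for the same prompt and marks the better one.
preference_record = {
    "prompt": "Explain photosynthesis to a 10-year-old.",
    "response_a": "Plants use sunlight to turn air and water into food.",
    "response_b": "Photosynthesis is the fixation of CO2 via the Calvin cycle.",
    "preferred": "a",  # relative judgment, not an absolute quality score
}

def to_reward_pair(record):
    """Order a preference record into (chosen, rejected) for reward-model training."""
    if record["preferred"] == "a":
        chosen, rejected = record["response_a"], record["response_b"]
    else:
        chosen, rejected = record["response_b"], record["response_a"]
    return {"prompt": record["prompt"], "chosen": chosen, "rejected": rejected}
```

Pairwise comparison is generally easier and more consistent for annotators than assigning absolute scores, which is one reason preference labeling dominates RLHF pipelines.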

Examples

  • Scale AI (enterprise data labeling)
  • Labelbox
  • Amazon Mechanical Turk
  • Snorkel AI (programmatic labeling)