RAG (Retrieval-Augmented Generation)
Grounding LLM responses by first retrieving relevant documents from a knowledge base before generating an answer.
Definition
RAG addresses the knowledge limitations of LLMs — their training data has a cutoff date, they hallucinate facts, and they have no access to private or proprietary data. RAG adds a retrieval step: given a user query, the system first searches a vector database for relevant document chunks, prepends them to the prompt as context, then generates a response grounded in the retrieved material.
The pipeline involves: (1) indexing — encoding documents as embeddings and storing in a vector DB (Pinecone, Weaviate, Chroma); (2) retrieval — encoding the query, finding nearest neighbours; (3) augmentation — inserting retrieved chunks into the prompt; (4) generation — LLM generates a response citing the retrieved context.
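The four stages can be sketched end to end in a few lines. This is a minimal toy, not a production setup: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector DB, and the documents and query are invented. Generation (step 4) is left as a call site, since it would require an LLM API.

```python
import math
import re
from collections import Counter

# Toy "embedding": bag-of-words counts. A real system would call an
# embedding model (e.g. a sentence-transformer) here instead.
def embed(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# (1) Indexing: embed each chunk and store it alongside its vector
# (an in-memory list standing in for Pinecone/Weaviate/Chroma).
docs = [
    "The refund window is 30 days from purchase.",
    "Support is available on weekdays from 9am to 5pm.",
    "Premium plans include priority support.",
]
index = [(doc, embed(doc)) for doc in docs]

# (2) Retrieval: embed the query and rank chunks by similarity.
query = "How many days do I have to get a refund?"
q_vec = embed(query)
ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
top_chunks = [doc for doc, _ in ranked[:2]]

# (3) Augmentation: prepend the retrieved chunks to the prompt.
prompt = ("Answer using only this context:\n"
          + "\n".join(top_chunks)
          + f"\n\nQuestion: {query}")

# (4) Generation: pass `prompt` to an LLM; omitted in this sketch.
print(top_chunks[0])
```

With the toy vectors above, the refund-policy chunk ranks first for the refund question, so the grounding context the LLM sees actually contains the answer.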
RAG significantly reduces hallucination for factual tasks and enables LLMs to work with up-to-date or proprietary data without retraining. It is now the standard architecture for enterprise AI assistants, customer service bots, and research tools.
Examples
- Perplexity AI (search + generate)
- GitHub Copilot (code context)
- Enterprise chatbots on internal docs
Related Terms
Large Language Model (LLM)
A transformer-based AI system trained on massive text corpora (often trillions of tokens), capable of generating, reasoning about, and transforming language.
Vector Database
A database optimised for storing and querying high-dimensional embedding vectors via approximate nearest-neighbour search.
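The core query a vector database answers can be shown with an exact scan; the vectors and document IDs below are invented 3-d examples standing in for real embeddings. Production systems replace this O(n) loop with an approximate index (e.g. HNSW) that trades a little recall for large speedups.

```python
import math

# A tiny in-memory "vector store": id -> embedding vector (toy values).
store = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.2],
    "doc_c": [0.7, 0.3, 0.1],
}

def cosine(u, v):
    # Cosine similarity: dot product over the product of magnitudes.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Nearest neighbour = the stored vector most similar to the query.
query = [1.0, 0.0, 0.0]
best = max(store, key=lambda k: cosine(store[k], query))
print(best)  # doc_a points in nearly the same direction as the query
```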
Embedding
A dense numerical vector representation of data (text, images, audio) capturing semantic meaning.
Hallucination
When an AI model generates plausible-sounding but factually incorrect or fabricated information.
Prompt Engineering
The practice of crafting effective text inputs to guide LLMs toward desired outputs.