Techniques

RAG (Retrieval-Augmented Generation)

Grounding LLM responses by first retrieving relevant documents from a knowledge base before generating an answer.

Definition

RAG addresses the knowledge limitations of LLMs: their training data has a cutoff date, they hallucinate facts, and they have no access to private or proprietary data. RAG adds a retrieval step: given a user query, the system first searches a vector database for relevant document chunks, prepends them to the prompt as context, and then generates a response grounded in the retrieved material.
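The augmentation step is just prompt construction. A minimal sketch, assuming a hypothetical prompt template and already-retrieved chunks (the chunk text and citation format here are illustrative, not from any particular system):

```python
def build_augmented_prompt(query, chunks):
    """Prepend retrieved chunks to the user query as grounding context."""
    # Number each chunk so the model can cite it by index.
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below. Cite sources by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_augmented_prompt(
    "When was the policy updated?",
    ["The travel policy was last updated in March 2024."],  # hypothetical chunk
)
```

The resulting string is what actually gets sent to the LLM; the instruction to answer "using only the context" is what grounds the response.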

The pipeline involves four stages: (1) indexing: encoding documents as embeddings and storing them in a vector database (Pinecone, Weaviate, Chroma); (2) retrieval: encoding the query and finding its nearest neighbours among the stored embeddings; (3) augmentation: inserting the retrieved chunks into the prompt; (4) generation: the LLM produces a response citing the retrieved context.
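The four stages can be sketched end to end. This is a toy, self-contained version: a hashed bag-of-words vector stands in for a real embedding model, an in-memory list stands in for the vector database, and the generation call is omitted; the documents and query are hypothetical.

```python
import math

def embed(text, dims=64):
    """Toy embedding: hash tokens into a fixed-size count vector.
    A real system would call a learned embedding model instead."""
    vec = [0.0] * dims
    for tok in text.lower().split():
        vec[hash(tok.strip(".,?!")) % dims] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# (1) Indexing: embed document chunks and store them (in-memory stand-in
# for a vector DB such as Pinecone, Weaviate, or Chroma).
docs = [
    "The travel policy was last updated in March 2024.",
    "Expense reports are due by the fifth of each month.",
]
index = [(embed(d), d) for d in docs]

# (2) Retrieval: embed the query and find its nearest neighbour.
query = "When was the travel policy updated?"
qvec = embed(query)
top = max(index, key=lambda item: cosine(qvec, item[0]))[1]

# (3) Augmentation: insert the retrieved chunk into the prompt.
prompt = f"Context: {top}\n\nQuestion: {query}\nAnswer:"

# (4) Generation: the prompt would now be sent to an LLM (omitted here).
```

In production the toy `embed` is replaced by a model such as a sentence-transformer, and nearest-neighbour search is delegated to the vector database's approximate index rather than a linear scan.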

RAG significantly reduces hallucination on factual tasks and enables LLMs to work with up-to-date or proprietary data without retraining. It is now the standard architecture for enterprise AI assistants, customer service bots, and research tools.

Examples

  • Perplexity AI (search + generate)
  • GitHub Copilot (code context)
  • Enterprise chatbots on internal docs