Retrieval-Augmented Generation (RAG)

Large Language Models (LLMs) are powerful, but they have a fundamental limitation: they can only generate answers based on the data they were trained on. They do not have access to private, domain-specific, or up-to-date information unless it is explicitly provided to them.

This is where Retrieval-Augmented Generation (RAG) comes in.

RAG is an architectural pattern that combines:

  • Information retrieval (searching external knowledge sources)
  • Text generation (using an LLM to produce answers)

Instead of asking an LLM to answer a question in isolation, RAG allows us to:

  1. Retrieve relevant information from an external knowledge base
  2. Inject that information into the model’s prompt
  3. Generate answers that are grounded in real data
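These three steps can be sketched end to end with a toy in-memory corpus. Word-overlap scoring stands in for real embedding search, and the final LLM call is left as a comment; everything here (the corpus, `retrieve`, `build_prompt`) is illustrative, not a real RAG library API:

```python
# Toy sketch of the retrieve -> inject -> generate loop.
# Word overlap stands in for embedding similarity; a real system
# would use dense vectors and an LLM for the final step.

corpus = [
    "FAISS is a library for efficient similarity search over dense vectors.",
    "LangChain is a framework for building LLM applications.",
    "The Eiffel Tower is located in Paris, France.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by the number of words they share with the query."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Inject the retrieved passages into the model's prompt."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"

query = "Where is the Eiffel Tower?"
prompt = build_prompt(query, retrieve(query, corpus))
# The prompt now contains the grounding passage about Paris;
# an LLM completing it can answer from that context.
```

The rest of this post replaces each toy piece with a production counterpart: embeddings instead of word overlap, FAISS instead of a Python list, and a real LLM for generation.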

This approach significantly reduces hallucinations and makes LLMs usable for real-world applications such as:

  • Question answering over internal documents
  • Knowledge assistants
  • Search and discovery systems
  • Enterprise analytics and decision support
  • Automated customer support

RAG Pipeline Architecture

I will take the RAG architecture below and walk through a concrete implementation using a Wikipedia corpus as our knowledge base, FAISS for vector search, and LangChain to orchestrate the retrieval and generation steps.

Rather than treating RAG as a black box, we break it down into clear, reproducible steps:

  • How raw text is ingested and prepared
  • Why and where text splitting happens
  • How embeddings are created and stored
  • How retrieval works at query time
  • How retrieved context is combined with an LLM to generate answers

  1. Data Collection and Ingestion

    The first step is to collect and ingest raw data. Our data source is a Wikipedia corpus of articles on a wide range of topics, loaded from Hugging Face Datasets.

    Raw Wikipedia articles contain a lot of extra information that is not relevant to our use case. I use regular expressions and spaCy to clean the text. The entire notebook with this implementation can be found here.

    Next, I load the cleaned and saved data using Hugging Face Datasets:

    from datasets import load_dataset
    
    wiki_corpus_paths = "data/output_corpus/wikipedia_processed_001.txt.gz"
    dataset = load_dataset(
        "text",
        data_files={"train": wiki_corpus_paths},
        split="train"
    )
    
    texts = dataset["text"]
    print(f"Loaded {len(texts)} Wikipedia text entries")
    

At this stage, the data is still raw text and not yet usable for retrieval.

  2. Converting Text into Documents

    LangChain operates on Document objects, which wrap text and optional metadata.

    We convert each Wikipedia entry into a LangChain Document:

    from langchain.schema import Document
    
    documents = [
        Document(
            page_content=text,
            metadata={
                "source": "wikipedia",
                "row_id": i
            }
        )
        for i, text in enumerate(texts)
    ]
    

    This step prepares the data for text splitting, embedding, and retrieval.
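    Conceptually, a Document is nothing more than text plus a metadata dictionary; a minimal stand-in (a hypothetical `Doc` class, not LangChain's) makes the structure explicit:

```python
# Minimal stand-in for LangChain's Document: text plus metadata.
from dataclasses import dataclass, field

@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)

entries = ["Alan Turing was a pioneer of computer science.",
           "FAISS indexes dense vectors for similarity search."]
docs = [Doc(page_content=t, metadata={"source": "wikipedia", "row_id": i})
        for i, t in enumerate(entries)]
```

    The metadata travels with each chunk through splitting and retrieval, which is what lets you trace an answer back to its source row.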

  3. Text Splitting (Chunking)

    Large documents are inefficient for retrieval. Instead, RAG works best when documents are split into smaller, semantically meaningful chunks.

    We use LangChain’s RecursiveCharacterTextSplitter to split documents into chunks of 512 characters with an overlap of 50 characters.

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=50
    )
    
    split_docs = text_splitter.split_documents(documents)
    print(f"Split into {len(split_docs)} chunks")
    

    Why this matters:

    • Smaller chunks improve embedding quality
    • Retrieval becomes more precise
    • The LLM receives focused context instead of entire articles
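    The splitter's core behaviour can be approximated with a plain sliding window (this simplified `chunk_text` is illustrative; the real RecursiveCharacterTextSplitter also prefers to break on paragraph, sentence, and word boundaries):

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Fixed-size sliding window: each chunk starts
    chunk_size - chunk_overlap characters after the previous one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = chunk_text("x" * 1000)
# Three chunks covering 0-512, 462-974, and 924-1000; consecutive chunks
# share a 50-character overlap that preserves context across boundaries.
```

    The overlap is what prevents a sentence that straddles a chunk boundary from being lost to retrieval.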
  4. Vectorization (Embedding the Chunks)

    Each chunk is converted into a numerical vector (an embedding) using a SentenceTransformer model.

    These embeddings capture the semantic meaning of the text and allow similarity search. Because the FAISS vector store in the next step expects a LangChain embedding object, we wrap the model with HuggingFaceEmbeddings:

    from langchain_community.embeddings import HuggingFaceEmbeddings
    
    model_name = "all-MiniLM-L6-v2"
    embedding = HuggingFaceEmbeddings(model_name=model_name)
    

    At this point, text has been transformed into vectors that can be indexed.
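    Similarity between embeddings is typically measured with cosine similarity. The toy 3-dimensional vectors below are made up for illustration (all-MiniLM-L6-v2 actually produces 384-dimensional vectors), but the arithmetic is the same:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Made-up vectors: "cat" and "kitten" point in similar directions.
cat, kitten, car = [0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]
# cosine_similarity(cat, kitten) is close to 1; cosine_similarity(cat, car) is near 0.
```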

  5. Vector Storage with FAISS

    FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors.

    We store the embeddings in a FAISS index, which supports fast nearest-neighbor similarity search.

    from langchain_community.vectorstores import FAISS
    
    # Build the index from the chunked documents using the embedding model above
    vectorstore = FAISS.from_documents(
        split_docs,
        embedding
    )
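    Under the hood, a search against the index is a nearest-neighbour lookup. A brute-force version in plain Python (roughly what FAISS's exact IndexFlatL2 computes, minus the heavy optimisation):

```python
import math

def l2(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(index: list[list[float]], query: list[float], k: int = 2) -> list[int]:
    """Return the indices of the k vectors closest to the query."""
    return sorted(range(len(index)), key=lambda i: l2(index[i], query))[:k]

index = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
search(index, [0.9, 1.1], k=2)  # -> [1, 0]
```

    FAISS makes the same computation scale to millions of vectors.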
    
  6. Retrieval at Query Time

    Once our Wikipedia text has been cleaned, chunked, embedded, and stored in FAISS, the next step is retrieval. LangChain provides a simple interface for creating a retriever.

    # Expose the vector store as a retriever that returns the top-k chunks
    retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
    

    A retriever’s job is simple: given a query, it returns the most relevant documents from the vector store.

    In the above example, we create a retriever that returns the 3 most similar chunks for a given query.

  7. Generation with Context

    The next step is to define how the model should use the retrieved context.

    We do this using a prompt template.

    system_prompt = """You are a highly intelligent question answering bot.
    Use the following pieces of context to answer the question at the end.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    
    {context}
    """
    

    Key ideas here:

    • The model is explicitly told to use the retrieved context
    • {context} is where LangChain will automatically inject the retrieved Wikipedia chunks
    • The instruction to not make up answers reduces hallucinations
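    The injection itself is plain string templating; the same substitution can be reproduced with Python's str.format (the sample chunks below are made up for illustration):

```python
system_prompt = """You are a highly intelligent question answering bot.
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}
"""

# Made-up retrieved chunks standing in for real Wikipedia passages.
retrieved_chunks = [
    "Marie Curie won Nobel Prizes in both physics and chemistry.",
    "She was born in Warsaw in 1867.",
]

# LangChain performs an equivalent substitution when the chain runs.
filled = system_prompt.format(context="\n\n".join(retrieved_chunks))
```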
  8. Creating a Chat Prompt Template

    Next, we wrap the system instructions into a structured chat prompt:

    from langchain_core.prompts import ChatPromptTemplate
    
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        ("human", "{input}")
    ])
    

    This sets up a conversation where:

    • The system message defines the rules
    • The human message is the user’s question
    • {input} will later be replaced with the actual query

    This makes the prompt reusable for any question.
  9. Combining Retrieved Documents into a Single Context

    When multiple documents are retrieved, they need to be combined before being sent to the LLM.

    That’s what create_stuff_documents_chain does.

    from langchain.chains.combine_documents import create_stuff_documents_chain
    
    # llm is any LangChain chat model, e.g. ChatOpenAI
    document_chain = create_stuff_documents_chain(
        llm,
        prompt=prompt
    )
    

    This chain:

    • Takes all retrieved Wikipedia chunks
    • “Stuffs” them into the {context} section of the prompt
    • Sends everything to the language model in one request
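    Conceptually, the "stuff" strategy is just concatenation with separators, as this sketch shows (map-reduce and refine are the usual alternatives when the retrieved chunks would not fit in the context window):

```python
def stuff_documents(docs: list[str], separator: str = "\n\n") -> str:
    """'Stuff' strategy: concatenate every retrieved chunk into one
    context string sent to the model in a single request."""
    return separator.join(docs)

chunks = ["First retrieved chunk.", "Second retrieved chunk.", "Third retrieved chunk."]
context = stuff_documents(chunks)
# context now holds all three chunks, ready for the {context} slot.
```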
  10. Creating the Retrieval-Augmented QA Chain

    Finally, we connect retrieval and generation into one pipeline.

    from langchain.chains import create_retrieval_chain
    
    qa = create_retrieval_chain(
        retriever=retriever,
        combine_docs_chain=document_chain
    )
    

    This single chain now does the full RAG loop:

    • User asks a question
    • Relevant Wikipedia chunks are retrieved from FAISS
    • Retrieved text is injected into the prompt
    • The LLM generates an answer grounded in that context

    At this point, you have a fully working RAG system.

In a follow-up post, I will look into better strategies for chunking and searching data to improve the performance of the RAG system.

Related Projects & Learning