Retrieval-Augmented Generation (RAG)

Large Language Models (LLMs) are powerful, but they have a fundamental limitation: they can only generate answers based on the data they were trained on. They do not have access to private, domain-specific, or up-to-date information unless it is explicitly provided to them.

This is where Retrieval-Augmented Generation (RAG) comes in.

RAG is an architectural pattern that combines:

  • Information retrieval (searching external knowledge sources)
  • Text generation (using an LLM to produce answers)

Instead of asking an LLM to answer a question in isolation, RAG allows us to:

  1. Retrieve relevant information from an external knowledge base
  2. Inject that information into the model’s prompt
  3. Generate answers that are grounded in real data
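These three steps can be sketched end to end with a toy in-memory corpus. Word-overlap scoring stands in for real embedding search, and the final LLM call is left as a comment; everything here (the corpus, `retrieve`, `build_prompt`) is illustrative, not a real RAG library API:

```python
# Toy sketch of the retrieve -> inject -> generate loop.
# Word overlap stands in for embedding similarity; a real system
# would use dense vectors and an LLM for the final step.

corpus = [
    "FAISS is a library for efficient similarity search over dense vectors.",
    "LangChain is a framework for building LLM applications.",
    "The Eiffel Tower is located in Paris, France.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by the number of words they share with the query."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Inject the retrieved passages into the model's prompt."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"

query = "Where is the Eiffel Tower?"
prompt = build_prompt(query, retrieve(query, corpus))
# The prompt now contains the grounding passage about Paris;
# an LLM completing it can answer from that context.
```

The rest of this post replaces each toy piece with a production counterpart: embeddings instead of word overlap, FAISS instead of a Python list, and a real LLM for generation.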

This approach significantly reduces hallucinations and makes LLMs usable for real-world applications such as:

  • Question answering over internal documents
  • Knowledge assistants
  • Search and discovery systems
  • Enterprise analytics and decision support
  • Automated customer support

RAG Pipeline Architecture

I will take the RAG architecture below and walk through a concrete implementation using a Wikipedia corpus as our knowledge base, FAISS for vector search, and LangChain to orchestrate the retrieval and generation steps.

Rather than treating RAG as a black box, we break it down into clear, reproducible steps:

  • How raw text is ingested and prepared
  • Why and where text splitting happens
  • How embeddings are created and stored
  • How retrieval works at query time
  • How retrieved context is combined with an LLM to generate answers

  1. Data Collection and Ingestion

    The first step is to collect and ingest raw data. Our data source is a Wikipedia corpus of articles on a wide range of topics, loaded from Hugging Face Datasets.

    Raw Wikipedia articles contain a lot of extra information that is not relevant to our use case. I use regular expressions and spaCy to clean the text. The entire notebook with this implementation can be found here.

    Next, I load the cleaned and saved data using Hugging Face Datasets:

    from datasets import load_dataset
    
    wiki_corpus_paths = "data/output_corpus/wikipedia_processed_001.txt.gz"
    dataset = load_dataset(
        "text",
        data_files={"train": wiki_corpus_paths},
        split="train"
    )
    
    texts = dataset["text"]
    print(f"Loaded {len(texts)} Wikipedia text entries")
    

At this stage, the data is still raw text and not yet usable for retrieval.

  2. Converting Text into Documents

    LangChain operates on Document objects, which wrap text and optional metadata.

    We convert each Wikipedia entry into a LangChain Document:

    from langchain.schema import Document
    
    documents = [
        Document(
            page_content=text,
            metadata={
                "source": "wikipedia",
                "row_id": i
            }
        )
        for i, text in enumerate(texts)
    ]
    

    This step prepares the data for text splitting, embedding, and retrieval.
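    Conceptually, a Document is nothing more than text plus a metadata dictionary; a minimal stand-in (a hypothetical `Doc` class, not LangChain's) makes the structure explicit:

```python
# Minimal stand-in for LangChain's Document: text plus metadata.
from dataclasses import dataclass, field

@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)

entries = ["Alan Turing was a pioneer of computer science.",
           "FAISS indexes dense vectors for similarity search."]
docs = [Doc(page_content=t, metadata={"source": "wikipedia", "row_id": i})
        for i, t in enumerate(entries)]
```

    The metadata travels with each chunk through splitting and retrieval, which is what lets you trace an answer back to its source row.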

  3. Text Splitting (Chunking)

    Large documents are inefficient for retrieval. Instead, RAG works best when documents are split into smaller, semantically meaningful chunks.

    We use LangChain’s RecursiveCharacterTextSplitter to split documents into chunks of 512 characters with an overlap of 50 characters.

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=50
    )
    
    split_docs = text_splitter.split_documents(documents)
    print(f"Split into {len(split_docs)} chunks")
    

    Why this matters:

    • Smaller chunks improve embedding quality
    • Retrieval becomes more precise
    • The LLM receives focused context instead of entire articles
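    The splitter's core behaviour can be approximated with a plain sliding window (this simplified `chunk_text` is illustrative; the real RecursiveCharacterTextSplitter also prefers to break on paragraph, sentence, and word boundaries):

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Fixed-size sliding window: each chunk starts
    chunk_size - chunk_overlap characters after the previous one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = chunk_text("x" * 1000)
# Three chunks covering 0-512, 462-974, and 924-1000; consecutive chunks
# share a 50-character overlap that preserves context across boundaries.
```

    The overlap is what prevents a sentence that straddles a chunk boundary from being lost to retrieval.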
  4. Vectorization (Embedding the Chunks)

    Each chunk is converted into a numerical vector (an embedding) using a SentenceTransformer model.

    These embeddings capture the semantic meaning of the text and allow similarity search. Because the FAISS vector store in the next step expects a LangChain embedding object, we wrap the model with HuggingFaceEmbeddings:

    from langchain_community.embeddings import HuggingFaceEmbeddings
    
    model_name = "all-MiniLM-L6-v2"
    embedding = HuggingFaceEmbeddings(model_name=model_name)
    

    At this point, text has been transformed into vectors that can be indexed.
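    Similarity between embeddings is typically measured with cosine similarity. The toy 3-dimensional vectors below are made up for illustration (all-MiniLM-L6-v2 actually produces 384-dimensional vectors), but the arithmetic is the same:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Made-up vectors: "cat" and "kitten" point in similar directions.
cat, kitten, car = [0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]
# cosine_similarity(cat, kitten) is close to 1; cosine_similarity(cat, car) is near 0.
```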

  5. Vector Storage with FAISS

    FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors.

    We store the embeddings in a FAISS index, which supports fast nearest-neighbor similarity search.

    from langchain_community.vectorstores import FAISS
    
    # Build the index from the chunked documents using the embedding model above
    vectorstore = FAISS.from_documents(
        split_docs,
        embedding
    )
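    Under the hood, a search against the index is a nearest-neighbour lookup. A brute-force version in plain Python (roughly what FAISS's exact IndexFlatL2 computes, minus the heavy optimisation):

```python
import math

def l2(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(index: list[list[float]], query: list[float], k: int = 2) -> list[int]:
    """Return the indices of the k vectors closest to the query."""
    return sorted(range(len(index)), key=lambda i: l2(index[i], query))[:k]

index = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
search(index, [0.9, 1.1], k=2)  # -> [1, 0]
```

    FAISS makes the same computation scale to millions of vectors.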
    
  6. Retrieval at Query Time

    Once our Wikipedia text has been cleaned, chunked, embedded, and stored in FAISS, the next step is retrieval. LangChain provides a simple interface for creating a retriever.

    # Expose the vector store as a retriever that returns the top-k chunks
    retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
    

    A retriever’s job is simple: given a query, it returns the most relevant documents from the vector store.

    In the above example, we create a retriever that returns the 3 most similar chunks for a given query.

  7. Generation with Context

    The next step is to define how the model should use the retrieved context.

    We do this using a prompt template.

    system_prompt = """You are a highly intelligent question answering bot.
    Use the following pieces of context to answer the question at the end.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    
    {context}
    """
    

    Key ideas here:

    • The model is explicitly told to use the retrieved context
    • {context} is where LangChain will automatically inject the retrieved Wikipedia chunks
    • The instruction to not make up answers reduces hallucinations
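    The injection itself is plain string templating; the same substitution can be reproduced with Python's str.format (the sample chunks below are made up for illustration):

```python
system_prompt = """You are a highly intelligent question answering bot.
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}
"""

# Made-up retrieved chunks standing in for real Wikipedia passages.
retrieved_chunks = [
    "Marie Curie won Nobel Prizes in both physics and chemistry.",
    "She was born in Warsaw in 1867.",
]

# LangChain performs an equivalent substitution when the chain runs.
filled = system_prompt.format(context="\n\n".join(retrieved_chunks))
```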
  8. Creating a Chat Prompt Template

    Next, we wrap the system instructions into a structured chat prompt:

    from langchain_core.prompts import ChatPromptTemplate
    
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        ("human", "{input}")
    ])
    

    This sets up a conversation where:

    • The system message defines the rules
    • The human message is the user’s question
    • {input} will later be replaced with the actual query

    This makes the prompt reusable for any question.
  9. Combining Retrieved Documents into a Single Context

    When multiple documents are retrieved, they need to be combined before being sent to the LLM.

    That’s what create_stuff_documents_chain does.

    from langchain.chains.combine_documents import create_stuff_documents_chain
    
    # llm is any LangChain chat model, e.g. ChatOpenAI
    document_chain = create_stuff_documents_chain(
        llm,
        prompt=prompt
    )
    

    This chain:

    • Takes all retrieved Wikipedia chunks
    • “Stuffs” them into the {context} section of the prompt
    • Sends everything to the language model in one request
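    Conceptually, the "stuff" strategy is just concatenation with separators, as this sketch shows (map-reduce and refine are the usual alternatives when the retrieved chunks would not fit in the context window):

```python
def stuff_documents(docs: list[str], separator: str = "\n\n") -> str:
    """'Stuff' strategy: concatenate every retrieved chunk into one
    context string sent to the model in a single request."""
    return separator.join(docs)

chunks = ["First retrieved chunk.", "Second retrieved chunk.", "Third retrieved chunk."]
context = stuff_documents(chunks)
# context now holds all three chunks, ready for the {context} slot.
```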
  10. Creating the Retrieval-Augmented QA Chain

    Finally, we connect retrieval and generation into one pipeline.

    from langchain.chains import create_retrieval_chain
    
    qa = create_retrieval_chain(
        retriever=retriever,
        combine_docs_chain=document_chain
    )
    

    This single chain now does the full RAG loop:

    • User asks a question
    • Relevant Wikipedia chunks are retrieved from FAISS
    • Retrieved text is injected into the prompt
    • The LLM generates an answer grounded in that context

    At this point, you have a fully working RAG system.

In a follow-up post, I will look into better strategies for chunking and searching data to improve the performance of the RAG system.

Related Projects & Learning