jaffar.dev

How to Build a Local RAG Pipeline Using llama.cpp in Python

April 19, 2025 | by Jaffar Ali Mohamedkasim


In the era of powerful large language models (LLMs), Retrieval-Augmented Generation (RAG) has emerged as a smart way to combine document retrieval with generative AI. Imagine an assistant that can reason like ChatGPT but has access to your own documents and, better yet, runs locally on your machine for full control and privacy.

In this post, I’ll show you how to build a simple yet powerful RAG pipeline using Python, llama.cpp, and a few modern open-source tools.


Why RAG?

On their own, LLMs know nothing beyond their training data. RAG enhances them by giving them access to external knowledge sources like documents, PDFs, or wikis. When a user asks a question, the system first retrieves the most relevant document chunks and then feeds that context to the language model to generate an informed answer.
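
The retrieve-first, generate-second flow can be sketched without any libraries. This is a toy illustration only, not the pipeline built later in this post: it ranks documents by simple word overlap instead of embeddings, and the names `tokenize`, `retrieve`, and `docs` are made up for the example.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(question, docs, k=2):
    """Return the k documents that share the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]


# Hypothetical mini knowledge base.
docs = [
    "The warranty covers parts and labor for two years.",
    "Our office is closed on public holidays.",
    "Warranty claims require the original receipt.",
]

question = "What does the warranty cover?"
context = retrieve(question, docs, k=2)

# In a real pipeline, this context is placed in the prompt sent to the LLM.
for chunk in context:
    print(chunk)
```

Only the two warranty-related sentences are selected, so the model would never see the unrelated line about office hours.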


What You’ll Need

Here’s a breakdown of the core components in our setup:

  • llama-cpp-python: Python bindings for llama.cpp to run LLMs locally on CPU/GPU.
  • LangChain: A framework to create language model pipelines.
  • ChromaDB: A lightweight vector database for storing and retrieving embeddings.
  • HuggingFace Embeddings: To convert text chunks into vector form for semantic search.

Install everything with:

pip install llama-cpp-python langchain langchain-community chromadb sentence-transformers pypdf

Step 1: Load and Chunk Your Documents

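Start by loading documents using LangChain’s PyPDFLoader or TextLoader. Then split them into smaller chunks so that the retrieved pieces fit within the model’s context window. Note that chunk_size and chunk_overlap are measured in characters by default, not tokens.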

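# Requires the langchain-community and pypdf packages.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# PyPDFLoader returns one Document per PDF page.
loader = PyPDFLoader("your_file.pdf")
documents = loader.load()

# Split into chunks of up to 500 characters, with up to 50 characters of overlap.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)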
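
To see what the overlap setting does, here is a simplified sketch using fixed-size character windows. It is not how RecursiveCharacterTextSplitter works internally (that splitter first tries to break on separators such as paragraphs and newlines); the function `chunk_text` is a hypothetical helper written only for this illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each chunk repeats the last three characters of the one before it. That repetition is what keeps a sentence that straddles a chunk boundary from being lost to retrieval.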

Step 2: Generate Embeddings and Store Them

Use Hugging Face’s sentence transformers to generate vector embeddings and store them in ChromaDB.

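# Requires the langchain-community, chromadb, and sentence-transformers packages.
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

# With no arguments, this loads a default sentence-transformers model.
embedding_model = HuggingFaceEmbeddings()
# Embed every chunk and persist the index to the "db" directory.
vector_store = Chroma.from_documents(chunks, embedding_model, persist_directory="db")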
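
Semantic search works by comparing the query’s vector with the stored chunk vectors. The sketch below uses cosine similarity, one common measure; the metric Chroma actually applies depends on how the collection is configured. The three-dimensional vectors and chunk names are invented for the example, and real embedding models produce hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for three chunks.
chunk_vectors = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "return window": [0.8, 0.2, 0.1],
}
query_vector = [1.0, 0.0, 0.0]

# Rank chunk names from most to least similar to the query.
ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
    reverse=True,
)
print(ranked)
```

The chunks whose vectors point in nearly the same direction as the query come first, which is the ordering a retriever uses to pick its top results.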

Step 3: Load a Local LLM with llama.cpp

Download a compatible GGUF model from Hugging Face (TheBloke’s repositories there host many quantized models). Then load it using:

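# Requires the langchain-community and llama-cpp-python packages.
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./models/llama.gguf",  # path to your downloaded GGUF file
    n_ctx=2048,        # context window size, in tokens
    temperature=0.7,   # sampling randomness
    top_p=0.9,         # nucleus sampling cutoff
    verbose=True       # print llama.cpp loading and timing logs
)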
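
Since n_ctx caps the prompt and the answer combined, it can help to sanity-check that the retrieved chunks will fit. The following is a back-of-the-envelope sketch, not an exact calculation: the four-characters-per-token ratio is a rough rule of thumb for English text, and `k=4`, `question_chars=200`, and `answer_tokens=256` are assumptions chosen for the example. A real count requires the model’s tokenizer.

```python
def estimate_tokens(num_chars, chars_per_token=4):
    """Very rough token estimate; the real count depends on the tokenizer."""
    return -(-num_chars // chars_per_token)  # ceiling division


def fits_in_context(n_ctx, chunk_size, k, question_chars, answer_tokens):
    """Check whether k chunks plus the question and the answer fit in n_ctx."""
    prompt_tokens = estimate_tokens(chunk_size * k + question_chars)
    return prompt_tokens + answer_tokens <= n_ctx


# n_ctx=2048 and chunk_size=500 come from this post; the rest are assumptions.
small_chunks_fit = fits_in_context(2048, 500, 4, 200, 256)
large_chunks_fit = fits_in_context(2048, 2000, 4, 200, 256)
print(small_chunks_fit, large_chunks_fit)
```

Under these assumptions, four 500-character chunks fit comfortably, while four 2,000-character chunks would overflow a 2,048-token window.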

Step 4: Create the RAG Pipeline

Now, combine the retriever and the LLM into a RetrievalQA chain:

from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # put all retrieved chunks into a single prompt
    retriever=vector_store.as_retriever()  # similarity search over the Chroma index
)
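
The "stuff" chain type places every retrieved chunk into one prompt. Here is a minimal sketch of that idea in plain Python. The template wording and the function name `stuff_prompt` are hypothetical and are not LangChain’s actual default prompt.

```python
def stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one prompt (the "stuff" idea)."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Stand-ins for the chunks a retriever would return.
retrieved = [
    "The warranty covers parts and labor for two years.",
    "Warranty claims require the original receipt.",
]
prompt = stuff_prompt("How long is the warranty?", retrieved)
print(prompt)
```

This approach is simple and needs only one model call, but the whole prompt must fit in the context window. LangChain also offers other chain types, such as "map_reduce" and "refine", for cases where the retrieved text is too large to send at once.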

Ask a question and get an answer:

# invoke() returns a dict; the generated answer is stored under "result".
response = qa.invoke({"query": "What are the key takeaways from the document?"})
print(response["result"])

Final Thoughts

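This pipeline gives you the power of RAG without relying on external APIs. Because everything runs on your own machine, your documents never leave it, and you can swap out the model, the embeddings, or the vector store as needed. Response speed depends on your hardware and the size of the model. It’s a good fit for personal assistants, document summarizers, or even enterprise search tools.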

Want to go further? Add a front end with Gradio, expose the pipeline as an API with FastAPI, or index multiple documents and experiment with smarter chunking strategies.

Got questions or improvements? Drop them in the comments!
