How to Build a Local RAG Pipeline Using llama.cpp in Python
April 19, 2025 | by Jaffar Ali Mohamedkasim

In the era of powerful large language models (LLMs), Retrieval-Augmented Generation (RAG) has emerged as a smart way to combine real-time document retrieval with generative AI. Imagine having an assistant that can reason like ChatGPT, but with access to your own documents—and better yet, runs locally on your machine for full control and privacy.
In this post, I’ll show you how to build a simple yet powerful RAG pipeline using Python, llama.cpp, and a few modern open-source tools.
Why RAG?
On its own, an LLM knows only what was in its training data. RAG extends it with external knowledge sources such as documents, PDFs, or wikis. When a user asks a question, the system first retrieves the most relevant passages and then passes them to the language model as context, so the answer is grounded in your own material.
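To make the retrieve-then-generate flow concrete, here is a toy sketch in plain Python. Everything in it is hypothetical: the `score`, `retrieve`, and `fake_llm` helpers are stand-ins I made up for illustration, word overlap replaces real embeddings, and no model is actually called.

```python
# Toy illustration of the two RAG stages: retrieve first, then generate.
# Word overlap stands in for semantic search; fake_llm stands in for a model.

def score(question: str, passage: str) -> int:
    """Count how many distinct question words also appear in the passage."""
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words)


def retrieve(question: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k passages that share the most words with the question."""
    ranked = sorted(passages, key=lambda p: score(question, p), reverse=True)
    return ranked[:k]


def fake_llm(prompt: str) -> str:
    """Placeholder for a real model call; it only reports the prompt size."""
    return f"(a model would answer from a {len(prompt)}-character prompt)"


passages = [
    "The office wifi password is rotated every month.",
    "Expense reports are due on the fifth day of each month.",
    "The cafeteria opens at eight in the morning.",
]
question = "When are expense reports due?"

# Stage 1: pick the most relevant passage.
context = retrieve(question, passages, k=1)

# Stage 2: hand the passage to the model together with the question.
prompt = f"Context:\n{context[0]}\n\nQuestion: {question}\nAnswer:"
print(fake_llm(prompt))
```

The rest of this post swaps each toy piece for a real one: embeddings instead of word overlap, a vector database instead of a list, and a local model instead of `fake_llm`.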
What You’ll Need
Here’s a breakdown of the core components in our setup:
- llama-cpp-python: Python bindings for llama.cpp to run LLMs locally on CPU/GPU.
- LangChain: A framework to create language model pipelines.
- ChromaDB: A lightweight vector database for storing and retrieving embeddings.
- HuggingFace Embeddings: To convert text chunks into vector form for semantic search.
Install everything with:
pip install llama-cpp-python langchain langchain-community chromadb sentence-transformers pypdf
Step 1: Load and Chunk Your Documents
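Start by loading documents using LangChain’s PyPDFLoader or TextLoader. Then split them into smaller chunks so that several retrieved chunks fit within the model’s context window. Note that chunk_size and chunk_overlap are measured in characters by default, not tokens.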
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load the PDF; each page becomes one Document
loader = PyPDFLoader("your_file.pdf")
documents = loader.load()
# Split into chunks of up to 500 characters, overlapping by 50
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)
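If the two splitter parameters feel abstract, the following sketch shows what a size and an overlap mean on a tiny string. It is a simplified stand-in, not how RecursiveCharacterTextSplitter works internally: the real splitter prefers to break on separators such as paragraphs and sentences, while the hypothetical `split_with_overlap` helper below just slides a fixed window.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that overlap their neighbours."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk repeats the last three characters of the previous one. That overlap is what keeps a sentence that straddles a boundary from being lost to both chunks.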
Step 2: Generate Embeddings and Store Them
Use Hugging Face’s sentence transformers to generate vector embeddings and store them in ChromaDB.
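from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
# With no arguments, the wrapper loads its default sentence-transformers model
embedding_model = HuggingFaceEmbeddings()
# Embed every chunk and persist the index in the local "db" directory
vector_store = Chroma.from_documents(chunks, embedding_model, persist_directory="db")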
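Under the hood, semantic search boils down to comparing a query vector against the stored chunk vectors and keeping the closest ones. Here is a minimal sketch of that idea in plain Python. The three-dimensional vectors and the `top_k` helper are made up for illustration, and I use cosine similarity as an assumption; the metric and index structure ChromaDB actually uses may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (similarity, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional "embeddings"; real models produce hundreds of dimensions.
toy_index = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("shipping times", [0.1, 0.9, 0.1]),
    ("office hours", [0.0, 0.2, 0.9]),
]
query_vector = [0.8, 0.2, 0.0]

results = top_k(query_vector, toy_index, k=2)
for similarity, text in results:
    print(f"{similarity:.3f}  {text}")
```

The query vector points in almost the same direction as the "refund policy" vector, so that entry ranks first. A vector database does the same kind of ranking, only with an index that avoids comparing against every stored vector.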
Step 3: Load a Local LLM with llama.cpp
Download a compatible GGUF model from Hugging Face (TheBloke’s repositories there are a common source of quantized models). Then load it using:
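from langchain_community.llms import LlamaCpp
llm = LlamaCpp(
    model_path="./models/llama.gguf",  # path to your downloaded GGUF file
    n_ctx=2048,        # context window in tokens (prompt plus generated answer)
    temperature=0.7,   # sampling randomness
    top_p=0.9,         # nucleus sampling cutoff
    verbose=True,
)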
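Because n_ctx has to hold the retrieved chunks, the question, and the generated answer, it is worth doing a rough budget check before wiring things together. The sketch below is only an estimate under loud assumptions: roughly 4 characters per token for English text, 256 tokens reserved for the answer, and 100 tokens of prompt overhead. All three numbers and the helper names are hypothetical; for exact counts you would use the model’s own tokenizer.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb for English text (an assumption)


def estimate_tokens(text_length_chars: int) -> int:
    """Estimate the token count of a text from its character length, rounding up."""
    return -(-text_length_chars // CHARS_PER_TOKEN)


def max_chunks_that_fit(
    n_ctx: int,
    chunk_size_chars: int,
    reserved_for_answer: int,
    prompt_overhead_tokens: int,
) -> int:
    """How many full-size chunks fit in the context alongside question and answer."""
    available = n_ctx - reserved_for_answer - prompt_overhead_tokens
    if available <= 0:
        return 0
    return available // estimate_tokens(chunk_size_chars)


budget = max_chunks_that_fit(
    n_ctx=2048,                  # same value as in the LlamaCpp call above
    chunk_size_chars=500,        # same chunk_size as in Step 1
    reserved_for_answer=256,     # hypothetical room left for the reply
    prompt_overhead_tokens=100,  # hypothetical instructions plus question
)
print(budget)  # 13
```

With these settings there is comfortable headroom for a handful of retrieved chunks. If you raise chunk_size or retrieve many more chunks, rerun the estimate or increase n_ctx.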
Step 4: Create the RAG Pipeline
Now, combine the retriever and the LLM into a RetrievalQA chain:
from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # put all retrieved chunks into a single prompt
    retriever=vector_store.as_retriever(),
)
Ask a question and get an answer:
response = qa.invoke({"query": "What are the key takeaways from the document?"})
print(response["result"])
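To make the "stuff" chain type less opaque, here is a sketch of the kind of prompt it produces: all retrieved chunks are concatenated and placed in front of the question in one single model call. The `build_stuff_prompt` helper and the template wording are my own illustration, not LangChain’s actual prompt.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate all retrieved chunks and the question into one prompt."""
    context = "\n\n".join(chunks)  # blank line between chunks
    return (
        "Use the following pieces of context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful answer:"
    )


# Hypothetical chunks, as if they had come back from the retriever.
retrieved = [
    "The warranty covers manufacturing defects for two years.",
    "Accidental damage is not covered by the warranty.",
]
stuffed_prompt = build_stuff_prompt("Does the warranty cover drops?", retrieved)
print(stuffed_prompt)
```

This is simple and needs only one model call, but the whole prompt must fit inside n_ctx. That is why chunk size and the number of retrieved chunks matter so much in this setup.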
Final Thoughts
This pipeline gives you the power of RAG without relying on external APIs. Everything runs on your own machine, so your documents stay private and every component can be swapped or tuned. That makes it a good fit for personal assistants, document summarizers, or even enterprise search tools.
Want to go further? Add a front-end with Gradio or FastAPI, or integrate multiple documents and smarter chunking strategies.
Got questions or improvements? Drop them in the comments!