This guide walks you through building a Retrieval-Augmented Generation (RAG) system using Ollama for local LLMs and LangChain for orchestration. You'll learn the theory, see a step-by-step workflow, and get hands-on with code—including a Streamlit UI!
1. Introduction to RAG and Its Motivation
In the real world, we often need applications that can answer questions based on our own documents, not just what a language model was trained on. This is where RAG—Retrieval Augmented Generation—comes in.
- RAG lets you chat with your own data by combining search (retrieval) with generative AI.
- LLMs are limited by their training data and can "hallucinate" (make things up). RAG fixes this by injecting relevant documents into the prompt, grounding answers in your data.
- The RAG pipeline:
  - Indexing: Documents (PDFs, text, etc.) are split into chunks, embedded (turned into vectors), and stored in a vector database.
  - Retrieval: User queries are embedded and matched to the most relevant chunks.
  - Generation: The model receives the query plus the retrieved context and generates a grounded response.
Key components needed for RAG:
- A large language model (LLM)
- Your document corpus (knowledge base)
- Embedding models to turn text into vectors
- A vector database (ChromaDB, Pinecone, etc.)
- A retrieval mechanism (often built into the database)
- Orchestration tools like LangChain to tie it all together
LangChain is a popular Python framework that simplifies working with LLMs, document loaders, splitters, embedding models, and vector stores.
2. Vector Stores and Embeddings: The Foundation
Before building, let's review the core concepts:
- Document Loading: Bring in data from files, URLs, PDFs, databases, etc.
- Splitting: Break long documents into manageable chunks.
- Embedding: Convert each chunk into a vector (a list of numbers) that reflects its semantic meaning.
- Vector Store: A special database (like ChromaDB) that stores both the embedding vectors and original text, making it searchable by meaning.
How does retrieval work?
- The user's question is also embedded into a vector.
- The vector store finds the most similar vectors (chunks) by comparing numbers.
- Content that is similar in meaning will have similar vectors ("cat" and "kitty" are close together, "cat" and "run" are not); see the short sketch after this list.
- Use cases: semantic search, recommendations, classification.
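To make this concrete, here is a small sketch that embeds a few words with Ollama and compares them. It assumes Ollama is running locally with the `nomic-embed-text` model pulled and the `langchain-community` package installed; the helper function is just for illustration.

```python
from math import sqrt

from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


cat, kitty, run = (embeddings.embed_query(w) for w in ("cat", "kitty", "run"))

print("cat vs kitty:", cosine_similarity(cat, kitty))  # semantically close -> higher score
print("cat vs run:  ", cosine_similarity(cat, run))    # less related -> lower score
```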
The full embedding pipeline:
- Split documents into chunks.
- Create embeddings for each chunk.
- Store embeddings + original text in the vector database.
- When querying:
  - Embed the question.
  - Search for similar vectors.
  - Retrieve and pass the best matches to the LLM (see the sketch after this list).
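Once the vector store exists (we build one in section 4), query-time retrieval boils down to a similarity search. A minimal sketch, assuming a ChromaDB store named `vector_db`:

```python
# Embed the question with the same model and return the closest chunks.
question = "How do I file for beneficial ownership information?"
docs = vector_db.similarity_search(question, k=4)

for doc in docs:
    # Each result carries the original text plus metadata such as the source file.
    print(doc.metadata.get("source"), "->", doc.page_content[:80])
```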
3. High-Level Design: What We'll Build
Here's the plan for our simple RAG system:
- Data: Load a PDF file (e.g., "Beneficial Ownership Information Report").
- Splitting: Use LangChain's splitters to chunk the PDF.
- Embedding: Use Ollama's embedding model (e.g., `nomic-embed-text`) via a LangChain wrapper.
- Vector Store: Store embeddings in ChromaDB.
- Retrieval: When a user asks a question, use LangChain's `MultiQueryRetriever` to:
  - Rewrite the question in several ways for better retrieval.
  - Search the vector store for relevant chunks.
- LLM Generation: Pass the context and query to Ollama's Llama 3 model to generate an answer.
Key point: You can swap out embedding models and LLMs as you wish, thanks to modular design.
4. Step-by-Step: Ingesting and Preparing the Data
Project setup:
- Place your PDF file (e.g., `boi.pdf`) in the `data/` directory.
- Create a `requirements.txt` file with dependencies (LangChain, ChromaDB, Ollama, etc.).
- Install with:
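For example (the Ollama model names match the ones used later in this guide):

```bash
pip install -r requirements.txt

# Pull the local models used below (assumes Ollama is already installed)
ollama pull nomic-embed-text
ollama pull llama3
```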
Python workflow:
- Load the PDF:
  - Use LangChain's unstructured or online PDF loader.
- Preview the data:
  - Print the first 100 words to verify loading works.
- Split the text:
  - Use `RecursiveCharacterTextSplitter` (chunk size 1200, overlap 300).
  - More overlap = better context retention.
- Check chunks:
  - Print the number of chunks and a preview; LangChain adds metadata (e.g., source file names).
- Embedding and Vector Store:
  - Use Ollama's `nomic-embed-text` model (or another embedding model) via LangChain, and store the embeddings in ChromaDB:
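Putting those steps together, here is a minimal sketch of the ingestion side. The file path, collection name, and the specific LangChain packages are assumptions; adjust them to your setup.

```python
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

DOC_PATH = "data/boi.pdf"
EMBED_MODEL = "nomic-embed-text"

# 1. Load the PDF
data = UnstructuredPDFLoader(file_path=DOC_PATH).load()
print(data[0].page_content[:100])  # preview to verify loading worked

# 2. Split into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=300)
chunks = splitter.split_documents(data)
print(f"Split into {len(chunks)} chunks")

# 3. Embed the chunks and store them in ChromaDB
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model=EMBED_MODEL),
    collection_name="simple-rag",
)
```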
5. Retrieval and Generation Pipeline
Retrieval:
- Use LangChain's `MultiQueryRetriever` and a custom prompt template to rewrite the user's question in multiple ways, maximizing relevant document retrieval (sketched below).
- Retrieve the most relevant chunks from the vector database.
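A sketch of the retriever setup, reusing the `vector_db` built above and a local Llama 3 chat model (the rewrite prompt wording is illustrative):

```python
from langchain.prompts import PromptTemplate
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.chat_models import ChatOllama

llm = ChatOllama(model="llama3")

# Prompt that asks the LLM to rephrase the user's question several ways.
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template=(
        "You are an AI assistant. Generate five different versions of the "
        "following user question to help retrieve relevant documents from a "
        "vector database. Provide the alternatives separated by newlines.\n"
        "Original question: {question}"
    ),
)

retriever = MultiQueryRetriever.from_llm(
    retriever=vector_db.as_retriever(),
    llm=llm,
    prompt=QUERY_PROMPT,
)
```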
Prompting and Chaining:
- Create a prompt template for the LLM to answer based only on provided context.
- Use LangChain's chaining to tie together retrieval, prompt assembly, and LLM generation (see the sketch below).
- Parse outputs for clean presentation.
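And a minimal sketch of the chain itself, reusing `retriever` and `llm` from the previous step (the answer prompt is illustrative):

```python
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

template = """Answer the question based ONLY on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# Retrieve context -> fill the prompt -> generate with the LLM -> parse to a string.
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("How do I file for beneficial ownership information?"))
```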
Example queries:
- "What is the document about?"
- "What are the main points as a business owner I should be aware of?"
- "How do I file for beneficial ownership information?"
All processing is done locally—no API fees, no data leaves your computer!
6. Refactoring: Clean and Modular Code
To make your code maintainable and reusable:
- Split logic into functions:
- PDF ingestion
- Splitting
- Embedding and vector DB creation
- Retriever setup
- Chain creation
- Use configuration variables/constants at the top.
- The main routine simply instantiates components and runs your questions.
This makes it easy to swap files, models, or add new features later. A skeleton of this structure is sketched below.
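The function and constant names here are illustrative, and the bodies are the sketches from the earlier sections:

```python
DOC_PATH = "data/boi.pdf"
EMBED_MODEL = "nomic-embed-text"
LLM_MODEL = "llama3"


def ingest_pdf(doc_path):
    """Load the PDF and return LangChain documents."""
    ...


def split_documents(documents):
    """Split documents into overlapping chunks."""
    ...


def create_vector_db(chunks):
    """Embed the chunks and store them in ChromaDB."""
    ...


def create_retriever(vector_db, llm):
    """Build the multi-query retriever."""
    ...


def create_chain(retriever, llm):
    """Tie retrieval, prompt, LLM, and output parsing together."""
    ...


def main():
    chunks = split_documents(ingest_pdf(DOC_PATH))
    vector_db = create_vector_db(chunks)
    # llm = ChatOllama(model=LLM_MODEL), then:
    # chain = create_chain(create_retriever(vector_db, llm), llm)
    ...


if __name__ == "__main__":
    main()
```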
7. Adding a Streamlit User Interface
It's even better to have a user-friendly UI!
With a few lines of Streamlit code, you can:
- Display a web-based chat interface
- Let users input questions
- Show answers interactively
Key additions:
- At the bottom of your main script, add the Streamlit logic (see the sketch below).
- All the heavy lifting (ingestion, retrieval, LLM call) stays the same—just the entry point and output are Streamlit widgets.
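A minimal sketch of the Streamlit layer, assuming a `build_chain()` helper that performs the ingestion and chain setup from the earlier sections (the helper name is illustrative):

```python
import streamlit as st


@st.cache_resource
def load_chain():
    """Build the RAG chain once and reuse it across reruns."""
    return build_chain()  # hypothetical helper wrapping ingestion + retriever + chain


st.title("Chat with your documents (local RAG)")

question = st.text_input("Ask a question about the document:")

if question:
    with st.spinner("Thinking..."):
        answer = load_chain().invoke(question)
    st.write(answer)
```

Caching the chain with `st.cache_resource` keeps the PDF from being re-ingested on every interaction.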
Result:
A local, private, flexible "chat with your docs" app—no cloud required!
Conclusion
You now have a full pipeline for RAG with Ollama and LangChain:
- Understand embeddings, vector stores, and retrieval
- Ingest and process any PDF or document
- Swap models/embeddings easily
- Add a friendly UI with Streamlit
Feel free to use, adapt, and expand the code for your own use cases!