This guide walks you through building a Retrieval-Augmented Generation (RAG) system using Ollama for local LLMs and LangChain for orchestration. You'll learn the theory, see a step-by-step workflow, and get hands-on with code—including a Streamlit UI!
1. Introduction to RAG and Its Motivation
In the real world, we often need applications that can answer questions based on our own documents, not just what a language model was trained on. This is where RAG—Retrieval Augmented Generation—comes in.
- RAG lets you chat with your own data by combining search (retrieval) with generative AI.
- LLMs are limited by their training data and can "hallucinate" (make things up). RAG fixes this by injecting relevant documents into the prompt, grounding answers in your data.
- The RAG pipeline:
  - Indexing: Documents (PDFs, text, etc.) are split into chunks, embedded (turned into vectors), and stored in a vector database.
  - Retrieval: User queries are embedded and matched to the most relevant chunks.
  - Generation: The model receives the query plus the retrieved context and generates a grounded response.
Key components needed for RAG:
- A large language model (LLM)
- Your document corpus (knowledge base)
- Embedding models to turn text into vectors
- A vector database (ChromaDB, Pinecone, etc.)
- A retrieval mechanism (often built into the database)
- Orchestration tools like LangChain to tie it all together
LangChain is a popular Python framework that simplifies working with LLMs, document loaders, splitters, embedding models, and vector stores.
2. Vector Stores and Embeddings: The Foundation
Before building, let's review the core concepts:
- Document Loading: Bring in data from files, URLs, PDFs, databases, etc.
- Splitting: Break long documents into manageable chunks.
- Embedding: Convert each chunk into a vector (a list of numbers) that reflects its semantic meaning.
- Vector Store: A special database (like ChromaDB) that stores both the embedding vectors and original text, making it searchable by meaning.
How does retrieval work?
- The user's question is also embedded into a vector.
- The vector store finds the most similar vectors (chunks) by comparing numbers.
- Content that is similar in meaning will have similar vectors ("cat" and "kitty" are close together, "cat" and "run" are not); see the short sketch after this list.
- Use cases: semantic search, recommendations, classification.
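To make this concrete, here is a small sketch that embeds a few words with Ollama and compares them. It assumes Ollama is running locally with the `nomic-embed-text` model pulled and the `langchain-community` package installed; the helper function is just for illustration.

```python
from math import sqrt

from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


cat, kitty, run = (embeddings.embed_query(w) for w in ("cat", "kitty", "run"))

print("cat vs kitty:", cosine_similarity(cat, kitty))  # semantically close -> higher score
print("cat vs run:  ", cosine_similarity(cat, run))    # less related -> lower score
```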
The full embedding pipeline:
- Split documents into chunks.
- Create embeddings for each chunk.
- Store embeddings + original text in the vector database.
- When querying:
  - Embed the question.
  - Search for similar vectors.
  - Retrieve and pass the best matches to the LLM (see the sketch after this list).
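Once the vector store exists (we build one in section 4), query-time retrieval boils down to a similarity search. A minimal sketch, assuming a ChromaDB store named `vector_db`:

```python
# Embed the question with the same model and return the closest chunks.
question = "How do I file for beneficial ownership information?"
docs = vector_db.similarity_search(question, k=4)

for doc in docs:
    # Each result carries the original text plus metadata such as the source file.
    print(doc.metadata.get("source"), "->", doc.page_content[:80])
```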
3. High-Level Design: What We'll Build
Here's the plan for our simple RAG system:
- Data: Load a PDF file (e.g., "Beneficial Ownership Information Report").
- Splitting: Use LangChain's splitters to chunk the PDF.
- Embedding: Use Ollama's embedding model (e.g., `nomic-embed-text`) via a LangChain wrapper.
- Vector Store: Store embeddings in ChromaDB.
- Retrieval: When a user asks a question, use LangChain's `MultiQueryRetriever` to:
  - Rewrite the question in several ways for better retrieval.
  - Search the vector store for relevant chunks.
- LLM Generation: Pass the context and query to Ollama's Llama 3 model to generate an answer.
Key point: You can swap out embedding models and LLMs as you wish, thanks to modular design.
4. Step-by-Step: Ingesting and Preparing the Data
Project setup:
- Place your PDF file (e.g., `boi.pdf`) in the `data/` directory.
- Create a `requirements.txt` file with dependencies (LangChain, ChromaDB, Ollama, etc.).
- Install with:
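For example (the Ollama model names match the ones used later in this guide):

```bash
pip install -r requirements.txt

# Pull the local models used below (assumes Ollama is already installed)
ollama pull nomic-embed-text
ollama pull llama3
```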
Python workflow:
- Load the PDF:
  - Use LangChain's unstructured or online PDF loader.
- Preview the data:
  - Print the first 100 words to verify loading works.
- Split the text:
  - Use `RecursiveCharacterTextSplitter` (chunk size 1200, overlap 300).
  - More overlap = better context retention.
- Check chunks:
  - Print the number of chunks and a preview; LangChain adds metadata (e.g., source file names).
- Embedding and Vector Store:
  - Use Ollama's `nomic-embed-text` model (or another embedding model) via LangChain, and store the embeddings in ChromaDB:
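Putting those steps together, here is a minimal sketch of the ingestion side. The file path, collection name, and the specific LangChain packages are assumptions; adjust them to your setup.

```python
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

DOC_PATH = "data/boi.pdf"
EMBED_MODEL = "nomic-embed-text"

# 1. Load the PDF
data = UnstructuredPDFLoader(file_path=DOC_PATH).load()
print(data[0].page_content[:100])  # preview to verify loading worked

# 2. Split into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=300)
chunks = splitter.split_documents(data)
print(f"Split into {len(chunks)} chunks")

# 3. Embed the chunks and store them in ChromaDB
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model=EMBED_MODEL),
    collection_name="simple-rag",
)
```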
5. Retrieval and Generation Pipeline
Retrieval:
- Use LangChain's `MultiQueryRetriever` and a custom prompt template to rewrite the user's question in multiple ways, maximizing relevant document retrieval (sketched below).
- Retrieve the most relevant chunks from the vector database.
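A sketch of the retriever setup, reusing the `vector_db` built above and a local Llama 3 chat model (the rewrite prompt wording is illustrative):

```python
from langchain.prompts import PromptTemplate
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.chat_models import ChatOllama

llm = ChatOllama(model="llama3")

# Prompt that asks the LLM to rephrase the user's question several ways.
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template=(
        "You are an AI assistant. Generate five different versions of the "
        "following user question to help retrieve relevant documents from a "
        "vector database. Provide the alternatives separated by newlines.\n"
        "Original question: {question}"
    ),
)

retriever = MultiQueryRetriever.from_llm(
    retriever=vector_db.as_retriever(),
    llm=llm,
    prompt=QUERY_PROMPT,
)
```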
Prompting and Chaining:
- Create a prompt template for the LLM to answer based only on provided context.
- Use LangChain's chaining to tie together retrieval, prompt assembly, and LLM generation (see the sketch below).
- Parse outputs for clean presentation.
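And a minimal sketch of the chain itself, reusing `retriever` and `llm` from the previous step (the answer prompt is illustrative):

```python
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

template = """Answer the question based ONLY on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# Retrieve context -> fill the prompt -> generate with the LLM -> parse to a string.
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("How do I file for beneficial ownership information?"))
```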
Example queries:
- "What is the document about?"
- "What are the main points as a business owner I should be aware of?"
- "How do I file for beneficial ownership information?"
All processing is done locally—no API fees, no data leaves your computer!
6. Refactoring: Clean and Modular Code
To make your code maintainable and reusable:
- Split logic into functions:
- PDF ingestion
- Splitting
- Embedding and vector DB creation
- Retriever setup
- Chain creation
- Use configuration variables/constants at the top.
- The main routine simply instantiates components and runs your questions.
This makes it easy to swap files, models, or add new features later. A skeleton of this structure is sketched below.
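The function and constant names here are illustrative, and the bodies are the sketches from the earlier sections:

```python
DOC_PATH = "data/boi.pdf"
EMBED_MODEL = "nomic-embed-text"
LLM_MODEL = "llama3"


def ingest_pdf(doc_path):
    """Load the PDF and return LangChain documents."""
    ...


def split_documents(documents):
    """Split documents into overlapping chunks."""
    ...


def create_vector_db(chunks):
    """Embed the chunks and store them in ChromaDB."""
    ...


def create_retriever(vector_db, llm):
    """Build the multi-query retriever."""
    ...


def create_chain(retriever, llm):
    """Tie retrieval, prompt, LLM, and output parsing together."""
    ...


def main():
    chunks = split_documents(ingest_pdf(DOC_PATH))
    vector_db = create_vector_db(chunks)
    # llm = ChatOllama(model=LLM_MODEL), then:
    # chain = create_chain(create_retriever(vector_db, llm), llm)
    ...


if __name__ == "__main__":
    main()
```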
7. Adding a Streamlit User Interface
It's even better to have a user-friendly UI!
With a few lines of Streamlit code, you can:
- Display a web-based chat interface
- Let users input questions
- Show answers interactively
Key additions:
- At the bottom of your main script, add the Streamlit logic (see the sketch below).
- All the heavy lifting (ingestion, retrieval, LLM call) stays the same—just the entry point and output are Streamlit widgets.
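A minimal sketch of the Streamlit layer, assuming a `build_chain()` helper that performs the ingestion and chain setup from the earlier sections (the helper name is illustrative):

```python
import streamlit as st


@st.cache_resource
def load_chain():
    """Build the RAG chain once and reuse it across reruns."""
    return build_chain()  # hypothetical helper wrapping ingestion + retriever + chain


st.title("Chat with your documents (local RAG)")

question = st.text_input("Ask a question about the document:")

if question:
    with st.spinner("Thinking..."):
        answer = load_chain().invoke(question)
    st.write(answer)
```

Caching the chain with `st.cache_resource` keeps the PDF from being re-ingested on every interaction.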
Result:
A local, private, flexible "chat with your docs" app—no cloud required!
Conclusion
You now have a full pipeline for RAG with Ollama and LangChain:
- Understand embeddings, vector stores, and retrieval
- Ingest and process any PDF or document
- Swap models/embeddings easily
- Add a friendly UI with Streamlit
Feel free to use, adapt, and expand the code for your own use cases!