[Local AI with Ollama] Building a RAG System with Ollama and LangChain

This guide walks you through building a Retrieval-Augmented Generation (RAG) system using Ollama for local LLMs and LangChain for orchestration. You'll learn the theory, see a step-by-step workflow, and get hands-on with code—including a Streamlit UI!

1. Introduction to RAG and Its Motivation

In the real world, we often need applications that can answer questions based on our own documents, not just what a language model was trained on. This is where RAG—Retrieval Augmented Generation—comes in.

  • RAG lets you chat with your own data by combining search (retrieval) with generative AI.
  • LLMs are limited by their training data and can "hallucinate" (make things up). RAG fixes this by injecting relevant documents into the prompt, grounding answers in your data.
  • The RAG pipeline:
  1. Indexing: Documents (PDFs, text, etc.) are split into chunks, embedded (turned into vectors), and stored in a vector database.
  2. Retrieval: User queries are embedded, compared, and matched to the most relevant chunks.
  3. Generation: The model receives the query plus retrieved context, generating a grounded response.

Key components needed for RAG:

  • A large language model (LLM)
  • Your document corpus (knowledge base)
  • Embedding models to turn text into vectors
  • A vector database (ChromaDB, Pinecone, etc.)
  • A retrieval mechanism (often built into the database)
  • Orchestration tools like LangChain to tie it all together

LangChain is a popular Python framework that simplifies working with LLMs, document loaders, splitters, embedding models, and vector stores.

2. Vector Stores and Embeddings: The Foundation

Before building, let's review the core concepts:

  • Document Loading: Bring in data from files, URLs, PDFs, databases, etc.
  • Splitting: Break long documents into manageable chunks.
  • Embedding: Convert each chunk into a vector (a list of numbers) that reflects its semantic meaning.
  • Vector Store: A special database (like ChromaDB) that stores both the embedding vectors and original text, making it searchable by meaning.

How does retrieval work?

  • The user's question is also embedded into a vector.
  • The vector store finds the most similar vectors (chunks) by comparing numbers.
  • Similar content in meaning will have similar vectors (“cat” and “kitty” are close together, “cat” and “run” are not).
  • Use cases: semantic search, recommendations, classification.
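
To see this concretely, here is a minimal sketch that embeds a few words with Ollama's nomic-embed-text model and compares them with cosine similarity (it assumes Ollama is running locally and the model has already been pulled; exact scores will vary):

```python
# A sketch: semantically similar text yields similar embedding vectors.
# Assumes Ollama is serving locally and `ollama pull nomic-embed-text` has been run.
import math

from langchain_ollama import OllamaEmbeddings  # older versions: langchain_community.embeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat, kitty, run = (embeddings.embed_query(word) for word in ["cat", "kitty", "run"])
print("cat vs kitty:", cosine_similarity(cat, kitty))  # relatively high
print("cat vs run:  ", cosine_similarity(cat, run))    # relatively low
```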

The full embedding pipeline:

  1. Split documents into chunks.
  2. Create embeddings for each chunk.
  3. Store embeddings + original text in the vector database.
  4. When querying:
  • Embed the question
  • Search for similar vectors
  • Retrieve and pass the best matches to the LLM
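
Here is a minimal end-to-end sketch of that pipeline using two in-memory text snippets instead of a real document (it assumes ChromaDB and the nomic-embed-text model are available); the full PDF version appears in section 4:

```python
# A sketch: index a couple of text chunks, then retrieve by meaning.
from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings

chunks = [
    "Cats are small, domesticated felines kept as pets.",
    "Ollama runs large language models on your own machine.",
]

# Steps 1-3: embed the chunks and store vectors plus text in the vector database.
vector_db = Chroma.from_texts(
    texts=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="demo",
)

# Step 4: embed the question, search for similar vectors, retrieve the best match.
for doc in vector_db.similarity_search("Tell me about pet cats.", k=1):
    print(doc.page_content)
```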

3. High-Level Design: What We'll Build

Here's the plan for our simple RAG system:

  • Data: Load a PDF file (e.g., "Beneficial Ownership Information Report").
  • Splitting: Use LangChain's splitters to chunk the PDF.
  • Embedding: Use Ollama's embedding model (e.g., nomic-embed-text) via a LangChain wrapper.
  • Vector Store: Store embeddings in ChromaDB.
  • Retrieval: When a user asks a question, use LangChain's MultiQueryRetriever to:
  • Rewrite the question in several ways for better retrieval.
  • Search the vector store for relevant chunks.
  • LLM Generation: Pass the context and query to Ollama's Llama 3 model to generate an answer.

Key point: You can swap out embedding models and LLMs as you wish, thanks to modular design.
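
One simple way to keep things swappable is to hold these choices in a handful of constants at the top of the script (names and values here are illustrative):

```python
# Illustrative configuration; change any of these without touching the pipeline code.
DOC_PATH = "data/boi.pdf"             # the PDF to ingest
EMBEDDING_MODEL = "nomic-embed-text"  # Ollama embedding model
LLM_MODEL = "llama3"                  # Ollama chat model for generation
COLLECTION_NAME = "simple-rag"        # ChromaDB collection name
```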

4. Step-by-Step: Ingesting and Preparing the Data

Project setup:

  • Place your PDF file (e.g., boi.pdf) in the data/ directory.
  • Create a requirements.txt file with dependencies (LangChain, ChromaDB, Ollama, etc.).
  • Install the dependencies with pip (see the sketch below).
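
An illustrative requirements.txt might look like the following (exact package names depend on your LangChain version; pin versions for reproducibility):

```
# requirements.txt (illustrative)
langchain
langchain-community
langchain-ollama
langchain-text-splitters
chromadb
unstructured   # may need the pdf extras, i.e. unstructured[pdf], for PDF parsing
streamlit
```

Install with pip install -r requirements.txt, and pull the Ollama models once with ollama pull nomic-embed-text and ollama pull llama3.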

Python workflow:

  1. Load the PDF: Use LangChain's UnstructuredPDFLoader (or an online PDF loader).
  2. Preview the data: Print the first 100 words to verify loading works.
  3. Split the text: Use RecursiveCharacterTextSplitter (chunk size 1200, overlap 300). More overlap = better context retention.
  4. Check the chunks: Print the number of chunks and preview one; LangChain adds metadata (e.g., source file names).
  5. Embedding and vector store: Use Ollama's nomic-embed-text model (or another embedding model) via LangChain and store the vectors in ChromaDB, as sketched below.
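
A sketch of that ingestion workflow might look like this (import paths assume recent langchain-community, langchain-text-splitters, and langchain-ollama packages; older releases expose the same classes under slightly different module paths):

```python
# A sketch: load the PDF, split it, embed the chunks, and store them in ChromaDB.
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

DOC_PATH = "data/boi.pdf"

# 1. Load the PDF.
data = UnstructuredPDFLoader(file_path=DOC_PATH).load()

# 2. Preview the first 100 words to verify loading works.
print(" ".join(data[0].page_content.split()[:100]))

# 3. Split the text into overlapping chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=300)
chunks = splitter.split_documents(data)

# 4. Check the chunks; LangChain attaches metadata such as the source file name.
print(f"Created {len(chunks)} chunks; first chunk metadata: {chunks[0].metadata}")

# 5. Embed the chunks and store them in ChromaDB.
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="simple-rag",
)
```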

5. Retrieval and Generation Pipeline

Retrieval:

  • Use LangChain's MultiQueryRetriever and a custom prompt template to rewrite user questions in multiple ways, maximizing relevant document retrieval.
  • Retrieve the most relevant chunks from the vector database.

Prompting and Chaining:

  • Create a prompt template for the LLM to answer based only on provided context.
  • Use LangChain's chaining to tie together retrieval, prompt assembly, and LLM generation.
  • Parse outputs for clean presentation.
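
Putting retrieval, prompting, and generation together might look like the following sketch (vector_db is the ChromaDB store built in the previous section; both prompt texts are illustrative):

```python
# A sketch: multi-query retrieval plus a grounded generation chain.
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3")

# Rewrite the user's question several ways to maximize recall from the vector store.
query_prompt = PromptTemplate(
    input_variables=["question"],
    template=(
        "You are an AI assistant. Generate five different versions of the "
        "user question to retrieve relevant documents from a vector database. "
        "Provide the alternatives separated by newlines.\nQuestion: {question}"
    ),
)
retriever = MultiQueryRetriever.from_llm(
    retriever=vector_db.as_retriever(), llm=llm, prompt=query_prompt
)

# Answer strictly from the retrieved context.
answer_prompt = ChatPromptTemplate.from_template(
    "Answer the question based ONLY on the following context:\n{context}\nQuestion: {question}"
)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | answer_prompt
    | llm
    | StrOutputParser()
)
```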

Example queries:

  • "What is the document about?"
  • "What are the main points as a business owner I should be aware of?"
  • "How do I file for beneficial ownership information?"

All processing is done locally—no API fees, no data leaves your computer!

6. Refactoring: Clean and Modular Code

To make your code maintainable and reusable:

  • Split logic into functions:
  • PDF ingestion
  • Splitting
  • Embedding and vector DB creation
  • Retriever setup
  • Chain creation
  • Use configuration variables/constants at the top.
  • The main routine simply instantiates components and runs your questions.

This makes it easy to swap files, models, or add new features later.
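
One possible shape for the refactored module (the function names are illustrative, not a fixed API; each body would hold the corresponding code from the earlier sketches):

```python
# Illustrative module layout after refactoring.
DOC_PATH = "data/boi.pdf"
EMBEDDING_MODEL = "nomic-embed-text"
LLM_MODEL = "llama3"

def ingest_pdf(path: str):
    """Load the PDF and return LangChain documents."""
    ...

def split_documents(documents, chunk_size: int = 1200, chunk_overlap: int = 300):
    """Split documents into overlapping chunks."""
    ...

def create_vector_db(chunks, model: str = EMBEDDING_MODEL):
    """Embed the chunks and store them in ChromaDB."""
    ...

def create_retriever(vector_db, llm):
    """Build the multi-query retriever."""
    ...

def create_chain(retriever, llm):
    """Tie retrieval, prompt assembly, and generation together."""
    ...

def main():
    chunks = split_documents(ingest_pdf(DOC_PATH))
    vector_db = create_vector_db(chunks)
    # ...instantiate the LLM, retriever, and chain, then run your questions.

if __name__ == "__main__":
    main()
```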

7. Adding a Streamlit User Interface

It's even better to have a user-friendly UI!

With a few lines of Streamlit code, you can:

  • Display a web-based chat interface
  • Let users input questions
  • Show answers interactively

Key additions:

  • At the bottom of your main script, add the Streamlit logic (a sketch follows this list).
  • All the heavy lifting (ingestion, retrieval, LLM call) stays the same—just the entry point and output are Streamlit widgets.
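
A minimal Streamlit entry point might look like this sketch (build_chain is an illustrative helper standing in for the create_* functions from the refactoring section):

```python
# A sketch: Streamlit front end over the existing local RAG chain.
import streamlit as st

st.title("Chat with your documents (local RAG)")

question = st.text_input("Ask a question about the PDF:")

if question:
    with st.spinner("Thinking..."):
        # build_chain() is an illustrative helper that assembles the
        # retriever + prompt + Ollama LLM chain from the earlier sections.
        chain = build_chain()
        st.write(chain.invoke(question))
```

Launch it with streamlit run followed by your script's filename, and the chat app opens in your browser.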

Result:

A local, private, flexible "chat with your docs" app—no cloud required!

Conclusion

You now have a full pipeline for RAG with Ollama and LangChain:

  • Understand embeddings, vector stores, and retrieval
  • Ingest and process any PDF or document
  • Swap models/embeddings easily
  • Add a friendly UI with Streamlit

Feel free to use, adapt, and expand the code for your own use cases!