[Local AI with Ollama] Building a Voice-Enabled RAG System with Ollama and ElevenLabs

In this article, we'll build a complete voice-enabled RAG (Retrieval-Augmented Generation) system around a sample document, pca_tutorial.pdf. The pipeline resembles a classic RAG demo, with one new component: a spoken audio response. We'll use Ollama for the LLM and embeddings, ChromaDB for vector storage, LangChain for orchestration, and ElevenLabs for text-to-speech output.

Key steps:

Load and split pca_tutorial.pdf
Generate embeddings and store in ChromaDB
Retrieve and answer questions with Ollama LLM
Convert the LLM answer to speech using ElevenLabs

Getting the ElevenLabs API Key

To enable voice output, you'll need an ElevenLabs API key (they offer free credit).

Go to your dashboard, click on "Your profile", and select "API Key".

Create a new key, name it, and copy it to a secure place.

You'll store this API key in a .env file for your project:

ELEVENLABS_API_KEY=your_actual_api_key_here
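As a rough sketch of how the key can be read safely at runtime, the helper below looks up a variable in the process environment and fails early with a clear message when it is missing. The name require_env is hypothetical, and the sketch assumes the .env file has already been loaded into the environment (for example with load_dotenv from python-dotenv, which the voice section of this article uses).

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Failing here is easier to debug than an authentication error later on
        raise RuntimeError(f"{name} is not set; add it to your .env file.")
    return value
```

Calling require_env("ELEVENLABS_API_KEY") right after loading the .env file would surface a missing key before any API request is made.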

The Pipeline: Loading, Splitting, Embedding, and Retrieval with pca_tutorial

Let's walk through the main pipeline using your pca_tutorial.pdf.

Load and Split pca_tutorial.pdf

Use UnstructuredPDFLoader from langchain_community to load pca_tutorial.pdf, then split the document into overlapping chunks for embedding.

from langchain_community.document_loaders import UnstructuredPDFLoader
import os

pdf_file = "./data/pca_tutorial.pdf"
embedding_model_name = "nomic-embed-text"
llm_model_name = "llama3.2"

# Load and Ingest a PDF
if os.path.exists(pdf_file):
    pdf_loader = UnstructuredPDFLoader(file_path=pdf_file)
    documents = pdf_loader.load()
    print("[✓] PDF file loaded successfully.")
else:
    raise FileNotFoundError("PDF file not found. Please upload a valid file.")

# Split the PDF into Chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=300)
doc_chunks = splitter.split_documents(documents)
print(f"[✓] Document split into {len(doc_chunks)} chunks.")
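To make the effect of chunk_size and chunk_overlap concrete, here is a simplified sketch of overlapping chunking using a fixed sliding window over characters. This is an illustration only: RecursiveCharacterTextSplitter prefers to split on separators such as paragraphs and sentences, so its real chunk boundaries will differ. The function name sliding_chunks is hypothetical.

```python
def sliding_chunks(text, chunk_size=1200, chunk_overlap=300):
    """Split text into fixed-size windows where consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

With the values used above, each window advances by 900 characters, so the last 300 characters of one chunk reappear at the start of the next. The overlap helps keep a sentence that straddles a boundary intact in at least one chunk.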

Embed Chunks into Chroma Vector Database

Use Ollama's embedding model to convert chunks into vectors and store them in a ChromaDB collection.

from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
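import ollama

# Make sure the embedding model is available locally before embedding
print("Pulling embedding model...")
ollama.pull(embedding_model_name)

# Embed every chunk and store the vectors in a Chroma collection
vector_store = Chroma.from_documents(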
    documents=doc_chunks,
    embedding=OllamaEmbeddings(model=embedding_model_name),
    collection_name="pca_tutorial_collection",
)
print("[✓] Chunks embedded and stored in vector database.")
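The idea behind vector retrieval can be sketched in a few lines: each chunk becomes a vector, and a query is answered by ranking the stored vectors by similarity to the query vector. The toy example below uses cosine similarity and hand-written two-dimensional vectors purely for illustration. Real embeddings have hundreds of dimensions, and the distance metric Chroma actually uses depends on how the collection is configured. The names cosine_similarity and top_k are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the texts of the k stored (text, vector) pairs most similar to query_vec."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```

A vector database performs the same ranking, but with index structures that avoid comparing the query against every stored vector.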

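Set Up the Retriever with Multi-Query

MultiQueryRetriever asks the LLM to rephrase the user's question in several ways, runs a retrieval for each variant, and merges the unique results. This makes retrieval less sensitive to the exact wording of the original question.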

from langchain_ollama import ChatOllama
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_core.runnables import RunnablePassthrough

llm = ChatOllama(model=llm_model_name)

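# The retriever splits the LLM output on newlines, so ask for one question per line
multi_query_prompt = PromptTemplate(
    input_variables=["question"],
    template=(
        "You are an AI assistant. Generate 5 different versions of the user question "
        "to improve document retrieval from a vector database. "
        "Provide these alternative questions separated by newlines.\n"
        "Original question: {question}"
    ),
)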

retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(),
    llm=llm,
    prompt=multi_query_prompt,
)

print("[✓] Retriever with multi-query setup complete.")
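Conceptually, the multi-query step boils down to "generate variants, retrieve for each, keep the unique union". The sketch below shows that logic with plain callables standing in for the LLM and the vector store. The names multi_query_retrieve, generate_variants, and retrieve are hypothetical and are not part of LangChain's API, and documents are modeled as plain strings for simplicity.

```python
def multi_query_retrieve(question, generate_variants, retrieve):
    """Retrieve documents for every rephrased query and merge them without duplicates."""
    queries = generate_variants(question)  # in the real pipeline, the LLM produces these
    seen = set()
    merged = []
    for query in queries:
        for doc in retrieve(query):  # in the real pipeline, a vector store lookup
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)  # preserve first-seen order
    return merged
```

Because each variant may surface chunks the others miss, the merged list tends to cover more relevant context than a single query would.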

Ask Questions with RAG Chain

Build a RAG chain that injects the retrieved chunks into a prompt, passes that prompt to the LLM, and parses the reply into a plain string.

from langchain_core.output_parsers import StrOutputParser

context_prompt = ChatPromptTemplate.from_template(
    "Answer the question using ONLY the context below:\n{context}\n\nQuestion: {question}"
)

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | context_prompt
    | llm
    | StrOutputParser()
)

user_question = "What is the goal of PCA?"
response = rag_chain.invoke(user_question)
print("[✓] Finished answering the question.")
print(response)
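One optional refinement: the chain above hands the retriever's output (a list of document objects) straight to the prompt, so the context is rendered as the string form of that list. A small formatting helper can join just the chunk text instead. The sketch below uses a hypothetical FakeDocument class as a stand-in for LangChain's Document, modeling only its page_content attribute; format_docs is likewise an illustrative name.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document (only page_content is modeled)."""
    page_content: str


def format_docs(docs):
    """Join the text of each chunk with blank lines so the prompt receives plain text."""
    return "\n\n".join(doc.page_content for doc in docs)
```

In the chain, this would likely be wired in by using retriever | format_docs in place of retriever under the "context" key, assuming LangChain's usual coercion of plain functions into runnables.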

Generate and Play Voice Output with ElevenLabs

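Take the LLM's answer, synthesize it with ElevenLabs, and play it directly. The stream helper from the elevenlabs package plays audio through the mpv media player, so mpv needs to be installed and available on your PATH.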

import os
from dotenv import load_dotenv
from elevenlabs import VoiceSettings, stream
from elevenlabs.client import ElevenLabs

# Load environment variables from .env file
load_dotenv()
api_key = os.getenv("ELEVENLABS_API_KEY")
if not api_key:
    raise RuntimeError("ELEVENLABS_API_KEY is not set. Add it to your .env file.")

# Initialize ElevenLabs client
client = ElevenLabs(api_key=api_key)

# Generate audio stream from text
audio_stream = client.text_to_speech.stream(
    voice_id="pNInz6obpgDQGcFmaJgB",  # Adam voice ID (pre-made voice)
    output_format="mp3_22050_32",     # Compressed MP3 format
    text=response,
    model_id="eleven_multilingual_v2",
    voice_settings=VoiceSettings(
        stability=0.0,
        similarity_boost=1.0,
        style=0.0,
        use_speaker_boost=True,
        speed=1.0,
    )
)

print("[✓] Audio stream generated.")

# Play the audio stream directly through speakers
stream(audio_stream)

You could also save the audio stream to a file for future playback.
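A minimal sketch of saving the audio, assuming the stream yields chunks of bytes (which is how the streaming call above appears to behave). The helper name save_audio and the output file name are hypothetical.

```python
def save_audio(chunks, path):
    """Write an iterable of byte chunks to a file and return the number of bytes written."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks, if any
                f.write(chunk)
                total += len(chunk)
    return total
```

For example, save_audio(audio_stream, "answer.mp3") would write the response to disk. Keep in mind that a stream can typically be consumed only once, so a given stream can be either played or saved; to do both, request a fresh stream or collect the chunks into a list first.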

Full Example Code

Below is the complete code for building your Voice-Enabled RAG System using pca_tutorial.pdf with Ollama, ChromaDB, LangChain, and ElevenLabs.

from langchain_community.document_loaders import UnstructuredPDFLoader
import os

pdf_file = "./data/pca_tutorial.pdf"
embedding_model_name = "nomic-embed-text"
llm_model_name = "llama3.2"

# Load and Ingest a PDF
if os.path.exists(pdf_file):
    pdf_loader = UnstructuredPDFLoader(file_path=pdf_file)
    documents = pdf_loader.load()
    print("[✓] PDF file loaded successfully.")
else:
    raise FileNotFoundError("PDF file not found. Please upload a valid file.")

# Split the PDF into Chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=300)
doc_chunks = splitter.split_documents(documents)
print(f"[✓] Document split into {len(doc_chunks)} chunks.")

# Embed Chunks into Vector Database
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
import ollama

print("Pulling embedding model...")
ollama.pull(embedding_model_name)

vector_store = Chroma.from_documents(
    documents=doc_chunks,
    embedding=OllamaEmbeddings(model=embedding_model_name),
    collection_name="pca_tutorial_collection",
)
print("[✓] Chunks embedded and stored in vector database.")

# Set Up the Retriever with Multi-Query
from langchain_ollama import ChatOllama
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_core.runnables import RunnablePassthrough

llm = ChatOllama(model=llm_model_name)

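# The retriever splits the LLM output on newlines, so ask for one question per line
multi_query_prompt = PromptTemplate(
    input_variables=["question"],
    template=(
        "You are an AI assistant. Generate 5 different versions of the user question "
        "to improve document retrieval from a vector database. "
        "Provide these alternative questions separated by newlines.\n"
        "Original question: {question}"
    ),
)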

retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(),
    llm=llm,
    prompt=multi_query_prompt,
)

print("[✓] Retriever with multi-query setup complete.")

# Ask Questions with RAG Pipeline
from langchain_core.output_parsers import StrOutputParser

context_prompt = ChatPromptTemplate.from_template(
    "Answer the question using ONLY the context below:\n{context}\n\nQuestion: {question}"
)

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | context_prompt
    | llm
    | StrOutputParser()
)

user_question = "What is the goal of PCA?"
response = rag_chain.invoke(user_question)
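print("[✓] Finished answering the question.")
print(response)

# Generate and Play Voice Output with ElevenLabs
from dotenv import load_dotenv
from elevenlabs import VoiceSettings, stream
from elevenlabs.client import ElevenLabs

# Load environment variables from .env file
load_dotenv()
api_key = os.getenv("ELEVENLABS_API_KEY")
if not api_key:
    raise RuntimeError("ELEVENLABS_API_KEY is not set. Add it to your .env file.")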

# Initialize ElevenLabs client
client = ElevenLabs(api_key=api_key)

# Generate audio stream from text
audio_stream = client.text_to_speech.stream(
    voice_id="pNInz6obpgDQGcFmaJgB",  # Adam voice ID (pre-made voice)
    output_format="mp3_22050_32",     # Compressed MP3 format
    text=response,
    model_id="eleven_multilingual_v2",
    voice_settings=VoiceSettings(
        stability=0.0,
        similarity_boost=1.0,
        style=0.0,
        use_speaker_boost=True,
        speed=1.0,
    )
)

print("[✓] Audio stream generated.")

# Play the audio stream directly through speakers
stream(audio_stream)

With just a few tools and APIs, we've built a complete voice-enabled RAG system. This setup bridges text and speech, making AI interactions more dynamic and intuitive.
