
RAG Applications: What I've Learned So Far

Posted on: May 11, 2026

Welcome, Developer 👋

I want to be upfront about something: I’m not writing this from the other side of a finished, polished, production-grade RAG system. I’m writing this from the middle of it. Months into studying the pattern and building a proof of concept, with enough understanding now that I feel like I can explain it without embarrassing myself.

That’s actually why I wanted to write this post. Most RAG content out there is either too shallow (“it retrieves stuff and gives it to the LLM, next!”) or written by people who assume you’re already deep in the ML world. I came at this from a product angle, not a research angle. And I think that perspective is worth something.

So here’s what I know. Some of it I learned the hard way. Some of it I’m still figuring out.


What RAG Actually Is

RAG stands for Retrieval-Augmented Generation. The name is a mouthful but the idea is straightforward once it clicks.

LLMs (the models behind ChatGPT, Claude, all of them) are trained on a massive snapshot of the world up to a certain date. After training, they’re frozen. They don’t know what happened last week. They don’t know your internal docs. They don’t know the content of your database. Ask them about any of that and they’ll either make something up, which is called hallucinating, or give you a vague, generic answer that’s technically not wrong but completely useless.

RAG fixes this by adding a step before the model generates a response. Instead of asking the LLM “what do you know about X?”, you first go and retrieve relevant information from your own data sources, then hand that to the LLM and say: here’s some context, now answer the question.

The result is an AI that can answer questions grounded in your actual data. More accurate. Up to date. And you can trace the answer back to a source, which matters a lot when someone asks “where did that come from?”

The thing that made this easy to understand for me was thinking about it as giving the model a cheat sheet before the exam. The model is still doing the reasoning. You’re just making sure it has the right material in front of it.


Why It Matters (and Why I Started Looking Into It)

I started exploring RAG because I was scoping a proof of concept. An AI assistant that could answer questions based on a specific knowledge base. The kind of feature where a user asks something in plain language and gets a specific, accurate answer back, not a generic one.

The first thing I tried was stuffing the relevant content directly into the prompt. That works up to a point. Then you hit context window limits. And costs start adding up fast when you’re sending thousands of tokens with every request.

Fine-tuning the model was the other obvious option. But fine-tuning is expensive, time-consuming, and the moment your data changes you have to do it again. Not practical for most product teams.

RAG is the middle ground. You don’t retrain the model. You don’t stuff everything into the prompt. You retrieve only what’s relevant, inject it, and get a grounded answer. That’s why it’s become the default pattern for so many enterprise AI applications.


How It Works

There are two phases: getting your data ready (ingestion), and actually answering questions (retrieval). I’ll walk through both.

Getting Your Data Ready

This happens offline, before any user ever asks a question. Think of it as building the library.

Collect your documents. Your knowledge base can be anything: PDFs, markdown files, Confluence pages, database records, API responses. Whatever is relevant to the questions your app needs to answer.

Chunk them. You can’t feed an entire PDF to a vector database. You split documents into smaller pieces called chunks, usually a paragraph or a few hundred tokens each. The chunking strategy matters more than I initially thought. Too small and you lose context. Too large and your retrieval gets noisy. I’ve been experimenting with 512-token chunks and a 50-token overlap as a starting point.
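
If you want to see what a chunking pass actually produces before wiring up the full pipeline, here’s a minimal sketch using LlamaIndex’s SentenceSplitter, the same splitter the full example at the end of this post configures. The sample text is just a placeholder:

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

# Same chunking settings as the full example: 512 tokens, 50-token overlap
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)

# Placeholder document; in practice this comes from your document loader
doc = Document(text="Your long document text goes here. " * 200)

nodes = splitter.get_nodes_from_documents([doc])
print(f"{len(nodes)} chunks")
print(nodes[0].get_content()[:200])  # peek at the first chunk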

Generate embeddings. Each chunk gets converted into a vector, which is basically a list of numbers that represents the semantic meaning of the text. This is done by an embedding model. The interesting thing is that semantically similar text ends up with numerically similar vectors, even if the words are completely different. “How do I cancel my account” and “I want to stop my subscription” end up close together. That’s the magic.
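
The quickest way to convince yourself of that is to embed a few sentences and compare the vectors. Here’s a rough sketch using the nomic-embed-text model through Ollama (the same embedding model the full example at the end uses); cosine similarity is just a normalised dot product:

import math
from llama_index.embeddings.ollama import OllamaEmbedding

# Assumes Ollama is running locally with nomic-embed-text pulled
embed_model = OllamaEmbedding(model_name="nomic-embed-text")

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = embed_model.get_text_embedding("How do I cancel my account?")
v2 = embed_model.get_text_embedding("I want to stop my subscription")
v3 = embed_model.get_text_embedding("The weather in Lisbon is nice in May")

print(cosine(v1, v2))  # high: same meaning, different words
print(cosine(v1, v3))  # noticeably lower: unrelated topic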

Store them in a vector database. The vectors and the original text go into a vector database. These databases are built specifically for similarity search, which is what powers the retrieval step. I wrote a whole post on how vector databases work under the hood and how to pick one. Check out Vector Databases Explained if you want to go deeper on that part before continuing here.

Answering the Question

Now a user types something. Here’s what happens:

The user’s question gets converted into a vector using the same embedding model. Same model, same vector space. That’s important.

The vector database compares that query vector against everything it has stored and returns the top N most similar chunks. This is the retrieval step.
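
In LlamaIndex terms this step is a retriever. Assuming an index built the way the full example at the end of this post does, retrieval on its own looks roughly like this:

# Retrieval only, no generation: the top 5 most similar chunks for a query.
# Assumes `index` was built as in the full example further down.
retriever = index.as_retriever(similarity_top_k=5)
results = retriever.retrieve("How do I reset my password?")

for r in results:
    print(round(r.score, 3), r.node.get_content()[:80])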

Those chunks get injected into the prompt alongside the question. The LLM sees something like:

You are a helpful assistant. Use the following context to answer the question.

Context:
[retrieved chunk 1]
[retrieved chunk 2]
[retrieved chunk 3]

Question: How do I reset my password?

Answer:

Then the LLM generates a response grounded in that context. It’s not guessing anymore. It’s reading.

That’s the full loop. I kept waiting for it to be more complicated than that. It isn’t. The complexity is in making it work well, which I’ll get to.


The Architecture, Roughly

If you’re thinking about building this, here’s the shape of what you’re dealing with:

DATA PIPELINE (runs offline)
Raw Docs → Chunker → Embedding Model → Vector DB

QUERY PIPELINE (runs per request)
User Question → Embedding Model → Vector DB Search
  → Prompt Builder → LLM → Response

In a real implementation you’ll also end up wanting a reranker: a secondary model that re-scores retrieved chunks before they go into the prompt, which makes a noticeable difference in response quality. And metadata filtering, so you can restrict retrieval by document type, date, or user permissions.
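
For reference, in LlamaIndex both bolt onto the query engine. This is a rough sketch rather than something I’ve run in my own POC; the doc_type metadata key is made up for illustration, and SentenceTransformerRerank needs the sentence-transformers package installed:

from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter

# Re-score retrieved chunks with a cross-encoder, keep the best 3
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2", top_n=3
)

# `index` is the one built in the full example at the end of this post.
# "doc_type" is a hypothetical metadata field; use whatever your docs carry.
query_engine = index.as_query_engine(
    similarity_top_k=10,                  # retrieve wide...
    node_postprocessors=[reranker],       # ...then rerank down to 3
    filters=MetadataFilters(
        filters=[ExactMatchFilter(key="doc_type", value="runbook")]
    ),
)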

I haven’t built all of these yet in my POC. But I’ve hit the walls that make you understand why they exist.


What You Can Actually Build With This

The use cases that made this feel real to me as I was studying it:

Support chatbots that answer from your actual documentation, not from generic training data. When the docs change, the bot knows. No retraining.

Internal assistants for HR policies, engineering runbooks, compliance documents. Employees ask in plain language, get accurate answers with a source they can check.

Document Q&A. Instead of reading a 200-page contract, ask a question and get the relevant section back.

Search that understands meaning. Not keyword matching, but semantic retrieval. “Vehicle failure on the highway” finds the article titled “Car breakdown procedures” even though there’s no word overlap.

The pattern is always the same. Knowledge base in, question in, grounded answer out.


RAG vs. Fine-Tuning vs. Just Prompting

This is the question I spent a lot of time on before I understood where each one belongs.

Prompt engineering: Quick behaviour changes. No data required. Best for tone, format, simple instructions.
RAG: Your data changes often, you need source attribution, or you want domain-specific answers without the cost of training.
Fine-tuning: You want the model to write in a specific style or deeply understand domain language.
Pre-training: You’re a well-funded company with a research team and billions of tokens of proprietary data.

For most product teams building on top of existing LLMs, RAG is where you start. It’s the fastest path from “we want an AI feature” to something that actually works. Fine-tuning comes later, if ever, and when you do fine-tune you can still layer RAG on top.


The Things That Are Harder Than They Look

I want to be honest about the parts that tripped me up, because most posts gloss over them.

Retrieval quality is the real bottleneck. You can have the best LLM in the world and still get bad answers if you’re retrieving the wrong chunks. The chunking strategy, the embedding model, whether you add a reranker. All of this matters more than which LLM you pick.

Chunk size is not a solved problem. I’ve read a dozen different recommendations. The honest answer is that it depends on your documents and your use case. 512 tokens with overlap is a reasonable starting point. You’ll likely need to tune it.

Stale data will sneak up on you. The vector database reflects the state of your documents at the time they were ingested. If the docs change and you don’t re-ingest, the answers go stale. Build the refresh pipeline early, not as an afterthought.
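
One pattern I’ve been looking at for this in LlamaIndex, though I haven’t battle-tested it yet, is loading documents with stable IDs so the index can re-embed only what actually changed. A sketch, reusing the index and reader from the full example further down:

# Re-ingestion pass: load docs with the filename as a stable ID and let
# LlamaIndex re-embed only the documents whose content changed.
# Treat this as a starting point, not a finished refresh pipeline.
documents = SimpleDirectoryReader("./docs", filename_as_id=True).load_data()
refreshed = index.refresh_ref_docs(documents)
print(f"{sum(refreshed)} documents were re-ingested")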

Evaluation is genuinely hard. How do you know if your RAG pipeline is actually good? You need to measure retrieval quality (did you retrieve the right chunks?) and generation quality (is the answer faithful to what was retrieved?). RAGAS is the tool most people use for this. I’d set it up before you think you need it.


The Tools I’ve Been Looking At

I’m not going to pretend I’ve used all of these in depth. But based on what I’ve read and what I’ve experimented with:

For building the pipeline, LangChain and LlamaIndex are the two you’ll see everywhere. LangChain has the largest community and the most integrations. LlamaIndex is more focused specifically on the data and retrieval side, and the abstractions feel cleaner for RAG specifically. I’ve been using LlamaIndex for my POC and it’s the one I’d recommend starting with.

For vector storage, I started with Chroma because setup is almost trivially easy; it’s great for local development and for learning the pattern without fighting infrastructure. I covered the broader landscape including Pinecone, Qdrant, Weaviate, and pgvector in Vector Databases Explained if you’re trying to make that call.

For evaluation, RAGAS. Metrics for context precision, recall, faithfulness, answer relevancy. Run it against your pipeline before you call anything done.
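
I’ve only scratched the surface of RAGAS myself, and its API has been changing between releases, so take this as the rough shape of an evaluation run rather than copy-paste-ready code. It also needs a judge LLM configured, which by default means an OpenAI key:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# One hand-written example row; in practice you'd build this from your own
# test questions and the chunks/answers your pipeline actually produced.
eval_data = Dataset.from_dict({
    "question":     ["How do I reset my password?"],
    "answer":       ["Go to Settings > Security and click Reset password."],
    "contexts":     [["Passwords can be reset from Settings > Security."]],
    "ground_truth": ["Reset it from the Security tab in Settings."],
})

result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)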

My starting point recommendation if you’re new to this: LlamaIndex + Chroma + RAGAS. Low friction, well documented, and you can swap parts as you understand what you actually need.


Let me show you

Before running the code, you need Ollama installed and two models pulled. That’s it: no API keys, no accounts, and it runs entirely on your machine.

# Install Ollama from https://ollama.com, then:
ollama pull llama3.2        # the LLM that generates the answer
ollama pull nomic-embed-text  # the embedding model

Then install the Python dependencies:

pip install llama-index llama-index-llms-ollama llama-index-embeddings-ollama llama-index-vector-stores-chroma chromadb

Here’s the full pipeline:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
import chromadb
 
# Point LlamaIndex at your local Ollama models
Settings.llm = Ollama(model="llama3.2", request_timeout=120.0)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
 
# Load your documents from a local folder
documents = SimpleDirectoryReader("./docs").load_data()
 
# Set up Chroma as the vector store, persisted to disk
chroma_client = chromadb.PersistentClient(path="./chroma_db")
 
# Reset the collection if you want to re-index from scratch
# (remove this block once your docs are stable)
try:
    chroma_client.delete_collection("my-knowledge-base")
except Exception:
    pass  # the collection doesn't exist yet on the first run
chroma_collection = chroma_client.get_or_create_collection("my-knowledge-base")
 
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
 
# Build the index — chunks, embeds, and stores everything in one call
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)
 
# Ask something
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What is this content about?")
 
print(response.response)

Save that as rag.py. Your project structure should look like this:

my-rag-project/
├── rag.py
├── chroma_db/        # created automatically on first run
└── docs/
    └── your-file.txt  # put your documents here - txt, md, pdf, etc.

Then run it:

python3 rag.py  # macOS / Linux
python rag.py   # Windows

The ./docs folder is your knowledge base. Put in there whatever you want the model to answer questions about. For this example, that means any documentation related to a subject — a markdown file explaining the steps, a PDF from your internal wiki, a plain text file with the support team’s notes. The question at the bottom ("What is this content about?") will be answered based on whatever you put in that folder. Change the documents, change the question, same code.

LlamaIndex handles the chunking, embedding, and retrieval. Chroma persists to disk so you don’t re-index on every run. And because everything goes through Ollama, it’s completely free and works offline.

Real production code needs error handling, caching, proper ingestion pipelines, and observability on top of this. But this is the loop, and once you see it running the whole thing makes a lot more sense.


Where This Is Going

A few things I’ve been reading about that seem worth watching:

GraphRAG builds a knowledge graph from your documents instead of treating them as isolated chunks. Better for questions that need to connect information across multiple sources.

Agentic RAG goes further. Instead of one retrieval step, the model decides what to search for, evaluates the results, searches again if needed, and reasons across multiple sources. More powerful, more moving parts.

Hybrid search combines semantic vector search with keyword-based (BM25) search. Semantic understanding and exact term matching in the same query. This one is already pretty widely used and I want to explore it more in my POC.
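
If you’re curious what hybrid search looks like in LlamaIndex, here’s a rough sketch that fuses a BM25 keyword retriever with the vector retriever from the full example above. It assumes the llama-index-retrievers-bm25 package is installed and that index, documents, and Settings come from that example; I haven’t run this in my own POC yet:

from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

# BM25 works over plain nodes, so chunk the documents again for it
nodes = Settings.node_parser.get_nodes_from_documents(documents)

vector_retriever = index.as_retriever(similarity_top_k=5)
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=5)

hybrid = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,               # skip LLM-generated query variations
    mode="reciprocal_rerank",    # merge the two ranked lists
)

results = hybrid.retrieve("vehicle failure on the highway")
for r in results:
    print(round(r.score, 3), r.node.get_content()[:80])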

I’ll probably write about these as I dig into them. Still learning.


Conclusion

RAG is how you give an LLM access to your specific data without retraining it. You chunk your documents, turn them into vectors, store them, and at query time you retrieve the most relevant pieces and include them in the prompt. The model answers based on what you gave it, not just what it was trained on.

It’s the most useful pattern I’ve come across for building real AI features on top of existing models. I’m still figuring out the finer points: chunking strategy, reranking, evaluation. But the core is simple enough that you can build something working pretty quickly.

If you’re exploring it too, I’d genuinely love to hear what you’re building.

Stay focused, Developer!