2025-12-20
RAG Systems - From Concept to Production
RAG is not one magic vector DB. We walk through chunking text, picking embeddings, and what to measure when answers still sound confident and wrong.
9 min read
Retrieval-augmented generation sounds like one feature flag: turn on vectors, attach OpenAI, ship. In production it is a pipeline: how you cut documents, how you embed them, how you search, and how you ask the model to stay inside the lines. This article walks through those pieces and what breaks when you skip straight from demo to customers.
Ingestion
Chunking is a tradeoff. Tiny chunks retrieve precisely but lose context; huge chunks pollute the prompt. Start with clear boundaries (sections, paragraphs) and overlap if users ask questions that span two chunks.
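The sliding-window-with-overlap idea can be sketched in a few lines. This is an illustrative character-based splitter, not any library's API; real pipelines usually split on section or paragraph boundaries first and fall back to a window like this for long unbroken runs. The function name and defaults are made up for the example.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list:
    """Split text into overlapping character windows.

    Overlap means a question that spans a chunk boundary still has
    both halves of its answer inside at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # how far the window slides each iteration
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # last window already covered the tail
    return chunks
```

Tuning `chunk_size` and `overlap` against real user questions matters more than the exact splitting mechanics.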
Use the same embedding model at index time and query time; mix models and the similarity scores become meaningless. Store vectors in something built for search, whether that is hosted Pinecone, pgvector in Postgres, or another store your team can operate.
Metadata (source file, page, product id) lets you filter before vector search and cite answers in the UI. Without it, users get confident paragraphs with no footnotes.
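A minimal sketch of what "filter on metadata before vector search" looks like, assuming an in-memory record type. The `ChunkRecord` fields and `prefilter` helper are invented for illustration; in Pinecone or pgvector the same idea shows up as a metadata filter or a `WHERE` clause.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChunkRecord:
    text: str
    embedding: list              # vector from the same model used at query time
    source: str                  # e.g. a file path, shown as the citation
    page: Optional[int] = None
    product_id: Optional[str] = None

def prefilter(records, product_id=None, source=None):
    """Cheap metadata filter applied before the vector search.

    Shrinks the candidate set, and the surviving records carry the
    source/page fields the UI needs for footnotes.
    """
    kept = []
    for r in records:
        if product_id is not None and r.product_id != product_id:
            continue
        if source is not None and r.source != source:
            continue
        kept.append(r)
    return kept
```

The citation in the UI is just `record.source` and `record.page` echoed back, which is why skipping metadata at ingestion time leaves you with no footnotes later.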
Retrieval
Pull top-k neighbors by similarity, then tune k and chunk size against real questions. Reranking with a second model can help when top-k is noisy. Hybrid search (keyword plus vector) helps when users type SKUs, names, or error codes that embeddings alone miss.
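One way to picture hybrid search is a blended score: cosine similarity from the vectors plus exact-token overlap for the literal strings embeddings miss. This is a toy sketch with an invented `hybrid_top_k` function and a made-up `alpha` blend weight, not how any particular vector store implements it (most use BM25-style scoring and rank fusion).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_top_k(query_text, query_vec, chunks, k=3, alpha=0.7):
    """chunks: list of (text, vector) pairs.

    Blends vector similarity with exact-token overlap so queries
    containing SKUs, names, or error codes still hit the right chunk
    even when the embedding alone would not.
    """
    q_tokens = set(query_text.lower().split())

    def score(chunk):
        text, vec = chunk
        keyword = (len(q_tokens & set(text.lower().split())) / len(q_tokens)
                   if q_tokens else 0.0)
        return alpha * cosine(query_vec, vec) + (1 - alpha) * keyword

    return sorted(chunks, key=score, reverse=True)[:k]
```

In the toy example below both chunks embed identically, and only the keyword term breaks the tie, which is exactly the failure mode hybrid search exists for.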
Generation
The prompt should include the retrieved chunks and the user question, and it should tell the model to stay within that text and to say when it cannot answer. Citations in the UI come from metadata you stored, not from the model making up filenames.
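Assembling that prompt is mostly string formatting. A hedged sketch, with invented wording and a made-up `build_prompt` helper; the important parts are that the citation tags come from stored metadata and that the instructions give the model an explicit way out.

```python
def build_prompt(question, retrieved):
    """retrieved: list of (chunk_text, source) pairs from retrieval.

    The [n] (source) tags are built from metadata we stored at
    ingestion, so the UI can render real footnotes instead of
    trusting the model to invent filenames.
    """
    context = "\n\n".join(
        f"[{i}] ({source}) {text}"
        for i, (text, source) in enumerate(retrieved, start=1)
    )
    return (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, reply \"I don't know.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

The exact wording is something you iterate on against real failures; the structure (context first, question, an explicit refusal path) is the stable part.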
Production
Watch latency (embedding plus LLM), cost (tokens in and out), and quality (human spot checks or evals). RAG fails quietly: answers look fluent while being wrong. You fix that with better chunks, better retrieval, and honest “I don’t know” behavior.
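One cheap tripwire for the quiet-failure mode: check what fraction of the answer's substantive tokens actually appear in the retrieved context. This heuristic and the `grounding_ratio` name are our invention for illustration, not a real eval; production systems lean on labeled eval sets or LLM-as-judge, and this kind of ratio is only useful as an alert on sampled traffic.

```python
def grounding_ratio(answer: str, context: str) -> float:
    """Fraction of substantive answer tokens also present in the context.

    Crude: it misses paraphrase and rewards parroting. But a sudden
    drop on sampled traffic is a cheap signal that answers have
    drifted away from the retrieved text.
    """
    tokens = [t.strip(".,").lower() for t in answer.split() if len(t) > 3]
    if not tokens:
        return 1.0  # nothing substantive to check
    ctx = {t.strip(".,").lower() for t in context.split()}
    return sum(t in ctx for t in tokens) / len(tokens)
```

Pair a signal like this with latency and token-cost logging per request, and you have the minimum dashboard for noticing regressions before users do.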
We build RAG pipelines with LangChain, vector stores, and the boring eval work that turns a demo into something you can support.
Cogent Softwares, AI and full-stack development.