Most RAG Fails Because Teams Skip Search Basics

Most RAG projects fail for a reason nobody wants to hear. The model is fine. The search underneath it is broken. A team wires up embeddings, gets weak answers, then decides the problem is the LLM and goes shopping for a bigger one. That is how budgets disappear with nothing to show for it. We fixed that pattern, and the fix is not a model. It is search basics done in the right order.
This is the guide we actually run. It is built on a repo that works in the real world and adapted for your internal databases. None of it is exotic. That is the point.
The Stack That Ships Fast and Stays Stable
When a founder asks me what to run, I give the same short list every time. A FastAPI service for the API. OpenSearch for BM25 and filters, which means exact keyword matching and the ability to scope results by who is asking. Embeddings for semantic lift, but only when they actually help. PostgreSQL for metadata and joins, so structured facts stay structured. Airflow for the pipelines that move data in. And in production, a Redis cache plus Langfuse traces so you can see every step.
That combo ships fast and stays stable. The founder consequence is simple. You are not betting the quarter on a research-grade architecture that one engineer understands. You are running boring, well-known parts that your team can have running in minutes and your next hire already knows. Stable infrastructure is cheaper than clever infrastructure, every time.
How We Build It in One Sprint
The whole thing fits in a single sprint because we refuse to gold-plate. Here is the sequence.
- Infra ready. Docker up, health checks green, docs live. The team can hit endpoints within minutes, not after a week of setup tickets.
- Ingestion from your sources. Pick one internal source first. CRM, ERP, warehouse, Confluence, SharePoint, or S3. Airflow extracts, cleans, and normalizes it to a common document schema. One source, done properly, beats six sources done halfway.
- Index two ways. Exact match with BM25 for precise lookups. Chunks with embeddings for meaning. Running both is hybrid search, and hybrid widens coverage without the off-target passages, and the hallucinated answers built on them, that you get when semantic search runs alone.
- Retrieval logic that is boring and reliable. If the question is numeric or about IDs, use SQL or BM25. If it is fuzzy, use hybrid. Always add filters from the user context, so people only see what they are allowed to see.
- Answering that cites sources. Compose with the model only after retrieval. Show links and passages. No naked generations, ever.
- Production guardrails. Cache frequent queries. Trace every step. Log the misses and add new documents weekly.
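The retrieval rule in step four is small enough to write down. Here is a minimal sketch of it, assuming a regex is good enough to spot ID-like or numeric questions. The pattern, the `UserContext` class, and the shape of the request dict are all hypothetical names for illustration, not part of any library, and your own IDs will likely need a different pattern.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern for ID-like tokens such as "INV-20417" or "4471".
# Tune this to the identifiers your own systems actually use.
ID_PATTERN = re.compile(r"\b(?:[A-Z]{2,}-?\d+|\d{3,})\b")


@dataclass(frozen=True)
class UserContext:
    user_id: str
    access_scopes: frozenset


def choose_route(question: str) -> str:
    """Return 'exact' for ID or numeric lookups, 'hybrid' for everything else."""
    return "exact" if ID_PATTERN.search(question) else "hybrid"


def build_request(question: str, user: UserContext) -> dict:
    """Attach the route and the caller's access scopes to every request."""
    return {
        "route": choose_route(question),
        "query": question,
        # The filter is always present, so no route can skip access control.
        "filters": {"access_scope": sorted(user.access_scopes)},
    }
```

The design choice worth copying is that the access filter is built in the same function as the route. There is no code path that retrieves without it.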
Notice what carries the weight here. The model shows up at step five, after the search has already found the right material. By then the hard work is done. The LLM is just writing a clean answer over facts you already trust.
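To make "no naked generations" concrete, here is one way the compose step could look. It is a sketch under assumptions: the `Passage` class, the prompt wording, and the example URL are ours for illustration, and the actual model call is left out because it depends on your provider.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Passage:
    text: str
    url: str


def build_prompt(question: str, passages: List[Passage]) -> Optional[str]:
    """Build a grounded prompt, or return None when retrieval found nothing."""
    if not passages:
        return None  # no sources, so no generation
    sources = "\n".join(
        f"[{i}] {p.text} (source: {p.url})"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the numbered sources below. "
        "Cite each claim as [n]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
```

When `build_prompt` returns `None`, the caller should tell the user nothing was found and log the miss, rather than letting the model answer from memory.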
Why BM25 Leading Beats a Smarter Model
The single most important call in this whole design is letting BM25 lead. Keyword search is precise, fast, and explainable. When someone asks for an invoice number, a part code, or a customer ID, you do not want a probabilistic guess from an embedding. You want the exact row. Embeddings earn their place on fuzzy questions where meaning matters more than the literal words, and not before.
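If BM25 feels like a black box, it is not. The toy scorer below is a sketch of the Okapi BM25 formula in plain Python, using the commonly cited defaults of k1 = 1.2 and b = 0.75 and a naive whitespace tokenizer. It is there to show why an exact token like an invoice number lands on exactly one document. In production you would let OpenSearch do this, not your own code.

```python
import math
from collections import Counter


def tokenize(text: str) -> list:
    # Naive tokenizer for illustration: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query: str, docs: list, k1: float = 1.2, b: float = 0.75) -> list:
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a term the document lacks contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Score the query `inv-20417` against a handful of invoice records and only the record containing that token gets a non-zero score. Every other document scores zero, and you can explain why term by term.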
This ordering is what cuts hallucinations at the root. The model rarely has to invent anything, because retrieval handed it the real passage with a link attached. For a founder, that turns into three concrete outcomes. Lower latency on the common questions your team asks all day. Fewer wrong answers, because the precise search runs first. And a clear playbook to plug any internal database into the same pipeline, so the second system costs a fraction of the first.
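When a fuzzy question does take the hybrid path, the keyword ranking and the embedding ranking have to be merged somehow. This guide does not prescribe a method, so treat the following as one reasonable option: reciprocal rank fusion, sketched here with the conventional constant k = 60. It only needs the two ranked lists of document IDs, not their raw scores, which is convenient because BM25 scores and vector similarities are not on the same scale.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge ranked lists of document IDs; earlier ranks contribute more."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

A document that both searches rank highly rises to the top. A document only one search found still makes the list, which is where the extra coverage comes from.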
The Starter Checklist
If you want to hand your team a one-page plan, this is it. Keep each rule to one thing so it survives contact with a real backlog.
- One schema for documents. Title, body, source, timestamps, access scope. Nothing more.
- One ingestion DAG per system. Small and testable, so a broken source never takes down the rest.
- One hybrid index per domain. Keep the chunk size consistent, or your results drift.
- One routing rule for tool choice. BM25 first, hybrid when needed.
- One cache policy. Set the TTL by business value, not by habit.
- One review loop. Add missed documents, update synonyms, and retrain embeddings only when the metrics stall.
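The first rule on that list fits in a dozen lines. Below is a sketch of the shared document schema plus one normalizer. The five fields come straight from the checklist. The CRM field names (`subject`, `text`, `created`, `modified`, `teams`) are invented for the example, so swap in whatever your source actually exports.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Document:
    title: str
    body: str
    source: str
    created_at: datetime
    updated_at: datetime
    access_scope: tuple


def from_crm_note(raw: dict) -> Document:
    """Map one hypothetical CRM record onto the shared schema."""
    return Document(
        title=raw["subject"].strip(),
        body=raw["text"].strip(),
        source="crm",
        # Assumes the source stores Unix timestamps in seconds.
        created_at=datetime.fromtimestamp(raw["created"], tz=timezone.utc),
        updated_at=datetime.fromtimestamp(raw["modified"], tz=timezone.utc),
        access_scope=tuple(sorted(raw["teams"])),
    )
```

Each new source gets its own small `from_...` function in its own DAG. The index and the retrieval code only ever see `Document`.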
That last point is where most teams overspend. Retraining embeddings is the expensive move, so you only reach for it when the numbers actually stop improving. Everything cheaper comes first.
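"When the metrics stall" works better as a rule than as a feeling. One way to pin it down, assuming you log a weekly retrieval hit rate from your traces, is a check like the one below. The three-week window and the one-point threshold are placeholders we picked for the sketch, not recommendations.

```python
def has_stalled(weekly_hit_rates: list, window: int = 3, min_gain: float = 0.01) -> bool:
    """True when the metric gained less than min_gain over the last `window` weeks."""
    if len(weekly_hit_rates) <= window:
        return False  # not enough history to judge
    gain = weekly_hit_rates[-1] - weekly_hit_rates[-1 - window]
    return gain < min_gain
```

While this returns `False`, keep adding missed documents and synonyms. Only when it flips to `True` is the embedding work worth pricing out.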
The Takeaway
If your RAG is underperforming, resist the urge to buy a bigger model. Look at the search layer instead. Get BM25 leading, add embeddings only where they pay off, cite every source, and log what you miss. Do that and you get a system that answers faster, lies less, and copies cleanly onto the next database. The decision in front of you is not which model to license. It is whether you are willing to fix the boring parts first.
If you want a fast read on where you stand before you spend anything, run a free AI readiness audit of your site at https://readiness.ai4.sale. It takes a few minutes and gives you a concrete picture of what is ready and what is not, which is the right place to start before you build.