RAG-powered conversational commerce widget

The brief

A shopping surface that worked beautifully for users who knew exactly what they wanted, and failed almost entirely for users who didn't. The catalog was rich, the filters were well-built, and the search bar was a keyword-matching system from a decade ago. Typing “something to wear to a beach wedding in October” got you either an empty page or a flood of unrelated results.

I built the conversational layer end-to-end — the retrieval pipeline, the React widget, the streaming UI, and the evaluation harness — as a prototype the team could iterate on instead of a vendor box we couldn't open.

Figure: Hybrid retrieval pipeline. The LLM picks from a constrained set; it never invents the set.

What I shipped

The widget. A streaming chat surface that renders LLM tokens as they arrive, with product cards that hydrate progressively from retrieval results. Suspense boundaries for the retrieval phase, streamed text from a route handler, and cancellation handling so a follow-up query doesn't race the previous one.
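The cancellation handling can be sketched as a small "latest query wins" helper. This is a minimal sketch, assuming one AbortController per in-flight query (the stack lists AbortController, but not how it is wired up); the class name `QuerySession` and its methods are hypothetical.

```typescript
// Hypothetical helper: keeps at most one query in flight.
class QuerySession {
  private current: AbortController | null = null;

  // Abort whatever is still running, then hand back a fresh signal
  // for the new query.
  start(): AbortSignal {
    this.current?.abort();
    this.current = new AbortController();
    return this.current.signal;
  }

  // Cancel the in-flight query, e.g. when the widget unmounts.
  cancel(): void {
    this.current?.abort();
    this.current = null;
  }
}

const session = new QuerySession();
const first = session.start();
const second = session.start(); // the follow-up supersedes the first query
```

Passing the returned signal to `fetch(url, { signal })` makes a superseded request reject with an `AbortError`, so stale tokens from the earlier query never reach the UI.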

Hybrid retrieval. Dense embeddings over product descriptions plus sparse BM25 over structured attributes (category, colour, season, price band). Two signals catch different things — semantic similarity vs. exact attribute match — and neither is sufficient alone.

HyDE query rewriting. For vague queries, the system first generates a hypothetical product description, then retrieves against that description instead of the raw query. Cost: one extra LLM call. Payoff: dramatic recall improvement on intent-style queries that don't share vocabulary with the catalog.
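The HyDE step can be sketched with the model, embedder, and vector search injected as plain functions. This is a sketch under stated assumptions: the `Generate`, `Embed`, and `VectorSearch` signatures, the prompt wording, and the function names are all hypothetical stand-ins, not the production clients.

```typescript
// Hypothetical dependency signatures; the real clients are not shown here.
type Generate = (prompt: string) => Promise<string>;
type Embed = (text: string) => Promise<number[]>;
type VectorSearch = (vector: number[], k: number) => Promise<string[]>;

// Ask the model to describe a product that would satisfy the query.
function buildHydePrompt(query: string): string {
  return [
    "Write a short catalog-style description of a product that would satisfy this request.",
    "Use the vocabulary a product listing would use.",
    `Request: ${query}`,
  ].join("\n");
}

// HyDE: embed the hypothetical description instead of the raw query,
// then run the dense search with that embedding.
async function hydeRetrieve(
  query: string,
  deps: { generate: Generate; embed: Embed; search: VectorSearch },
  k = 10,
): Promise<string[]> {
  const hypothetical = await deps.generate(buildHydePrompt(query));
  const vector = await deps.embed(hypothetical);
  return deps.search(vector, k);
}
```

Injecting the three functions keeps the rewrite step testable with stubs and lets the cheaper rewrite model be swapped without touching the retrieval code.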

Reciprocal rank fusion. The dense and sparse result lists are merged by rank, not by score — robust to the scale differences between the two systems.
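A minimal sketch of the fusion step: each list contributes 1 / (k + rank) for every item it contains, and the contributions are summed per product. The constant k = 60 is the commonly used default and is an assumption here, as are the function name and the sample product IDs.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per item,
// with rank starting at 1. k = 60 is a common default, assumed here.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

const dense = ["p1", "p2", "p3"]; // hypothetical dense (embedding) ranking
const sparse = ["p3", "p1", "p4"]; // hypothetical sparse (BM25) ranking
const fused = reciprocalRankFusion([dense, sparse]);
// fused is ["p1", "p3", "p2", "p4"]
```

Because only ranks enter the sum, a cosine similarity of 0.83 and a BM25 score of 14.2 never need to be put on a common scale.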

Constrained generation. The LLM only ever sees retrieved products and is prompted to recommend from that set, with reasoning. No free-form product invention. Every recommendation cites the product ID it came from, which is what makes the whole pipeline evaluable.
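The citation requirement is what makes a mechanical check possible: any recommended ID that is not in the retrieved set is a failure. A minimal sketch, assuming the model's output has already been parsed into a list of recommendations; the `Recommendation` shape and `checkCitations` name are hypothetical.

```typescript
// Hypothetical shape of one parsed recommendation: the cited product ID
// plus the model's reasoning.
interface Recommendation {
  productId: string;
  reason: string;
}

// Split recommendations into those that cite a retrieved product
// and those that cite something outside the retrieved set.
function checkCitations(
  recommendations: Recommendation[],
  retrievedIds: string[],
): { valid: Recommendation[]; invalid: Recommendation[] } {
  const allowed = new Set(retrievedIds);
  const valid: Recommendation[] = [];
  const invalid: Recommendation[] = [];
  for (const rec of recommendations) {
    (allowed.has(rec.productId) ? valid : invalid).push(rec);
  }
  return { valid, invalid };
}

const citationCheck = checkCitations(
  [
    { productId: "p1", reason: "Lightweight linen suits a warm beach ceremony." },
    { productId: "p9", reason: "Not in the retrieved set, so it is rejected." },
  ],
  ["p1", "p2", "p3"],
);
```

The share of `valid` entries over all recommendations gives one simple way to compute a citation-accuracy number for the eval harness.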

Figure: Recall@10 across query types, sparse-only baseline vs. hybrid retrieval with HyDE. The win is in the queries users actually struggle with.

Stack

Embeddings | text-embedding-3-large (OpenAI) | 3072-dim, evaluated against e5-large-v2
Vector store | Postgres + pgvector | HNSW index, same DB as the catalog
Sparse index | Elasticsearch · BM25 | tuned on category + attribute fields
LLM (gen) | Claude 3.5 Sonnet | constrained to retrieved-set selection
LLM (HyDE) | Claude 3.5 Haiku | cheaper, fast enough for rewrite
Frontend | React · Next.js · SSE streaming | ReadableStream + AbortController
Eval harness | Custom golden-set runner | recall@k, MRR, citation accuracy

Tradeoffs chosen for evaluability — every layer is swappable behind its contract.

What this unlocked

A streaming surface that felt like talking to a stylist — not filling in a form.
A model-agnostic retrieval stack — swapping the embedding model or the LLM is a configuration change, not a re-architecture.
A clear separation between the retrieval-quality problem and the generation-quality problem, so the team could improve each independently.
A demo that turned a vague product brief into something stakeholders could actually steer.

Lessons I keep coming back to

The hard part of a RAG system isn't the LLM. It's the retrieval pipeline, the streaming UI, and the evaluation harness.

Hybrid retrieval beats either side alone. Dense embeddings catch semantic similarity; sparse retrieval catches exact attribute matches. Production systems need both.
HyDE is a cheat code for vague queries. Letting the LLM rewrite the query before retrieval is one of the highest-leverage moves in the pipeline.
Constrain the generation layer. The LLM should pick from a set, not invent the set. The moment you let it generate freely, you've given up the ability to evaluate.
Streaming UIs are a UX problem. Tokens arriving one at a time is the easy part. Cancellation, partial product cards, late-arriving retrieval results, follow-up queries that race the previous one — that's where the React work lives.
Evaluation harnesses are the deliverable. A RAG system without a golden set is a vibes-driven engineering project. With one, every improvement is a measurable change against a known baseline.
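The two ranking metrics a golden-set runner reports can be sketched in a few lines. This is an illustrative sketch, not the harness itself: the `GoldenCase` fields, the function names, and the two sample cases are hypothetical.

```typescript
// One golden-set case: a query, the product IDs judged relevant, and the
// ranked IDs the pipeline returned. Field names are hypothetical.
interface GoldenCase {
  query: string;
  relevant: string[];
  retrieved: string[];
}

// Fraction of the relevant products that appear in the top k results.
function recallAtK(c: GoldenCase, k: number): number {
  if (c.relevant.length === 0) return 0;
  const topK = new Set(c.retrieved.slice(0, k));
  const hits = c.relevant.filter((id) => topK.has(id)).length;
  return hits / c.relevant.length;
}

// 1 / rank of the first relevant result, or 0 if none was retrieved.
function reciprocalRank(c: GoldenCase): number {
  const relevant = new Set(c.relevant);
  const index = c.retrieved.findIndex((id) => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

// Average a per-case metric over the whole golden set.
function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;
}

const goldenSet: GoldenCase[] = [
  { query: "beach wedding in October", relevant: ["p1", "p4"], retrieved: ["p2", "p1", "p3"] },
  { query: "red linen shirt", relevant: ["p7"], retrieved: ["p7", "p8"] },
];

const recallAt10 = mean(goldenSet.map((c) => recallAtK(c, 10))); // 0.75
const mrr = mean(goldenSet.map((c) => reciprocalRank(c))); // 0.75
```

Running the same golden set before and after a change (a new embedding model, a different fusion constant) turns each tweak into a comparable pair of numbers.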
