2025 · Senior Engineer · GenAI / Frontend
RAG-powered conversational commerce widget
Built a streaming React widget on top of a hybrid retrieval pipeline — dense + sparse + RRF, HyDE rewriting, constrained generation — that turns vague shopping intent into a curated, explainable product set.
- GenAI
- RAG
- React
- Streaming UI
- Embeddings
The brief
The shopping surface worked beautifully for users who knew exactly what they wanted, and failed almost entirely for users who didn't. The catalog was rich and the filters were well-built, but the search bar was a decade-old keyword-matching system. Typing “something to wear to a beach wedding in October” got you either an empty page or a flood of unrelated results.
I built the conversational layer end-to-end — the retrieval pipeline, the React widget, the streaming UI, and the evaluation harness — as a prototype the team could iterate on instead of a vendor box we couldn't open.
What I shipped
The widget. A streaming chat surface that renders LLM tokens as they arrive, with product cards that hydrate progressively from retrieval results. Suspense boundaries for the retrieval phase, streamed text from a route handler, and cancellation handling so a follow-up query doesn't race the previous one.
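A minimal sketch of the cancellation pattern described above, using only the standard `fetch`, `ReadableStream`, `AbortController`, and `TextDecoder` APIs. The names `LatestRequest`, `readTextStream`, and `ask`, and the `/api/chat` endpoint, are hypothetical and not taken from the project. The sketch reads raw text chunks and does not parse SSE frames.

```typescript
// Hypothetical helper: keeps at most one in-flight request alive.
class LatestRequest {
  private controller: AbortController | null = null;

  // Abort whatever is in flight and hand back a fresh signal.
  next(): AbortSignal {
    this.controller?.abort();
    this.controller = new AbortController();
    return this.controller.signal;
  }
}

// Read a streamed text body chunk by chunk, stopping quietly if superseded.
async function readTextStream(
  body: ReadableStream<Uint8Array>,
  signal: AbortSignal,
  onText: (chunk: string) => void,
): Promise<void> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  try {
    while (!signal.aborted) {
      const { done, value } = await reader.read();
      if (done) break;
      // Re-check after the await: a newer query may have started meanwhile.
      if (signal.aborted) break;
      onText(decoder.decode(value, { stream: true }));
    }
  } finally {
    // Release the underlying stream whether we finished or were cancelled.
    await reader.cancel().catch(() => {});
  }
}

// Usage sketch: "/api/chat" is a placeholder endpoint (an assumption).
async function ask(
  query: string,
  requests: LatestRequest,
  onText: (chunk: string) => void,
): Promise<void> {
  const signal = requests.next(); // cancels the previous query, if any
  try {
    const res = await fetch("/api/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query }),
      signal,
    });
    if (res.body) await readTextStream(res.body, signal, onText);
  } catch (err) {
    // An aborted fetch rejects with an AbortError; that is expected here.
    if ((err as Error).name !== "AbortError") throw err;
  }
}
```

In a React component, one `LatestRequest` instance would typically live in a ref so that each new submit aborts the previous stream before starting its own.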
Hybrid retrieval. Dense embeddings over product descriptions plus sparse BM25 over structured attributes (category, colour, season, price band). Two signals catch different things — semantic similarity vs. exact attribute match — and neither is sufficient alone.
HyDE query rewriting. For vague queries, the system generates a hypothetical product description first, then retrieves against that. Cost: one extra LLM call. Payoff: dramatic recall improvement on intent-style queries that don't share vocabulary with the catalog.
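The HyDE step can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the prompt wording is invented, and `generate`, `embed`, and `search` are hypothetical injected functions standing in for the LLM call, the embedding call, and the vector search.

```typescript
// Hypothetical dependency signatures; real clients would be wrapped to fit them.
type Generate = (prompt: string) => Promise<string>;
type Embed = (text: string) => Promise<number[]>;
type VectorSearch = (vector: number[], k: number) => Promise<string[]>;

interface HydeDeps {
  generate: Generate;
  embed: Embed;
  search: VectorSearch;
}

// Build the rewrite prompt. The wording is an assumption for illustration.
function buildHydePrompt(query: string): string {
  return [
    "Write a short catalog-style description of one product that would satisfy this shopper.",
    "Mention category, colour, season, and material where they can be inferred.",
    `Shopper request: ${query.trim()}`,
  ].join("\n");
}

// Retrieve against the hypothetical description instead of the raw query.
async function hydeRetrieve(
  query: string,
  k: number,
  deps: HydeDeps,
): Promise<string[]> {
  const hypothetical = await deps.generate(buildHydePrompt(query));
  // Fall back to the raw query if the model returns nothing usable.
  const textToEmbed = hypothetical.trim() || query;
  const vector = await deps.embed(textToEmbed);
  return deps.search(vector, k);
}
```

The point of the indirection is that the hypothetical description is written in catalog vocabulary, so its embedding lands closer to real product descriptions than the embedding of the shopper's own phrasing would.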
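Reciprocal rank fusion. The dense and sparse result lists are merged by rank, not by score. Vector similarity scores and BM25 scores sit on different scales and are not directly comparable, so fusing on rank position keeps the merge robust to that difference.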
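A minimal sketch of reciprocal rank fusion over lists of product IDs. Each document scores the sum of 1 / (k + rank) across the lists it appears in. The constant `k = 60` is a commonly used default and an assumption here; the value used in this project is not stated. The function name and the sample IDs are illustrative.

```typescript
// Fuse several ranked ID lists into one, best first.
// Assumes each input list contains no duplicate IDs.
function reciprocalRankFusion(lists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Sort by fused score, highest first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Usage: "b" ranks well in both lists, so it wins overall.
const dense = ["a", "b", "c"];
const sparse = ["b", "d", "a"];
const fused = reciprocalRankFusion([dense, sparse]);
// fused is ["b", "a", "d", "c"]
```

Only rank positions enter the calculation, so neither retriever's raw scores need normalising before the merge.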
Constrained generation. The LLM only ever sees retrieved products and is prompted to recommend from that set, with reasoning. No free-form product invention. Every recommendation cites the product ID it came from, which is what makes the whole pipeline evaluable.
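Because every recommendation carries a product ID, checking the constraint reduces to a set-membership test. The sketch below assumes the model's output has already been parsed into `{ productId, reason }` objects. That shape, the function name, and the definition of citation accuracy as the grounded fraction are all assumptions for illustration.

```typescript
// Hypothetical shape of one parsed recommendation.
interface Recommendation {
  productId: string;
  reason: string;
}

interface CitationCheck {
  grounded: Recommendation[];   // cite a product that was actually retrieved
  ungrounded: Recommendation[]; // cite an ID outside the retrieved set
  citationAccuracy: number;     // grounded / total, in [0, 1]
}

// Split recommendations by whether their cited ID was in the retrieved set.
function checkCitations(
  recs: Recommendation[],
  retrievedIds: string[],
): CitationCheck {
  const allowed = new Set(retrievedIds);
  const grounded = recs.filter((r) => allowed.has(r.productId));
  const ungrounded = recs.filter((r) => !allowed.has(r.productId));
  // With no recommendations there is nothing to get wrong; treat as 1.
  const citationAccuracy = recs.length === 0 ? 1 : grounded.length / recs.length;
  return { grounded, ungrounded, citationAccuracy };
}

// Usage: "p9" was never retrieved, so it is flagged.
const check = checkCitations(
  [
    { productId: "p1", reason: "Linen, suits warm weather" },
    { productId: "p9", reason: "Invented by the model" },
  ],
  ["p1", "p2", "p3"],
);
// check.citationAccuracy is 0.5
```

A check like this can run both at request time, to drop ungrounded items before rendering, and offline over a golden set.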
Stack
- Embeddings: text-embedding-3-large (OpenAI); 3072-dim, evaluated against e5-large-v2
- Vector store: Postgres + pgvector; HNSW index, same DB as the catalog
- Sparse index: Elasticsearch · BM25; tuned on category + attribute fields
- LLM (gen): Claude 3.5 Sonnet; constrained to retrieved-set selection
- LLM (HyDE): Claude 3.5 Haiku; cheaper, fast enough for rewrite
- Frontend: React · Next.js · SSE streaming; ReadableStream + AbortController
- Eval harness: Custom golden-set runner; recall@k, MRR, citation accuracy
What this unlocked
- A streaming surface that felt like talking to a stylist — not filling in a form.
- A model-agnostic retrieval stack — swapping the embedding model or the LLM is a configuration change, not a re-architecture.
- A clear separation between the retrieval-quality problem and the generation-quality problem, so the team could improve each independently.
- A demo that turned a vague product brief into something stakeholders could actually steer.
Lessons I keep coming back to
The hard part of a RAG system isn't the LLM. It's the retrieval pipeline, the streaming UI, and the evaluation harness.
- Hybrid retrieval beats either side alone. Dense embeddings catch semantic similarity; sparse retrieval catches exact attribute matches. Production systems need both.
- HyDE is a cheat code for vague queries. Letting the LLM rewrite the query before retrieval is one of the highest-leverage moves in the pipeline.
- Constrain the generation layer. The LLM should pick from a set, not invent the set. The moment you let it generate freely, you've given up the ability to evaluate.
- Streaming UIs are a UX problem. Tokens arriving one at a time is the easy part. Cancellation, partial product cards, late-arriving retrieval results, follow-up queries that race the previous one — that's where the React work lives.
- Evaluation harnesses are the deliverable. A RAG system without a golden set is a vibes-driven engineering project. With one, every improvement is a measurable change against a known baseline.
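To make the golden-set idea concrete, here is a minimal sketch of two of the retrieval metrics named in the stack, recall@k and MRR. The `GoldenRun` shape and function names are hypothetical, and the sketch assumes each `relevant` list contains unique IDs.

```typescript
// One golden-set case after retrieval has run (hypothetical shape).
interface GoldenRun {
  retrieved: string[]; // ranked IDs returned by the pipeline, best first
  relevant: string[];  // IDs a human marked as correct for the query
}

// Fraction of relevant IDs that appear in the top k retrieved results.
function recallAtK(retrieved: string[], relevant: string[], k: number): number {
  if (relevant.length === 0) return 0;
  const topK = new Set(retrieved.slice(0, k));
  const hits = relevant.filter((id) => topK.has(id)).length;
  return hits / relevant.length;
}

// 1 / rank of the first relevant result, or 0 if none was retrieved.
function reciprocalRank(retrieved: string[], relevant: string[]): number {
  const relevantSet = new Set(relevant);
  const index = retrieved.findIndex((id) => relevantSet.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

// Mean reciprocal rank across all golden-set cases.
function meanReciprocalRank(runs: GoldenRun[]): number {
  if (runs.length === 0) return 0;
  const total = runs.reduce(
    (sum, run) => sum + reciprocalRank(run.retrieved, run.relevant),
    0,
  );
  return total / runs.length;
}

// Usage with two toy cases.
const runs: GoldenRun[] = [
  { retrieved: ["a", "b", "c", "d"], relevant: ["b", "d"] },
  { retrieved: ["x", "y"], relevant: ["x"] },
];
const recall = recallAtK(runs[0].retrieved, runs[0].relevant, 2); // 0.5
const mrr = meanReciprocalRank(runs); // (1/2 + 1) / 2 = 0.75
```

Running these over the same golden set before and after a change is what turns "the results feel better" into a comparison against a known baseline.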