Sep 2025 • ML & Backend • 2 min read
Multilingual Multi-Hop RAG
Local-first RAG with pgvector HNSW, BGE-M3, and multilingual evals.
Dockerized • RAG • <5s p95 • Multilingual
Python • FastAPI • pgvector • Ollama • RAG
Outcomes
- Latency p50 2.8s / p95 4.9s
- +18% Recall@5 vs baseline
Problem
Customer support teams needed high-quality localized answers faster than hosted LLM SaaS could deliver. The incumbent pipeline combined basic keyword search with manual translation passes, taking ~35 seconds per query and missing domain-specific synonyms across Spanish and Portuguese tickets.
Approach
I built a local-first retrieval stack:
- FastAPI orchestrates ingestion, chunking, and retrieval with async workers.
- BGE-M3 embeddings stored in pgvector behind an HNSW index, keeping p95 latency under 5s on commodity hardware (schema sketch after this list).
- An adaptive prompt chain runs multilingual multi-hop reasoning: translate → retrieve → cite → synthesize (chain sketch below).
- Eval harness (Python + Pandas) measures recall@k and BLEU across Spanish/English corpora using automatically generated contrastive questions (metric sketch below).
- Infra is packaged via Docker Compose with an optional GPU path; GitHub Actions runs nightly regression evals.
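
A minimal sketch of the storage layer, assuming the FlagEmbedding wrapper for BGE-M3 and psycopg 3 with the pgvector adapter; the `chunks` table, column names, and DSN are illustrative rather than the project's actual schema, and `m`/`ef_construction` are pgvector's defaults, not tuned values:

```python
# Sketch: BGE-M3 dense vectors in pgvector behind an HNSW index.
# Table/column names (chunks, body, embedding) and the DSN are placeholders.
import psycopg
from pgvector.psycopg import register_vector
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)  # 1024-dim dense output

conn = psycopg.connect("dbname=rag", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # lets psycopg send/receive vectors as numpy arrays
conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        body text NOT NULL,
        embedding vector(1024)
    )
""")
# HNSW is slower to build than IVFFlat but needs no training step and reads
# fast, which suits incremental ticket ingestion.
conn.execute("""
    CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
""")

def top_k(query: str, k: int = 5) -> list[tuple[int, str]]:
    """Return the k nearest chunks by cosine distance."""
    vec = model.encode([query])["dense_vecs"][0]
    return conn.execute(
        "SELECT id, body FROM chunks ORDER BY embedding <=> %s LIMIT %s",
        (vec, k),
    ).fetchall()
```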
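The prompt chain itself condenses to a few calls against a local model. A sketch assuming the official `ollama` Python client and the hypothetical `top_k()` retriever above; the model name and prompt wording are placeholders, not the production chain:

```python
# Sketch of the multi-hop chain: translate -> retrieve -> cite -> synthesize.
# Assumes the `ollama` Python client and the top_k() retriever sketched above;
# MODEL and the prompts are illustrative placeholders.
import ollama

MODEL = "llama3.1"  # placeholder; any locally pulled Ollama model

def ask(prompt: str) -> str:
    resp = ollama.chat(model=MODEL, messages=[{"role": "user", "content": prompt}])
    return resp["message"]["content"]

def answer(question: str) -> str:
    # Hop 1: normalize the query to English so one index serves all locales.
    english_q = ask(f"Translate to English, output only the translation:\n{question}")
    # Hop 2: retrieve supporting chunks from pgvector.
    chunks = top_k(english_q, k=5)
    context = "\n".join(f"[{cid}] {body}" for cid, body in chunks)
    # Hops 3 + 4: answer in the user's language, citing chunk ids inline.
    return ask(
        "Answer in the language of the original question, citing sources as [id].\n"
        f"Original question: {question}\n\nSources:\n{context}"
    )
```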
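On the eval side, recall@k reduces to a per-question membership check. A toy sketch with hypothetical columns (`gold_id`, `retrieved_ids`); the real harness also scores BLEU and generates the contrastive questions:

```python
# Sketch: recall@k over an eval set. Column names (question, gold_id,
# retrieved_ids) are hypothetical, not the project's actual schema.
import pandas as pd

def recall_at_k(df: pd.DataFrame, k: int = 5) -> float:
    """Fraction of questions whose gold chunk appears in the top-k results."""
    hits = df.apply(lambda row: row["gold_id"] in row["retrieved_ids"][:k], axis=1)
    return float(hits.mean())

evals = pd.DataFrame({
    "question": ["¿Cómo reinicio el router?", "How do I reset my password?"],
    "gold_id": [12, 7],
    "retrieved_ids": [[12, 3, 44, 9, 1], [5, 7, 2, 8, 6]],
})
print(f"recall@5 = {recall_at_k(evals, k=5):.2f}")  # 1.00 on this toy set
```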
Results
- Latency p50 2.8s and p95 4.9s on a Ryzen 7 + RTX 3060 workstation.
- Recall@5 improved 18% over the keyword baseline, with automated alerts on drift.
- Delivered an offline-first workflow enabling on-prem deployments in LATAM markets.