Daniel Cárdenas

Full-Stack Builder: Firmware · Web · AI

I design and ship end-to-end systems—from STM32 firmware and edge data capture to Go/Python backends and multilingual RAG experiences.

XPLevel 1
0 XP150 to level up
Sep 2025ML & Backend2 min read

Multilingual Multi-Hop RAG

Local-first RAG with pgvector HNSW, BGE-M3, multilingual evals.

DockerizedRAG <5s p95Multilingual
PythonFastAPIpgvectorOllamaRAG

Outcomes

  • Latency p50 2.8s / p95 4.9s
  • +18% Recall@5 vs baseline

Problem

Customer support teams needed localized answer quality faster than LLM SaaS allowed. The incumbent pipeline combined a basic keyword search with manual translation passes, taking ~35 seconds per query and missing domain-specific synonyms across Spanish and Portuguese tickets.

Approach

I built a local-first retrieval stack:

  • FastAPI orchestrates ingestion, chunking, and retrieval with async workers.
  • BGE-M3 embeddings stored in pgvector using HNSW for sub-5s p95 latency on commodity hardware.
  • Adaptive prompt chain executes multilingual multi-hop reasoning: translate → retrieve → cite → synthesize.
  • Eval harness (Python + Pandas) measures recall@k and BLEU across Spanish/English corpora with automatically generated contrastive questions.
  • Infra packaged via Docker Compose with GPU optional path; GitHub Actions runs nightly regression evals.

Results

  • Latency p50 2.8s and p95 4.9s on a Ryzen 7 + RTX 3060 workstation.
  • Recall@5 improved 18% over the keyword baseline, with automated alerts on drift.
  • Delivered an offline-first workflow enabling on-prem deployments in LATAM markets.