Designing AI Evaluation Frameworks for Regulated Industries
How we built the eval framework that gates every AI capability at LATAM before production.
Nicolás Venegas
At LATAM Airlines, my team and I built Cosmos — an internal AI platform where every capability had to earn its way into production through rigorous evaluation, safety guardrails, and human-in-the-loop design.
LATAM Airlines needed to go from zero AI capabilities to production-grade LLM applications serving millions of passengers. We designed and built Cosmos to close that gap: RAG pipelines, evaluation frameworks, model routing, and agentic architectures on one internal platform.
I drove the key architecture decisions: choosing LightRAG for retrieval, designing the evaluation framework that gates every capability before production, and defining the model routing strategy for cost optimization. The platform now handles 779K+ RAG calls in production across 7+ AI capabilities shipped in under 2 years.
779K+
RAG calls in production
7+
AI capabilities shipped
2 yrs
Zero to platform at scale
50K+
Pages indexed (aviation manuals)
Internal AI platform for LATAM Airlines
LATAM had zero AI infrastructure — no pipelines, no evaluation, no model management. Every team was experimenting independently with no path to production.
LightRAG for retrieval, GenAI Gateway for model routing, centralized evaluation framework, agentic orchestration layer. Built as a platform so every team ships through the same production-grade pipeline.
779K+ RAG calls in production, 7+ AI capabilities shipped, 50K+ aviation manual pages indexed. Platform serves millions of passengers.
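A minimal sketch of the cost-aware routing idea behind a gateway like this, in Python. The tier names, prices, and complexity heuristic are all illustrative assumptions, not the actual GenAI Gateway logic:

```python
# Hypothetical sketch of cost-aware model routing. Model names, prices,
# and the complexity heuristic are illustrative, not production values.
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_1k_tokens: float  # USD, illustrative
    max_complexity: int        # highest task complexity this tier handles

# Cheapest-first list; the router picks the first tier that can handle
# the request's estimated complexity.
TIERS = [
    ModelTier("small-fast", 0.0002, max_complexity=1),
    ModelTier("mid-general", 0.003, max_complexity=2),
    ModelTier("large-reasoning", 0.015, max_complexity=3),
]

def estimate_complexity(prompt: str, needs_tools: bool) -> int:
    """Crude complexity score: tool use and long prompts push upward."""
    score = 1
    if needs_tools:
        score += 1
    if len(prompt) > 2000:
        score += 1
    return min(score, 3)

def route(prompt: str, needs_tools: bool = False) -> str:
    """Return the name of the cheapest tier that can handle the request."""
    complexity = estimate_complexity(prompt, needs_tools)
    for tier in TIERS:
        if tier.max_complexity >= complexity:
            return tier.name
    return TIERS[-1].name
```

The design choice worth noting: routing on an upfront complexity estimate keeps simple FAQ-style traffic on cheap models, which is where most of the cost savings at RAG-call volume comes from.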
Evaluation framework for LLM applications in aviation
Aviation is a regulated industry where hallucinations aren't just annoying — they're dangerous. There was no systematic way to measure whether an LLM capability was safe to deploy.
Multi-dimensional evaluation pipeline: hallucination detection, faithfulness scoring, safety checks, cost-per-conversation metrics. Runs on every capability before production deployment.
Every AI capability at LATAM passes through this framework before reaching production. Catches critical failures before they reach passengers.
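A minimal sketch of what such a gate can look like, assuming the metric names above and invented thresholds; the real framework's checks and limits are not shown here:

```python
# Hypothetical sketch of a pre-production evaluation gate. Metric names
# mirror the dimensions described above; all thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class EvalReport:
    hallucination_rate: float     # fraction of responses with unsupported claims
    faithfulness: float           # 0..1 agreement with retrieved sources
    safety_violations: int        # count of flagged responses
    cost_per_conversation: float  # USD

@dataclass
class Gate:
    max_hallucination_rate: float = 0.01
    min_faithfulness: float = 0.95
    max_safety_violations: int = 0
    max_cost_per_conversation: float = 0.25

    def evaluate(self, report: EvalReport) -> list[str]:
        """Return the list of failed checks; empty means the capability ships."""
        failures = []
        if report.hallucination_rate > self.max_hallucination_rate:
            failures.append("hallucination")
        if report.faithfulness < self.min_faithfulness:
            failures.append("faithfulness")
        if report.safety_violations > self.max_safety_violations:
            failures.append("safety")
        if report.cost_per_conversation > self.max_cost_per_conversation:
            failures.append("cost")
        return failures
```

Returning the full list of failures, rather than a boolean, matters in practice: a team fixing a capability needs to know every dimension that blocked it, not just the first one.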
Agentic customer service chatbot with tool use and human-in-the-loop
Customer service needed AI-powered assistance across multiple languages, but agentic systems fail in unpredictable ways — especially in high-stakes travel scenarios.
Agentic architecture with tool use for booking operations, multi-language support, human-in-the-loop escalation paths, and structured fallback strategies.
Serving LATAM passengers with AI-powered assistance while maintaining safety through structured escalation and human oversight.
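A minimal sketch of the escalation decision, assuming invented intent names, triggers, and confidence thresholds; the actual chatbot's policy is more involved:

```python
# Hypothetical sketch of human-in-the-loop escalation for an agentic
# chatbot. Intent names, triggers, and thresholds are illustrative.
ESCALATION_TRIGGERS = {"refund_dispute", "irregular_operations", "complaint"}

def next_action(intent: str, confidence: float, tool_errors: int) -> str:
    """Decide whether the agent answers, asks, or hands off to a human."""
    if intent in ESCALATION_TRIGGERS:
        return "escalate_to_human"   # high-stakes intents always get a person
    if tool_errors >= 2:
        return "escalate_to_human"   # structured fallback after repeated failures
    if confidence < 0.7:
        return "clarify_with_user"   # ask before acting on a shaky interpretation
    return "answer_with_tools"
```

The ordering encodes the safety posture: high-stakes intents and repeated tool failures escalate unconditionally, before the agent is ever allowed to act on its own confidence.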
Connecting legacy airline systems to Claude via Model Context Protocol
Legacy airline systems hold critical data but expose it through outdated interfaces that modern AI tools can't access natively.
Model Context Protocol server bridging Claude to airline reservation, flight, and operations systems.
In development — will include public GitHub repository.
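A minimal sketch of the bridging idea, not the MCP SDK itself: each legacy operation becomes a named tool with a schema, and tool calls dispatch to a legacy-system adapter. The tool name, fields, and stand-in adapter below are all hypothetical:

```python
# Hypothetical sketch of the idea behind an MCP bridge: legacy system
# operations exposed as named tools, with calls dispatched to adapters.
# A real server would use the official MCP SDK; everything here is a
# stand-in for illustration.

def lookup_reservation(record_locator: str) -> dict:
    """Stand-in adapter for a legacy reservation-system lookup."""
    return {"record_locator": record_locator, "status": "CONFIRMED"}

# Tool registry: the model sees the name, description, and input schema;
# the handler stays server-side against the legacy system.
TOOLS = {
    "lookup_reservation": {
        "description": "Fetch a booking by its record locator",
        "input_schema": {"record_locator": "string"},
        "handler": lookup_reservation,
    },
}

def handle_tool_call(name: str, arguments: dict) -> dict:
    """Resolve a tool call against the registry and invoke its adapter."""
    tool = TOOLS.get(name)
    if tool is None:
        raise ValueError(f"unknown tool: {name}")
    return tool["handler"](**arguments)
```

The point of the pattern: the model only ever sees clean tool schemas, while all the awkwardness of the legacy interface lives inside the adapter.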
Upcoming technical posts:
How we built the eval framework that gates every AI capability at LATAM before production.
Real failure modes from building and deploying LATAM Chat.
Architecture deep dive into the Cosmos retrieval pipeline.
Upcoming
May 7–8, 2026 · Bogotá, Colombia
Cosmos case study at the first applied-AI business summit in LATAM: how LATAM Airlines went from zero AI to a platform at scale, covering business impact, metrics, and operational lessons for an enterprise audience.
Upcoming
May 2026 · San Francisco
Technical deep dive on AI evaluation and agentic architecture patterns in aviation — how we designed them and what broke.
Previous
Past conference talks will be added here. Previous speaking experience includes technology and data conferences across LATAM.