There's a fundamental difference between traditional ML and GenAI development: your AI applications don't get better over time. Whether you have one user or a billion, your LLM stays static, your prompts stay static, and your system learns nothing from production data unless you manually intervene.
This realization drove Gideon Mendels, CEO of Comet ML, to build solutions for this problem. "Every ML team I ever spoke to retrains their models with new production data," he explains in our latest conversation for AI Week. "But with these GenAI systems, there's no mechanism to learn from additional data."
From Spreadsheets to Automated Optimization
Gideon's path to solving this problem started 12 years ago when he moved from software engineering to ML at Google. Working on hate speech detection, he found ML teams managing everything with spreadsheets and emails, a stark contrast to the tooling software developers had access to.
That observation led to Comet in 2018, and more recently, to Opik, their open-source platform for building production-ready LLM applications. In nine months, it's grown to nearly 15,000 GitHub stars.
The Three-Step Path to Production AI
Gideon's suggested framework for successful GenAI applications has three components:
1. Instrument Everything
Wrap your OpenAI client or add a callback to whatever AI/LLM framework you're using. In about 30 seconds, you get complete observability: every input, output, tool call, and execution trace. This gives you a rich stream of data to start working with.
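Conceptually, the wrapping looks something like this. This is a minimal sketch, not Opik's actual SDK: the `trace` decorator and the in-memory `TRACES` store are hypothetical stand-ins for a real observability backend.

```python
import functools
import time

TRACES = []  # in-memory stand-in for an observability backend


def trace(fn):
    """Record every call's input, output, and latency as a trace."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        TRACES.append({
            "name": fn.__name__,
            "input": {"args": args, "kwargs": kwargs},
            "output": result,
            "duration_s": time.time() - start,
        })
        return result
    return wrapper


@trace
def answer(question: str) -> str:
    # Stand-in for a real LLM call (e.g. a wrapped OpenAI client)
    return f"echo: {question}"


answer("What is an API gateway?")
print(TRACES[0]["output"])  # the full call is now captured for later analysis
```

The key idea is that instrumentation is a one-line change at the client boundary; everything downstream (datasets, evaluation, optimization) is built from the traces it captures.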
2. Build Evaluation Datasets
Create a test suite of sample questions and correct answers. This isn't traditional unit testing (semantic meaning matters, not exact strings), but it serves the same purpose: confidence that changes improve rather than break your application.
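A tiny sketch of what such a suite might look like. The names (`DATASET`, `evaluate`) are illustrative, and the token-overlap scorer is a crude stand-in for the semantic metrics (e.g. LLM-as-judge) a real platform would use:

```python
def token_overlap(expected: str, actual: str) -> float:
    """Crude semantic stand-in: Jaccard similarity over lowercase tokens."""
    a, b = set(expected.lower().split()), set(actual.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


DATASET = [
    {"question": "What does an API gateway do?",
     "expected": "It routes and governs API traffic."},
]


def evaluate(app, dataset, threshold=0.5):
    """Score each answer; an item passes if similarity clears the threshold."""
    results = []
    for item in dataset:
        actual = app(item["question"])
        score = token_overlap(item["expected"], actual)
        results.append({"question": item["question"],
                        "score": score,
                        "passed": score >= threshold})
    return results


# A trivial "application" that phrases the answer slightly differently
results = evaluate(lambda q: "It routes and governs api traffic.", DATASET)
print(results[0]["passed"])
```

Note that an exact string comparison would reject perfectly good answers; scoring by meaning is what separates LLM evaluation from classic unit tests.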
3. Automate Optimization
Opik's Agent Optimizer uses reinforcement learning to automatically generate and test prompt candidates, turning manual prompt engineering into an automated optimization loop.
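Stripped to its skeleton, the loop generates candidates, scores each against the evaluation dataset, and keeps the best. This sketch is not Opik's Agent Optimizer; `score_prompt` is a toy heuristic standing in for a real evaluation run:

```python
CANDIDATES = [
    "Answer briefly.",
    "Answer briefly and cite the docs.",
    "Answer in one sentence, plainly.",
]


def score_prompt(prompt: str) -> float:
    """Stand-in scorer: in practice, run the prompt over the eval
    dataset and average the metric scores."""
    return len(prompt) / 100  # toy heuristic for illustration only


def optimize(candidates):
    """Keep the highest-scoring prompt candidate."""
    best, best_score = None, float("-inf")
    for prompt in candidates:
        s = score_prompt(prompt)
        if s > best_score:
            best, best_score = prompt, s
    return best, best_score


best, score = optimize(CANDIDATES)
print(best)
```

A real optimizer also proposes new candidates from the winners (which is where the reinforcement learning comes in), but the select-by-eval-score core is the same.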
The Continuous Improvement Loop
When you connect these pieces, production failures get added to your evaluation dataset. Automated optimization runs generate new prompt candidates. A/B tests validate improvements against live traffic. The result is something that resembles traditional ML retraining, but for the LLM era.
You can see this in action in the video above, where we integrate Opik with Zuplo's AI Gateway in under a minute. The combination provides centralized governance, cost controls, and comprehensive observability, all of it free to use.
Why This Matters
The teams moving successfully from POC to production follow methodologies like this. They don't just build; they measure, test, and continuously improve. They treat AI development like software engineering, with evaluation suites, regression testing, and automated optimization.
The alternative is static prompts, manual tuning, and applications that never improve despite having thousands of users generating valuable feedback data.
Try It Yourself
Opik is fully open source; you can find everything you need to get started on GitHub or at comet.com.
Zuplo's AI Gateway is also available for free and includes a dedicated policy for working with Opik.
More from AI Week
This article is part of Zuplo's AI Week, a week dedicated to AI, LLMs, and, of course, APIs, centered around the release of our AI Gateway.
You can find the other articles and videos from this week below:
- Day 1: AI Gateway Overview with Zuplo CEO, Josh Twist
- Day 2: Is Spec-Driven AI Development the Future? with Guy Podjarny, CEO & Founder of Tessl
- Day 2: Using AI Gateway with LangChain & OpenAI with John McBride, Staff Software Engineer at Zuplo
- Day 3: Your AI Models Aren't Learning From Production Data with Gideon Mendels, CEO & Co-Founder of Comet ML
- Day 3: Using Claude Code with Zuplo's AI Gateway with Martyn Davies, Developer Advocate at Zuplo