In the world of AI, accuracy is king. We obsess over whether our Large Language Models (LLMs) provide the right answer. But what happens when the right answer is delivered in the wrong way? For a customer-facing AI agent, the quality of an interaction isn't just about facts—it's about feel. Tone, empathy, and helpfulness can be the difference between a satisfied customer and a frustrated one.
This is the story of how a team used a structured evaluation process to fix an AI agent that was technically correct but failing its users. They went from subjective feedback like "your bot is robotic" to a quantifiable metric they could systematically improve.
Here’s how they did it with Evals.do.
The team had developed a customer support agent, "SupportBot v1." By all traditional metrics, it was a success. It could look up order statuses, explain billing cycles, and answer product questions with high factual accuracy.
Yet user satisfaction scores were low, and the feedback was consistent: the agent was accurate, but it lacked the empathetic, professional tone expected from a support specialist. The problem was clear, but the solution wasn't. How do you "fix" a bad tone? How do you measure something so subjective, and how do you make sure it doesn't degrade with the next model update?
The first step was to move from vague feelings to a concrete number. The team used Evals.do to define a custom evaluation metric specifically for this problem.
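Conceptually, the metric pairs a description of what "good tone" means with a scoring scale and a passing threshold. Here is a minimal sketch in TypeScript; the MetricDefinition shape and field names are illustrative assumptions, not the actual Evals.do SDK, though the metric name and the 4.5 threshold mirror the evaluation results shown later.

// Illustrative sketch of a custom "tone" metric. The MetricDefinition shape
// is an assumption for this article; only the metric name and the 4.5
// threshold come from the evaluation results shown below.
interface MetricDefinition {
  name: string;
  description: string;                  // what the evaluator should look for
  scale: { min: number; max: number };
  threshold: number;                    // minimum score required to pass
}

const toneMetric: MetricDefinition = {
  name: "tone",
  description:
    "Rate how empathetic, warm, and professional the response sounds, " +
    "independent of whether the facts are correct.",
  scale: { min: 1, max: 5 },
  threshold: 4.5,
};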
By defining this metric, "bad tone" was no longer a subjective complaint; it was a measurable Key Performance Indicator (KPI).
A metric is useless without a consistent way to test it. The team created a dataset in Evals.do containing a few dozen real-world scenarios their support agent would face. This wasn't just a list of simple questions; it was a gauntlet of customer emotions and complexities, from a frustrated customer chasing a delayed order to a confused user trying to make sense of a billing charge (a few representative cases are sketched below).
This dataset ensured the agent's tone would be tested against a full spectrum of interactions, providing a reliable baseline for its performance.
With the metric and dataset in place, it was time to get the initial score. The team ran an evaluation on "SupportBot v1." The results, captured in Evals.do, were illuminating.
{
  "evaluationId": "eval_8a7d6e8f4c",
  "agentId": "customer-support-agent-v1",
  "status": "completed",
  "overallScore": 3.82,
  "passed": false,
  "metrics": [
    {
      "name": "accuracy",
      "score": 4.7,
      "threshold": 4.0,
      "passed": true
    },
    {
      "name": "tone",
      "score": 3.6,
      "threshold": 4.5,
      "passed": false
    }
  ],
  "evaluatedAt": "2024-10-27T10:30:00Z"
}
The data confirmed their hypothesis perfectly. The agent's accuracy score was a high 4.7, easily passing the threshold. But the tone score was a dismal 3.6, failing the evaluation. They now had a concrete benchmark to beat.
Now for the fun part: fixing the problem. The team focused on the agent's core instructions—its system prompt.
The Original Prompt (V1):
"You are a helpful assistant. Answer the user's question based on the provided data."
This prompt optimized for accuracy but ignored the user experience. The team drafted a new prompt aimed directly at improving the tone score.
The Improved Prompt (V2):
"You are a friendly and empathetic customer support specialist for our company. Your primary goal is to make the customer feel heard and valued. Always start by acknowledging their situation before providing a solution. Maintain a warm and professional tone. For example, instead of 'Your order is delayed,' say 'I understand how frustrating a delay can be; let me look into what's happening with your order for you right away.'"
This simple change completely reframed the agent's persona and interaction style.
The team updated their agent with the new prompt, creating "SupportBot v2," and ran the exact same evaluation on Evals.do. The ability to run an identical test against a new version is crucial for measuring true progress.
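The re-run itself is deliberately boring: same metrics, same dataset, only the agent version changes. A sketch of what triggering it over an API might look like follows; the endpoint, payload, and dataset name are assumptions for illustration, not the documented Evals.do API.

// Illustrative: trigger the identical evaluation against the new agent
// version. The URL, payload shape, and dataset name are assumptions.
async function runEvaluation(agentId: string): Promise<void> {
  const response = await fetch("https://evals.do/api/evaluations", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.EVALS_DO_API_KEY}`,
    },
    body: JSON.stringify({
      agentId,                            // "customer-support-agent-v2"
      metrics: ["accuracy", "tone"],      // unchanged from the v1 run
      dataset: "support-tone-scenarios",  // unchanged from the v1 run
    }),
  });
  const result = await response.json();
  console.log(`Overall score for ${agentId}: ${result.overallScore}`);
}

runEvaluation("customer-support-agent-v2").catch(console.error);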
The results spoke for themselves:
{
  "evaluationId": "eval_9b1e7f9g5d",
  "agentId": "customer-support-agent-v2",
  "status": "completed",
  "overallScore": 4.63,
  "passed": true,
  "metrics": [
    {
      "name": "accuracy",
      "score": 4.68,
      "threshold": 4.0,
      "passed": true
    },
    {
      "name": "tone",
      "score": 4.57,
      "threshold": 4.5,
      "passed": true
    }
  ],
  "evaluatedAt": "2024-10-28T11:00:00Z"
}
Success! The tone score jumped from 3.6 to 4.57—a 27% improvement. Not only did it now pass the 4.5 threshold, but it did so without significantly impacting accuracy. The team had successfully translated a subjective user complaint into a measurable metric and systematically improved it.
This case study isn't just about a one-time fix. The real power comes from integrating this evaluation into their continuous integration and deployment (CI/CD) pipeline. Now, every time a developer wants to update the agent—whether it's changing the prompt, upgrading the LLM, or adding a new tool—an evaluation is triggered automatically.
This creates a quality gate, ensuring that a future change won't accidentally make the agent "robotic" again. It's how modern AI teams move from building agents to reliably maintaining high-quality, production-ready AI systems.
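A quality gate can be as small as a script in the pipeline that reads the evaluation result and fails the build when the agent doesn't pass. The sketch below works off the result shape shown earlier; how the result is fetched (API call, SDK, CLI) is up to the pipeline, and the gate logic is an illustration rather than an official Evals.do integration.

// Illustrative CI quality gate built around the evaluation result shape
// shown above. A non-zero exit code fails the pipeline and blocks the deploy.
interface MetricResult {
  name: string;
  score: number;
  threshold: number;
  passed: boolean;
}

interface EvaluationResult {
  evaluationId: string;
  agentId: string;
  overallScore: number;
  passed: boolean;
  metrics: MetricResult[];
}

function qualityGate(result: EvaluationResult): void {
  for (const metric of result.metrics) {
    const status = metric.passed ? "PASS" : "FAIL";
    console.log(`${status}  ${metric.name}: ${metric.score} (threshold ${metric.threshold})`);
  }
  if (!result.passed) {
    console.error(`Evaluation ${result.evaluationId} failed; blocking deployment of ${result.agentId}.`);
    process.exit(1);  // CI treats a non-zero exit as a failed check
  }
  console.log(`Evaluation ${result.evaluationId} passed; ${result.agentId} is clear to deploy.`);
}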
Stop guessing about your AI's quality. Start measuring what matters.
What can I evaluate with Evals.do?
You can evaluate any AI component, from individual LLM-powered functions and complex agentic workflows to full conversational agents. The platform is designed to be flexible and adaptable to your specific needs.
How are evaluations scored?
You define custom metrics (like accuracy, tone, or relevance) with specific scales and passing thresholds. Evals.do can use a combination of automated LLM-as-a-judge evaluators and human review to score performance against your defined criteria.
Can I integrate Evals.do into my CI/CD pipeline?
Yes, Evals.do provides a simple API and SDKs, making it easy to trigger evaluations automatically as part of your CI/CD pipeline. This allows you to continuously monitor and prevent AI performance regressions before they reach production.
Ready to quantify the performance and quality of your own AI agents? Visit Evals.do to get started.