Building with AI and Large Language Models (LLMs) is exhilarating. You can create functions that summarize text, classify sentiment, or even generate code. But as you move from a cool prototype to a production-ready application, a critical question emerges: How do you know it's working correctly? And more importantly, how do you ensure it keeps working correctly as you make changes?
Traditional unit tests check for deterministic, binary outcomes. But AI output is non-deterministic, and its quality is nuanced. Is the response accurate? Is the tone appropriate? Is it helpful?
This is where a systematic approach to AI evaluation becomes essential. At Evals.do, we provide a unified platform to test, measure, and ensure the quality of your AI systems, from discrete functions to complex agentic workflows.
In this guide, we'll walk you through setting up your very first AI function evaluation in under 10 minutes. Let's move from hoping your AI works to knowing it does.
Think about a simple unit test: assert add(2, 2) == 4. The outcome is predictable and absolute. Now, consider an AI function that summarizes an article. There isn't one single "correct" summary. There are good summaries, bad ones, ones that miss the point, and ones that nail the nuance.
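To make that contrast concrete, here's a minimal sketch (the summaries below are invented purely for illustration):

# Deterministic code has a single correct answer, so a plain assert is enough.
def add(a: int, b: int) -> int:
    return a + b

assert add(2, 2) == 4  # always passes

# An AI summarizer has many acceptable outputs for the same article, so an
# exact-match assertion is the wrong tool. Both of these made-up summaries
# are fine, yet a naive `generated == expected` check would reject one of them.
acceptable_summaries = [
    "The article argues that remote work boosts productivity.",
    "Productivity improves with remote work, the article claims.",
]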
An AI Evaluation measures these qualitative and quantitative aspects. Instead of a simple pass/fail, it scores performance against key metrics such as accuracy, helpfulness, tone, and latency.
Evals.do allows you to define these metrics as code, run them against test datasets, and get actionable insights to improve your AI's performance and ship with confidence.
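To build intuition for what "metrics as code" means, here's a rough, framework-free sketch of a single metric. It isn't the Evals.do API (the SDK example further down declares the equivalent evaluators for you); it just shows the underlying idea:

# A metric expressed as code: a small function that scores one model output
# against the ground truth.
def exact_match(output: str, expected: str) -> float:
    """Returns 1.0 if the predicted label matches the expected label, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0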
Let's imagine we've built a simple AI function, classify_sentiment, that takes a customer review and returns "Positive", "Negative", or "Neutral".
# Our amazing AI function we want to test
def classify_sentiment(text: str) -> str:
    # ...magic LLM call happens here...
    return "Positive"  # or "Negative", "Neutral"
How do we evaluate it? We'll use Evals.do to test it against a predefined set of examples.
First, we need a "ground truth" dataset. This is a collection of inputs and their expected outputs. In a real-world scenario, this dataset would be much larger, but for our example, a simple list of dictionaries will do.
# test_cases.py
TEST_DATASET = [
    {"input": "The service was incredibly fast and friendly!", "expected": "Positive"},
    {"input": "I'm very happy with the purchase.", "expected": "Positive"},
    {"input": "The item arrived, but it was the wrong color.", "expected": "Negative"},
    {"input": "The package was delivered on time.", "expected": "Neutral"},
]
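In a real project, a larger dataset usually lives in a file rather than in code. One common pattern is JSONL, with one {"input": ..., "expected": ...} object per line; the file name below is hypothetical:

# Load test cases from a JSONL file (hypothetical path).
import json

def load_dataset(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# TEST_DATASET = load_dataset("sentiment_test_cases.jsonl")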
With Evals.do, your evaluations live alongside your application code. Using our SDK, you can easily define which function to test, what dataset to use, and how to measure success.
# my_first_evaluation.py
from evals_do import Evals, Evaluator, Dataset

from my_app import classify_sentiment  # Import the function to test
from test_cases import TEST_DATASET

# 1. Define the evaluation
sentiment_evaluation = Evals(
    name="Customer Sentiment Classifier Evaluation",
    target=classify_sentiment,  # The AI function we are testing
    dataset=Dataset(TEST_DATASET),
    # 2. Define how to measure performance
    evaluators=[
        Evaluator.Accuracy(threshold=0.95),  # Pass if 95% or more are correct
        Evaluator.Latency(max_ms=500),       # Pass if avg. response is under 500ms
    ],
)

# 3. Run the evaluation
if __name__ == "__main__":
    sentiment_evaluation.run()
This simple file defines everything needed for a rigorous test: the target function, the dataset, and the evaluators that score the results against our performance thresholds.
Now, simply run the evaluation from your terminal. Evals.do executes your AI function against each entry in the dataset, runs the results through your defined evaluators, and aggregates the scores.
python my_first_evaluation.py
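Conceptually, a run like this boils down to a loop you could sketch yourself; the platform adds the reporting, aggregation, and CI hooks around it. The following is a simplified illustration, not the SDK's actual internals:

# Simplified mental model of an evaluation run: call the target on every test
# case, score the results, and compare the aggregates against the thresholds.
import time

def run_sketch(target, dataset, accuracy_threshold=0.95, max_avg_ms=500):
    correct, latencies_ms = 0, []
    for case in dataset:
        start = time.perf_counter()
        output = target(case["input"])
        latencies_ms.append((time.perf_counter() - start) * 1000)
        correct += int(output == case["expected"])
    accuracy = correct / len(dataset)
    avg_latency = sum(latencies_ms) / len(latencies_ms)
    return {
        "accuracy": {"score": accuracy, "result": "PASS" if accuracy >= accuracy_threshold else "FAIL"},
        "latency": {"avg_ms": avg_latency, "result": "PASS" if avg_latency <= max_avg_ms else "FAIL"},
    }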
After the run completes, Evals.do provides a detailed JSON summary. This output can be viewed in our UI, sent as a notification, or used to programmatically gate a deployment in your CI/CD pipeline.
Here’s an example of what that output looks like for a more complex agent evaluation:
{
  "evaluationRunId": "run_a3b8c1d9e0f7",
  "evaluationName": "Customer Support Agent Evaluation",
  "status": "Completed",
  "overallResult": "FAIL",
  "timestamp": "2023-10-27T10:00:00Z",
  "summary": {
    "totalTests": 150,
    "passed": 135,
    "failed": 15,
    "passRate": 0.9
  },
  "metricResults": [
    {
      "name": "accuracy",
      "averageScore": 4.1,
      "threshold": 4.0,
      "result": "PASS"
    },
    {
      "name": "helpfulness",
      "averageScore": 4.3,
      "threshold": 4.2,
      "result": "PASS"
    },
    {
      "name": "tone",
      "averageScore": 4.4,
      "threshold": 4.5,
      "result": "FAIL"
    }
  ]
}
Look closely at this result. Even with a 90% pass rate and good scores on accuracy and helpfulness, the overallResult is a FAIL. Why? Because the tone metric dipped just below the required threshold of 4.5. This is the power of systematic AI evaluation—catching subtle but critical regressions before they impact your users.
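Because that summary is plain JSON, gating a deployment on it (as mentioned earlier) takes only a few lines. A minimal sketch, assuming the run summary has been saved to a file named results.json; adapt the path and wiring to your own pipeline:

# gate_deployment.py -- fail the CI job when an evaluation run fails.
import json
import sys

with open("results.json") as f:
    result = json.load(f)

if result["overallResult"] != "PASS":
    failing = [m["name"] for m in result["metricResults"] if m["result"] == "FAIL"]
    print(f"Evaluation {result['evaluationRunId']} failed on: {', '.join(failing)}")
    sys.exit(1)  # a non-zero exit code blocks the deployment step

print("All evaluation thresholds met -- safe to deploy.")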
You've just completed your first AI function evaluation! You now have a repeatable, data-driven way to ensure your sentiment classifier performs as expected.
This is just the beginning. With Evals.do, you can apply the same principles to much more complex systems, from multi-turn customer support agents to full agentic workflows.
Ready to take control of your AI quality? Explore the Evals.do platform and start building more reliable, accurate, and trustworthy AI systems today. Measure, Monitor, Improve.