Evaluation
Overview
LLM evaluation measures prompt quality, model performance, and system reliability. Unlike traditional software testing, where outputs are deterministic and can be asserted exactly, LLM outputs vary between runs, so evaluation relies on semantic similarity metrics, golden datasets, and continuous regression detection. Spring AI applications should implement automated evaluation pipelines to detect prompt drift, model degradation, and accuracy regressions.
Key Concepts
Evaluation Dimensions
┌─────────────────────────────────────────────────────────────┐
│ LLM Evaluation Dimensions │
├─────────────────────────────────────────────────────────────┤
│ │
│ CORRECTNESS (Accuracy) │
│ ────────────────────── │
│ Does the answer factually align with the truth? │
│ Metrics: Exact match, semantic similarity, fact checking │
│ │
│ RELEVANCE │
│ ───────── │
│ Is the answer on-topic and addresses the question? │
│ Metrics: Relevance score (LLM-as-judge) │
│ │
│ FAITHFULNESS (Grounding) │
│ ──────────────────────── │
│ Is the answer supported by the provided context? │
│ Metrics: Citation accuracy, hallucination detection │
│ │
│ COHERENCE │
│ ───────── │
│ Is the answer well-structured and easy to understand? │
│ Metrics: Readability, fluency score │
│ │
│ LATENCY │
│ ─────── │
│ How fast is the response? │
│ Metrics: P50, P95, P99 response time │
│ │
│ COST │
│ ──── │
│ Token usage per request │
│ Metrics: Input tokens, output tokens, cost per query │
│ │
└─────────────────────────────────────────────────────────────┘
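The dimensions above can be tracked together in a simple value type. A minimal sketch (the record and field names are illustrative, not part of any Spring AI API):

```java
// Hypothetical aggregate of per-dimension evaluation results.
// Quality dimensions are scored 0.0-1.0; latency and cost are
// operational metrics and are tracked alongside, not averaged in.
public record DimensionScores(
        double correctness,
        double relevance,
        double faithfulness,
        double coherence,
        long p95LatencyMillis,
        double costPerQuery) {

    // Simple unweighted mean of the four quality dimensions
    public double qualityScore() {
        return (correctness + relevance + faithfulness + coherence) / 4.0;
    }
}
```

In practice you would weight the dimensions according to what your application cares about most; an unweighted mean is just the simplest starting point.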
Golden Dataset Structure
public class GoldenExample {
    private String id;
    private String query;
    private String expectedAnswer;
    private List<String> acceptableAnswers; // Accepted wording variations
    private Map<String, Object> metadata;
    // Optional: for RAG evaluation
    private List<String> requiredContext;
    private List<String> expectedCitations;

    // Convenience constructor used in the examples below
    public GoldenExample(String id, String query, String expectedAnswer,
                         List<String> requiredContext) {
        this.id = id;
        this.query = query;
        this.expectedAnswer = expectedAnswer;
        this.requiredContext = requiredContext;
    }
}
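Golden datasets are usually kept in version-controlled files so that changes are reviewable. A minimal stdlib loading sketch, assuming a simple tab-separated layout (`id<TAB>query<TAB>expectedAnswer`); in practice a JSON file plus a JSON library is more common:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal loader sketch: one golden example per tab-separated line.
// The file layout here is an assumption for illustration only.
public class GoldenDatasetLoader {

    public record Example(String id, String query, String expectedAnswer) {}

    public static List<Example> parse(List<String> lines) {
        List<Example> examples = new ArrayList<>();
        for (String line : lines) {
            if (line.isBlank() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] fields = line.split("\t", 3);
            if (fields.length < 3) {
                throw new IllegalArgumentException("Malformed line: " + line);
            }
            examples.add(new Example(fields[0], fields[1], fields[2]));
        }
        return examples;
    }
}
```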
Best Practices
1. Build a Representative Golden Dataset
Cover edge cases, common queries, and known failure modes.
@Service
public class GoldenDatasetBuilder {
public List<GoldenExample> buildDataset() {
return List.of(
// Common queries
new GoldenExample(
"q1",
"What is our refund policy?",
"Refunds are available within 30 days of purchase.",
List.of("source:policy-manual")
),
// Edge case: ambiguous query
new GoldenExample(
"q2",
"Can I return this?",
"I need more information. What product are you referring to?",
List.of()
),
// Edge case: out-of-scope
new GoldenExample(
"q3",
"What's the weather today?",
"I don't have access to weather information.",
List.of()
),
// Multi-hop reasoning
new GoldenExample(
"q4",
"If I bought a laptop on Jan 1st, can I return it on Feb 15th?",
"No, our 30-day return window has expired.",
List.of("source:policy-manual")
)
);
}
}
2. Use Semantic Similarity for Flexible Matching
Exact string matching is too rigid for LLM outputs.
@Service
public class SemanticEvaluator {

    private static final double SIMILARITY_THRESHOLD = 0.85;

    private final EmbeddingModel embeddingModel;

    public SemanticEvaluator(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    public boolean isCorrect(String expected, String actual) {
        return score(expected, actual) >= SIMILARITY_THRESHOLD;
    }

    public double score(String expected, String actual) {
        return cosineSimilarity(embeddingModel.embed(expected), embeddingModel.embed(actual));
    }

    private static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
3. Implement LLM-as-Judge for Complex Evaluation
Use a strong LLM to evaluate weaker model outputs.
@Service
public class LLMJudge {
private final ChatModel judgeModel; // GPT-4 for judging
public EvaluationResult evaluate(String query, String answer, String groundTruth) {
String prompt = String.format("""
You are an expert evaluator. Score the answer on a scale of 1-5:
Query: %s
Ground Truth: %s
Answer to Evaluate: %s
Scoring criteria:
5 = Perfect, matches ground truth
4 = Correct, minor wording differences
3 = Partially correct
2 = Incorrect but relevant
1 = Completely wrong or off-topic
Output JSON:
{"score": 1-5, "reasoning": "..."}
""", query, groundTruth, answer);
String response = judgeModel.call(new Prompt(prompt))
.getResult()
.getOutput()
.getContent();
return parseEvaluationResult(response);
}
}
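The `parseEvaluationResult` helper is left undefined above. One possible sketch, using only `java.util.regex` to pull the score out of the judge's JSON reply; a real JSON parser (e.g. Jackson) is more robust, but even then it pays to tolerate prose the model may add around the JSON:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal, dependency-free sketch of parsing the judge's
// {"score": n, "reasoning": "..."} reply.
public class JudgeResponseParser {

    private static final Pattern SCORE = Pattern.compile("\"score\"\\s*:\\s*([1-5])");

    // Returns the 1-5 score, or -1 if no score field is found
    public static int extractScore(String response) {
        Matcher m = SCORE.matcher(response);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }
}
```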
4. Track Evaluation Over Time (Regression Detection)
Detect when prompt/model changes degrade quality.
@Service
public class RegressionDetectionService {
private final GoldenDatasetEvaluator evaluator;
private final MetricStore metricStore;
private final AlertService alertService;
@Scheduled(cron = "0 0 * * * ?") // Hourly
public void detectRegressions() {
EvaluationMetrics current = evaluator.evaluateFullDataset();
EvaluationMetrics baseline = metricStore.getBaseline();
if (current.getAccuracy() < baseline.getAccuracy() - 0.05) { // 5% drop
alertService.send(
"Accuracy regression detected: " +
baseline.getAccuracy() + " → " + current.getAccuracy()
);
}
metricStore.saveSnapshot(current);
}
}
5. Evaluate Retrieval Quality Separately
RAG systems need both retrieval and generation metrics.
@Service
public class RAGEvaluator {
private final VectorStore vectorStore;
private final ChatModel chatModel;
public RetrievalMetrics evaluateRetrieval(List<GoldenExample> dataset) {
int totalRelevant = 0;
int totalRetrieved = 0;
int truePositives = 0;
for (GoldenExample example : dataset) {
List<Document> retrieved = vectorStore.similaritySearch(
SearchRequest.query(example.getQuery()).withTopK(5)
);
Set<String> retrievedIds = retrieved.stream()
.map(Document::getId)
.collect(Collectors.toSet());
Set<String> relevantIds = new HashSet<>(example.getRequiredContext());
totalRelevant += relevantIds.size();
totalRetrieved += retrievedIds.size();
truePositives += (int) retrievedIds.stream().filter(relevantIds::contains).count();
}
double precision = totalRetrieved == 0 ? 0.0 : (double) truePositives / totalRetrieved;
double recall = totalRelevant == 0 ? 0.0 : (double) truePositives / totalRelevant;
double f1 = (precision + recall) == 0.0 ? 0.0 : 2 * (precision * recall) / (precision + recall);
return new RetrievalMetrics(precision, recall, f1);
}
}
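The precision/recall/F1 arithmetic can also be factored into a small standalone helper with explicit zero-guards, which makes it easy to unit-test independently of the vector store:

```java
// Stdlib sketch of the retrieval-metric arithmetic.
// Zero-guards avoid NaN when nothing was retrieved or relevant.
public class RetrievalMath {

    // Fraction of retrieved documents that were relevant
    public static double precision(int truePositives, int retrieved) {
        return retrieved == 0 ? 0.0 : (double) truePositives / retrieved;
    }

    // Fraction of relevant documents that were retrieved
    public static double recall(int truePositives, int relevant) {
        return relevant == 0 ? 0.0 : (double) truePositives / relevant;
    }

    // Harmonic mean of precision and recall
    public static double f1(double precision, double recall) {
        double denom = precision + recall;
        return denom == 0.0 ? 0.0 : 2 * precision * recall / denom;
    }
}
```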
Code Examples
Example 1: Basic Golden Dataset Evaluation
@Service
public class GoldenDatasetEvaluator {
private final ChatModel chatModel;
private final SemanticEvaluator semanticEvaluator;
public EvaluationReport evaluate(List<GoldenExample> dataset) {
int correct = 0;
List<FailedExample> failures = new ArrayList<>();
for (GoldenExample example : dataset) {
String actual = chatModel.call(new Prompt(example.getQuery()))
.getResult()
.getOutput()
.getContent();
if (semanticEvaluator.isCorrect(example.getExpectedAnswer(), actual)) {
correct++;
} else {
failures.add(new FailedExample(example, actual));
}
}
double accuracy = (double) correct / dataset.size();
return new EvaluationReport(accuracy, failures);
}
}
✅ Good for: Continuous integration, regression testing
❌ Not good for: Nuanced quality (use LLM-as-judge)
Example 2: Prompt A/B Testing
@Service
public class PromptABTester {
private final ChatModel chatModel;
private final LLMJudge judge;
public ABTestResult comparePrompts(
PromptTemplate templateA,
PromptTemplate templateB,
List<GoldenExample> dataset
) {
double scoreA = evaluatePrompt(templateA, dataset);
double scoreB = evaluatePrompt(templateB, dataset);
return new ABTestResult(
scoreA,
scoreB,
scoreB > scoreA ? templateB : templateA,
Math.abs(scoreB - scoreA)
);
}
private double evaluatePrompt(PromptTemplate template, List<GoldenExample> dataset) {
double totalScore = 0;
for (GoldenExample example : dataset) {
Prompt prompt = template.create(Map.of("query", example.getQuery()));
String answer = chatModel.call(prompt)
.getResult()
.getOutput()
.getContent();
EvaluationResult result = judge.evaluate(
example.getQuery(),
answer,
example.getExpectedAnswer()
);
totalScore += result.getScore();
}
return totalScore / dataset.size();
}
}
✅ Good for: Prompt optimization, version selection
❌ Not good for: Real-time decisions (slow evaluation)
Example 3: Faithfulness Evaluation (RAG)
@Service
public class FaithfulnessEvaluator {
private final ChatModel chatModel;
public double evaluateFaithfulness(String query, String context, String answer) {
String prompt = String.format("""
Determine if the answer is faithful to the context (1-5):
Query:
%s
Context:
%s
Answer:
%s
Scoring:
5 = Fully supported by context, no hallucinations
4 = Mostly supported, minor unsupported claims
3 = Partially supported
2 = Mostly unsupported
1 = Completely fabricated
Output only the score (1-5):
""", query, context, answer);
String response = chatModel.call(new Prompt(prompt))
.getResult()
.getOutput()
.getContent();
// Parse defensively: models sometimes wrap the score in extra text
java.util.regex.Matcher m = java.util.regex.Pattern.compile("[1-5]").matcher(response);
return m.find() ? Double.parseDouble(m.group()) : 0.0;
}
}
✅ Good for: Detecting hallucinations
❌ Not good for: When context is not available
Example 4: Cost and Latency Evaluation
@Service
public class PerformanceEvaluator {
private final ChatModel chatModel;
public PerformanceMetrics evaluate(List<String> queries) {
List<Long> latencies = new ArrayList<>();
int totalInputTokens = 0;
int totalOutputTokens = 0;
for (String query : queries) {
long start = System.currentTimeMillis();
ChatResponse response = chatModel.call(new Prompt(query));
long latency = System.currentTimeMillis() - start;
latencies.add(latency);
totalInputTokens += response.getMetadata().getUsage().getPromptTokens();
totalOutputTokens += response.getMetadata().getUsage().getGenerationTokens();
}
Collections.sort(latencies);
return new PerformanceMetrics(
latencies.get(latencies.size() / 2), // P50
latencies.get((int) (latencies.size() * 0.95)), // P95
totalInputTokens,
totalOutputTokens,
estimateCost(totalInputTokens, totalOutputTokens)
);
}
private double estimateCost(int inputTokens, int outputTokens) {
return (inputTokens * 0.00001) + (outputTokens * 0.00003); // Example pricing
}
}
✅ Good for: SLO validation, cost optimization
❌ Not good for: Quality evaluation (use semantic metrics)
Example 5: Automated Regression Suite
@RestController
@RequestMapping("/api/evaluation")
public class EvaluationController {
private final GoldenDatasetEvaluator evaluator;
private final GoldenDataset goldenDataset;
private final MetricStore metricStore;
@PostMapping("/run")
public EvaluationReport runEvaluation() {
return evaluator.evaluate(goldenDataset.load());
}
@GetMapping("/history")
public List<EvaluationSnapshot> getHistory() {
return metricStore.getSnapshots(Duration.ofDays(30));
}
@PostMapping("/baseline")
public void setBaseline(@RequestBody EvaluationMetrics metrics) {
metricStore.setBaseline(metrics);
}
}
✅ Good for: CI/CD integration, dashboards
❌ Not good for: Ad-hoc testing (use manual evaluation)
Anti-Patterns
❌ No Golden Dataset
// DON'T: Manual testing only
// "Looks good to me" ← Not scalable
Why: Cannot detect regressions; no objective quality measure.
✅ DO: Build and maintain golden dataset
List<GoldenExample> dataset = goldenDatasetBuilder.build();
EvaluationReport report = evaluator.evaluate(dataset);
❌ Exact String Matching
// DON'T: Too rigid
assertEquals(expected, actual);
Why: LLMs produce varied outputs; exact match fails for valid answers.
✅ DO: Use semantic similarity
assertTrue(semanticEvaluator.isCorrect(expected, actual));
❌ Evaluating Only Accuracy
// DON'T: Ignore latency and cost
double accuracy = evaluator.getAccuracy();
Why: Slow or expensive systems are not production-ready.
✅ DO: Evaluate multiple dimensions
EvaluationMetrics metrics = evaluator.evaluate();
assertTrue(metrics.getAccuracy() > 0.90);
assertTrue(metrics.getP95Latency() < 2000); // < 2 seconds
assertTrue(metrics.getCostPerQuery() < 0.05); // < $0.05
References
Related Skills
chat-models.md — LLM integration
prompt-templates.md — Prompt versioning
retrieval.md — RAG quality metrics
observability.md — Production monitoring