Evaluation
Overview
LLM evaluation measures prompt quality, model performance, and system reliability. Unlike traditional software testing, where outputs are deterministic and can be asserted exactly, LLM outputs vary between runs, so evaluation relies on semantic similarity metrics, golden datasets, and continuous regression detection. Spring AI applications should implement automated evaluation pipelines to detect prompt drift, model degradation, and accuracy regressions.
Key Concepts
Evaluation Dimensions
┌─────────────────────────────────────────────────────────────┐
│ LLM Evaluation Dimensions │
├─────────────────────────────────────────────────────────────┤
│ │
│ CORRECTNESS (Accuracy) │
│ ────────────────────── │
│ Does the answer factually align with the truth? │
│ Metrics: Exact match, semantic similarity, fact checking │
│ │
│ RELEVANCE │
│ ───────── │
│ Is the answer on-topic and addresses the question? │
│ Metrics: Relevance score (LLM-as-judge) │
│ │
│ FAITHFULNESS (Grounding) │
│ ──────────────────────── │
│ Is the answer supported by the provided context? │
│ Metrics: Citation accuracy, hallucination detection │
│ │
│ COHERENCE │
│ ───────── │
│ Is the answer well-structured and easy to understand? │
│ Metrics: Readability, fluency score │
│ │
│ LATENCY │
│ ─────── │
│ How fast is the response? │
│ Metrics: P50, P95, P99 response time │
│ │
│ COST │
│ ──── │
│ Token usage per request │
│ Metrics: Input tokens, output tokens, cost per query │
│ │
└─────────────────────────────────────────────────────────────┘
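The dimensions above can be tracked together in a simple value type. A minimal sketch (the record and field names are illustrative, not part of any Spring AI API):

```java
// Hypothetical aggregate of per-dimension evaluation results.
// Quality dimensions are scored 0.0-1.0; latency and cost are
// operational metrics and are tracked alongside, not averaged in.
public record DimensionScores(
        double correctness,
        double relevance,
        double faithfulness,
        double coherence,
        long p95LatencyMillis,
        double costPerQuery) {

    // Simple unweighted mean of the four quality dimensions
    public double qualityScore() {
        return (correctness + relevance + faithfulness + coherence) / 4.0;
    }
}
```

In practice you would weight the dimensions according to what your application cares about most; an unweighted mean is just the simplest starting point.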
Golden Dataset Structure
public class GoldenExample {
    private String id;
    private String query;
    private String expectedAnswer;
    private List<String> acceptableAnswers; // Accepted wording variations
    private Map<String, Object> metadata;
    // Optional: for RAG evaluation
    private List<String> requiredContext;
    private List<String> expectedCitations;

    // Convenience constructor used in the examples below
    public GoldenExample(String id, String query, String expectedAnswer,
                         List<String> requiredContext) {
        this.id = id;
        this.query = query;
        this.expectedAnswer = expectedAnswer;
        this.requiredContext = requiredContext;
    }
}
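Golden datasets are usually kept in version-controlled files so that changes are reviewable. A minimal stdlib loading sketch, assuming a simple tab-separated layout (`id<TAB>query<TAB>expectedAnswer`); in practice a JSON file plus a JSON library is more common:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal loader sketch: one golden example per tab-separated line.
// The file layout here is an assumption for illustration only.
public class GoldenDatasetLoader {

    public record Example(String id, String query, String expectedAnswer) {}

    public static List<Example> parse(List<String> lines) {
        List<Example> examples = new ArrayList<>();
        for (String line : lines) {
            if (line.isBlank() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] fields = line.split("\t", 3);
            if (fields.length < 3) {
                throw new IllegalArgumentException("Malformed line: " + line);
            }
            examples.add(new Example(fields[0], fields[1], fields[2]));
        }
        return examples;
    }
}
```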
Best Practices
1. Build a Representative Golden Dataset
Cover edge cases, common queries, and known failure modes.
@Service
public class GoldenDatasetBuilder {
public List<GoldenExample> buildDataset() {
return List.of(
// Common queries
new GoldenExample(
"q1",
"What is our refund policy?",
"Refunds are available within 30 days of purchase.",
List.of("source:policy-manual")
),
// Edge case: ambiguous query
new GoldenExample(
"q2",
"Can I return this?",
"I need more information. What product are you referring to?",
List.of()
),
// Edge case: out-of-scope
new GoldenExample(
"q3",
"What's the weather today?",
"I don't have access to weather information.",
List.of()
),
// Multi-hop reasoning
new GoldenExample(
"q4",
"If I bought a laptop on Jan 1st, can I return it on Feb 15th?",
"No, our 30-day return window has expired.",
List.of("source:policy-manual")
)
);
}
}
2. Use Semantic Similarity for Flexible Matching
Exact string matching is too rigid for LLM outputs.
@Service
public class SemanticEvaluator {

    private static final double SIMILARITY_THRESHOLD = 0.85;

    private final EmbeddingModel embeddingModel;

    public SemanticEvaluator(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    public boolean isCorrect(String expected, String actual) {
        return score(expected, actual) >= SIMILARITY_THRESHOLD;
    }

    public double score(String expected, String actual) {
        return cosineSimilarity(embeddingModel.embed(expected), embeddingModel.embed(actual));
    }

    private static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
3. Implement LLM-as-Judge for Complex Evaluation
Use a strong LLM to evaluate weaker model outputs.
@Service
public class LLMJudge {
private final ChatModel judgeModel; // GPT-4 for judging
public EvaluationResult evaluate(String query, String answer, String groundTruth) {
String prompt = String.format("""
You are an expert evaluator. Score the answer on a scale of 1-5:
Query: %s
Ground Truth: %s
Answer to Evaluate: %s
Scoring criteria:
5 = Perfect, matches ground truth
4 = Correct, minor wording differences
3 = Partially correct
2 = Incorrect but relevant
1 = Completely wrong or off-topic
Output JSON:
{"score": 1-5, "reasoning": "..."}
""", query, groundTruth, answer);
String response = judgeModel.call(new Prompt(prompt))
.getResult()
.getOutput()
.getContent();
return parseEvaluationResult(response);
}
}
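The `parseEvaluationResult` helper is left undefined above. One possible sketch, using only `java.util.regex` to pull the score out of the judge's JSON reply; a real JSON parser (e.g. Jackson) is more robust, but even then it pays to tolerate prose the model may add around the JSON:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal, dependency-free sketch of parsing the judge's
// {"score": n, "reasoning": "..."} reply.
public class JudgeResponseParser {

    private static final Pattern SCORE = Pattern.compile("\"score\"\\s*:\\s*([1-5])");

    // Returns the 1-5 score, or -1 if no score field is found
    public static int extractScore(String response) {
        Matcher m = SCORE.matcher(response);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }
}
```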
4. Track Evaluation Over Time (Regression Detection)
Detect when prompt/model changes degrade quality.
@Service
public class RegressionDetectionService {
private final GoldenDatasetEvaluator evaluator;
private final MetricStore metricStore;
private final AlertService alertService;
@Scheduled(cron = "0 0 * * * ?") // Hourly
public void detectRegressions() {
EvaluationMetrics current = evaluator.evaluateFullDataset();
EvaluationMetrics baseline = metricStore.getBaseline();
if (current.getAccuracy() < baseline.getAccuracy() - 0.05) { // 5% drop
alertService.send(
"Accuracy regression detected: " +
baseline.getAccuracy() + " → " + current.getAccuracy()
);
}
metricStore.saveSnapshot(current);
}
}
5. Evaluate Retrieval Quality Separately
RAG systems need both retrieval and generation metrics.
@Service
public class RAGEvaluator {
private final VectorStore vectorStore;
private final ChatModel chatModel;
public RetrievalMetrics evaluateRetrieval(List<GoldenExample> dataset) {
int totalRelevant = 0;
int totalRetrieved = 0;
int truePositives = 0;
for (GoldenExample example : dataset) {
List<Document> retrieved = vectorStore.similaritySearch(
SearchRequest.query(example.getQuery()).withTopK(5)
);
Set<String> retrievedIds = retrieved.stream()
.map(Document::getId)
.collect(Collectors.toSet());
Set<String> relevantIds = new HashSet<>(example.getRequiredContext());
totalRelevant += relevantIds.size();
totalRetrieved += retrievedIds.size();
truePositives += (int) retrievedIds.stream().filter(relevantIds::contains).count();
}
double precision = totalRetrieved == 0 ? 0.0 : (double) truePositives / totalRetrieved;
double recall = totalRelevant == 0 ? 0.0 : (double) truePositives / totalRelevant;
double f1 = (precision + recall) == 0.0 ? 0.0 : 2 * (precision * recall) / (precision + recall);
return new RetrievalMetrics(precision, recall, f1);
}
}
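The precision/recall/F1 arithmetic can also be factored into a small standalone helper with explicit zero-guards, which makes it easy to unit-test independently of the vector store:

```java
// Stdlib sketch of the retrieval-metric arithmetic.
// Zero-guards avoid NaN when nothing was retrieved or relevant.
public class RetrievalMath {

    // Fraction of retrieved documents that were relevant
    public static double precision(int truePositives, int retrieved) {
        return retrieved == 0 ? 0.0 : (double) truePositives / retrieved;
    }

    // Fraction of relevant documents that were retrieved
    public static double recall(int truePositives, int relevant) {
        return relevant == 0 ? 0.0 : (double) truePositives / relevant;
    }

    // Harmonic mean of precision and recall
    public static double f1(double precision, double recall) {
        double denom = precision + recall;
        return denom == 0.0 ? 0.0 : 2 * precision * recall / denom;
    }
}
```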
Code Examples
Example 1: Basic Golden Dataset Evaluation
@Service
public class GoldenDatasetEvaluator {
private final ChatModel chatModel;
private final SemanticEvaluator semanticEvaluator;
public EvaluationReport evaluate(List<GoldenExample> dataset) {
int correct = 0;
List<FailedExample> failures = new ArrayList<>();
for (GoldenExample example : dataset) {
String actual = chatModel.call(new Prompt(example.getQuery()))
.getResult()
.getOutput()
.getContent();
if (semanticEvaluator.isCorrect(example.getExpectedAnswer(), actual)) {
correct++;
} else {
failures.add(new FailedExample(example, actual));
}
}
double accuracy = (double) correct / dataset.size();
return new EvaluationReport(accuracy, failures);
}
}
✅ Good for: Continuous integration, regression testing
❌ Not good for: Nuanced quality (use LLM-as-judge)
Example 2: Prompt A/B Testing
@Service
public class PromptABTester {
private final ChatModel chatModel;
private final LLMJudge judge;
public ABTestResult comparePrompts(
PromptTemplate templateA,
PromptTemplate templateB,
List<GoldenExample> dataset
) {
double scoreA = evaluatePrompt(templateA, dataset);
double scoreB = evaluatePrompt(templateB, dataset);
return new ABTestResult(
scoreA,
scoreB,
scoreB > scoreA ? templateB : templateA,
Math.abs(scoreB - scoreA)
);
}
private double evaluatePrompt(PromptTemplate template, List<GoldenExample> dataset) {
double totalScore = 0;
for (GoldenExample example : dataset) {
Prompt prompt = template.create(Map.of("query", example.getQuery()));
String answer = chatModel.call(prompt)
.getResult()
.getOutput()
.getContent();
EvaluationResult result = judge.evaluate(
example.getQuery(),
answer,
example.getExpectedAnswer()
);
totalScore += result.getScore();
}
return totalScore / dataset.size();
}
}
✅ Good for: Prompt optimization, version selection
❌ Not good for: Real-time decisions (slow evaluation)
Example 3: Faithfulness Evaluation (RAG)
@Service
public class FaithfulnessEvaluator {
private final ChatModel chatModel;
public double evaluateFaithfulness(String query, String context, String answer) {
String prompt = String.format("""
Determine if the answer is faithful to the context (1-5):
Query:
%s
Context:
%s
Answer:
%s
Scoring:
5 = Fully supported by context, no hallucinations
4 = Mostly supported, minor unsupported claims
3 = Partially supported
2 = Mostly unsupported
1 = Completely fabricated
Output only the score (1-5):
""", query, context, answer);
String response = chatModel.call(new Prompt(prompt))
.getResult()
.getOutput()
.getContent();
// Parse defensively: models sometimes wrap the score in extra text
java.util.regex.Matcher m = java.util.regex.Pattern.compile("[1-5]").matcher(response);
return m.find() ? Double.parseDouble(m.group()) : 0.0;
}
}
✅ Good for: Detecting hallucinations
❌ Not good for: When context is not available
Example 4: Cost and Latency Evaluation
@Service
public class PerformanceEvaluator {
private final ChatModel chatModel;
public PerformanceMetrics evaluate(List<String> queries) {
List<Long> latencies = new ArrayList<>();
int totalInputTokens = 0;
int totalOutputTokens = 0;
for (String query : queries) {
long start = System.currentTimeMillis();
ChatResponse response = chatModel.call(new Prompt(query));
long latency = System.currentTimeMillis() - start;
latencies.add(latency);
totalInputTokens += response.getMetadata().getUsage().getPromptTokens();
totalOutputTokens += response.getMetadata().getUsage().getGenerationTokens();
}
Collections.sort(latencies);
return new PerformanceMetrics(
latencies.get(latencies.size() / 2), // P50
latencies.get((int) (latencies.size() * 0.95)), // P95
totalInputTokens,
totalOutputTokens,
estimateCost(totalInputTokens, totalOutputTokens)
);
}
private double estimateCost(int inputTokens, int outputTokens) {
return (inputTokens * 0.00001) + (outputTokens * 0.00003); // Example pricing
}
}
✅ Good for: SLO validation, cost optimization
❌ Not good for: Quality evaluation (use semantic metrics)
Example 5: Automated Regression Suite
@RestController
@RequestMapping("/api/evaluation")
public class EvaluationController {
private final GoldenDatasetEvaluator evaluator;
private final GoldenDataset goldenDataset;
private final MetricStore metricStore;
@PostMapping("/run")
public EvaluationReport runEvaluation() {
return evaluator.evaluate(goldenDataset.load());
}
@GetMapping("/history")
public List<EvaluationSnapshot> getHistory() {
return metricStore.getSnapshots(Duration.ofDays(30));
}
@PostMapping("/baseline")
public void setBaseline(@RequestBody EvaluationMetrics metrics) {
metricStore.setBaseline(metrics);
}
}
✅ Good for: CI/CD integration, dashboards
❌ Not good for: Ad-hoc testing (use manual evaluation)
Anti-Patterns
❌ No Golden Dataset
// DON'T: Manual testing only
// "Looks good to me" ← Not scalable
Why: Cannot detect regressions; no objective quality measure.
✅ DO: Build and maintain golden dataset
List<GoldenExample> dataset = goldenDatasetBuilder.build();
EvaluationReport report = evaluator.evaluate(dataset);
❌ Exact String Matching
// DON'T: Too rigid
assertEquals(expected, actual);
Why: LLMs produce varied outputs; exact match fails for valid answers.
✅ DO: Use semantic similarity
assertTrue(semanticEvaluator.isCorrect(expected, actual));
❌ Evaluating Only Accuracy
// DON'T: Ignore latency and cost
double accuracy = evaluator.getAccuracy();
Why: Slow or expensive systems are not production-ready.
✅ DO: Evaluate multiple dimensions
EvaluationMetrics metrics = evaluator.evaluate();
assertTrue(metrics.getAccuracy() > 0.90);
assertTrue(metrics.getP95Latency() < 2000); // < 2 seconds
assertTrue(metrics.getCostPerQuery() < 0.05); // < $0.05
References
Related Skills
chat-models.md — LLM integration
prompt-templates.md — Prompt versioning
retrieval.md — RAG quality metrics
observability.md — Production monitoring