Observability
Overview
Observability in Spring AI applications requires tracking LLM-specific metrics beyond traditional application monitoring: token usage, prompt versions, model latency, cost attribution, and response quality. Effective observability enables cost optimization, SLO enforcement, and rapid incident response in production AI systems.
Key Concepts
The Three Pillars of AI Observability
┌─────────────────────────────────────────────────────────────┐
│                   AI Observability Stack                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  METRICS (Quantitative)                                     │
│  ──────────────────────                                     │
│  - Token usage (input/output)                               │
│  - Latency (P50, P95, P99)                                  │
│  - Cost per request                                         │
│  - Error rates (timeout, rate limit, server error)          │
│  - Cache hit rates                                          │
│                                                             │
│  LOGS (Contextual)                                          │
│  ─────────────────                                          │
│  - Prompt text (sampled or hashed)                          │
│  - Model responses                                          │
│  - Tool calls and arguments                                 │
│  - User IDs and session IDs                                 │
│  - Error messages and stack traces                          │
│                                                             │
│  TRACES (Distributed)                                       │
│  ────────────────────                                       │
│  - End-to-end request flow                                  │
│  - Retrieval → Generation → Post-processing                 │
│  - Multiple model calls in agentic workflows                │
│  - External API calls (VectorStore, databases)              │
│  - Latency attribution per component                        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Critical Metrics to Track
| Metric | Why It Matters | SLO Example |
|---|---|---|
| Input Tokens | Cost attribution | < 2000 tokens/request |
| Output Tokens | Cost, latency | < 500 tokens/response |
| Total Cost | Budget control | < $0.10/request |
| P95 Latency | User experience | < 2 seconds |
| Error Rate | Reliability | < 0.1% |
| Cache Hit Rate | Cost optimization | > 30% |
| Prompt Version | Regression tracking | N/A |
| Model Used | Cost/quality tradeoff | N/A |
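The per-request cost SLO in the table is easy to check mechanically once token counts are known. A minimal sketch (the `CostSlo` class and its per-token price map are illustrative, not official pricing):

```java
import java.util.Map;

// Hypothetical per-request cost check against the $0.10/request SLO above.
class CostSlo {

    // USD per input/output token (illustrative figures, not official pricing)
    static final Map<String, double[]> PRICES = Map.of(
            "gpt-4-turbo", new double[]{0.00001, 0.00003},
            "gpt-3.5-turbo", new double[]{0.0000005, 0.0000015});

    static final double COST_SLO_PER_REQUEST = 0.10;

    static double cost(String model, int inputTokens, int outputTokens) {
        double[] p = PRICES.getOrDefault(model, new double[]{0.0, 0.0});
        return inputTokens * p[0] + outputTokens * p[1];
    }

    static boolean withinSlo(String model, int inputTokens, int outputTokens) {
        return cost(model, inputTokens, outputTokens) <= COST_SLO_PER_REQUEST;
    }
}
```

In practice the token counts come from the response metadata, as shown in the aspect below.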
Best Practices
1. Instrument All LLM Calls
Use AOP to automatically track metrics without cluttering business logic.
@Component
@Aspect
public class LLMObservabilityAspect {

    private final MeterRegistry meterRegistry;
    private final Tracer tracer;

    public LLMObservabilityAspect(MeterRegistry meterRegistry, Tracer tracer) {
        this.meterRegistry = meterRegistry;
        this.tracer = tracer;
    }

    @Around("execution(* org.springframework.ai.chat.ChatModel.call(..))")
    public Object trackChatModelCall(ProceedingJoinPoint joinPoint) throws Throwable {
        Prompt prompt = (Prompt) joinPoint.getArgs()[0];

        // Extract metadata before the try block so it is visible in finally
        // (extractPromptVersion/extractModel are app-specific helpers, not shown)
        String promptVersion = extractPromptVersion(prompt);
        String model = extractModel(prompt);

        Span span = tracer.nextSpan().name("llm.call").start();
        Timer.Sample sample = Timer.start(meterRegistry);
        try (Tracer.SpanInScope ws = tracer.withSpan(span)) {
            // Add trace tags
            span.tag("prompt.version", promptVersion);
            span.tag("model", model);

            // Execute
            ChatResponse response = (ChatResponse) joinPoint.proceed();

            // Track token usage
            int inputTokens = response.getMetadata().getUsage().getPromptTokens();
            int outputTokens = response.getMetadata().getUsage().getGenerationTokens();

            meterRegistry.counter("llm.tokens.input",
                    "model", model,
                    "prompt.version", promptVersion
            ).increment(inputTokens);

            meterRegistry.counter("llm.tokens.output",
                    "model", model,
                    "prompt.version", promptVersion
            ).increment(outputTokens);

            // Track cost
            double cost = estimateCost(model, inputTokens, outputTokens);
            meterRegistry.summary("llm.cost",
                    "model", model
            ).record(cost);

            span.tag("tokens.input", String.valueOf(inputTokens));
            span.tag("tokens.output", String.valueOf(outputTokens));
            span.tag("cost", String.format("%.4f", cost));

            return response;
        } catch (Exception e) {
            span.tag("error", e.getClass().getSimpleName());
            meterRegistry.counter("llm.errors",
                    "error.type", e.getClass().getSimpleName()
            ).increment();
            throw e;
        } finally {
            sample.stop(meterRegistry.timer("llm.latency", "model", model));
            span.end();
        }
    }

    private double estimateCost(String model, int inputTokens, int outputTokens) {
        return switch (model) {
            case "gpt-4-turbo" -> (inputTokens * 0.00001) + (outputTokens * 0.00003);
            case "gpt-3.5-turbo" -> (inputTokens * 0.0000005) + (outputTokens * 0.0000015);
            case "claude-3-opus" -> (inputTokens * 0.000015) + (outputTokens * 0.000075);
            default -> 0.0;
        };
    }
}
2. Log Prompts and Responses (with PII Redaction)
Enable debugging without compromising user privacy.
@Component
@Aspect
public class LLMLoggingAspect {

    private static final Logger log = LoggerFactory.getLogger(LLMLoggingAspect.class);

    private final PIIRedactor piiRedactor;

    public LLMLoggingAspect(PIIRedactor piiRedactor) {
        this.piiRedactor = piiRedactor;
    }

    @AfterReturning(
        pointcut = "execution(* org.springframework.ai.chat.ChatModel.call(..))",
        returning = "response"
    )
    public void logLLMInteraction(JoinPoint joinPoint, ChatResponse response) {
        Prompt prompt = (Prompt) joinPoint.getArgs()[0];
        String promptText = prompt.getContents();
        String responseText = response.getResult().getOutput().getContent();

        // Redact PII
        String safePrompt = piiRedactor.redact(promptText);
        String safeResponse = piiRedactor.redact(responseText);

        // Structured logging (truncated to the first 100 chars)
        log.info("LLM Interaction: prompt={}, response={}, tokens={}, rateLimitRemaining={}",
                safePrompt.substring(0, Math.min(100, safePrompt.length())),
                safeResponse.substring(0, Math.min(100, safeResponse.length())),
                response.getMetadata().getUsage().getTotalTokens(),
                response.getMetadata().getRateLimit().getRequestsRemaining()
        );
    }
}
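The `PIIRedactor` used above is assumed rather than provided by Spring AI. A minimal regex-based sketch (the patterns are illustrative; production redaction usually calls for a dedicated library or NER model):

```java
import java.util.regex.Pattern;

// Hypothetical regex-based redactor; patterns are illustrative, not exhaustive.
class PIIRedactor {

    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern SSN =
            Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
    private static final Pattern CARD =
            Pattern.compile("\\b(?:\\d[ -]?){13,16}\\b");

    public String redact(String text) {
        String out = EMAIL.matcher(text).replaceAll("[EMAIL]");
        out = SSN.matcher(out).replaceAll("[SSN]");
        out = CARD.matcher(out).replaceAll("[CARD]");
        return out;
    }
}
```

Register it as a `@Component` so the aspect above can inject it.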
3. Create Custom Dashboards
Visualize AI-specific metrics in Grafana or similar.
# Example Grafana panels (Prometheus queries, YAML sketch).
# Metric names assume Micrometer's Prometheus conventions: dots become
# underscores, counters gain a _total suffix, timers are exported in seconds.
panels:
  - title: "Token Usage by Model"
    query: "sum by (model) (rate(llm_tokens_input_total[5m]) + rate(llm_tokens_output_total[5m]))"
  - title: "Cost Per Hour"
    query: "sum(rate(llm_cost_sum[5m])) * 3600"
  - title: "P95 Latency by Prompt Version"
    query: "histogram_quantile(0.95, sum by (prompt_version, le) (rate(llm_latency_seconds_bucket[5m])))"
  - title: "Error Rate"
    query: "sum(rate(llm_errors_total[5m])) / sum(rate(llm_latency_seconds_count[5m]))"
4. Implement Cost Alerting
Prevent runaway costs with automated alerts.
@Service
public class CostMonitoringService {

    private static final double HOURLY_BUDGET = 10.0; // $10/hour

    private final MeterRegistry meterRegistry;
    private final AlertService alertService;

    private double lastTotal = 0.0;

    public CostMonitoringService(MeterRegistry meterRegistry, AlertService alertService) {
        this.meterRegistry = meterRegistry;
        this.alertService = alertService;
    }

    @Scheduled(fixedDelay = 300000) // Every 5 minutes
    public void checkCostBudget() {
        DistributionSummary costSummary = meterRegistry.find("llm.cost").summary();
        double total = (costSummary != null) ? costSummary.totalAmount() : 0.0;

        // totalAmount() is cumulative, so diff against the previous reading
        // and extrapolate the 5-minute delta to an hourly rate (x12)
        double currentHourlyCost = (total - lastTotal) * 12;
        lastTotal = total;

        if (currentHourlyCost > HOURLY_BUDGET) {
            alertService.send(
                "CRITICAL: LLM cost exceeds budget",
                String.format("Current: $%.2f/hour, Budget: $%.2f/hour",
                        currentHourlyCost, HOURLY_BUDGET)
            );
        }
    }
}
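The extrapolation itself is plain arithmetic and worth isolating so it can be unit-tested. A hypothetical `CostExtrapolator` helper, assuming readings arrive every five minutes:

```java
// Hypothetical helper: turns cumulative cost readings taken every
// 5 minutes into an extrapolated hourly rate.
class CostExtrapolator {

    private double lastTotal = 0.0;

    /** @param total cumulative cost observed so far; call once per 5-minute tick */
    public double hourlyRate(double total) {
        double delta = total - lastTotal;
        lastTotal = total;
        return delta * 12; // 12 five-minute windows per hour
    }
}
```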
5. Track Tool Calls Separately
Understand which tools are most used and their success rates.
@Component
@Aspect
public class ToolCallObservability {

    private final MeterRegistry meterRegistry;

    public ToolCallObservability(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    @Around("@annotation(tool)")
    public Object trackToolCall(ProceedingJoinPoint joinPoint, Tool tool) throws Throwable {
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            Object result = joinPoint.proceed();
            meterRegistry.counter("llm.tool.calls.success",
                    "tool", tool.name()
            ).increment();
            return result;
        } catch (Exception e) {
            meterRegistry.counter("llm.tool.calls.failure",
                    "tool", tool.name(),
                    "error", e.getClass().getSimpleName()
            ).increment();
            throw e;
        } finally {
            sample.stop(meterRegistry.timer("llm.tool.duration",
                    "tool", tool.name()
            ));
        }
    }
}
Code Examples
Example 1: Custom Micrometer Metrics
@Configuration
public class LLMMetricsConfiguration {

    @Bean
    public MeterBinder llmMetrics(ChatModel chatModel) {
        return registry -> Gauge.builder("llm.model.info", chatModel, c -> 1.0)
                .description("LLM model information")
                .tag("model", chatModel.getClass().getSimpleName())
                .register(registry);
    }
}
Example 2: Distributed Tracing with Micrometer Tracing
@Service
public class RAGService {

    private final Tracer tracer;
    private final VectorStore vectorStore;
    private final ChatModel chatModel;

    public RAGService(Tracer tracer, VectorStore vectorStore, ChatModel chatModel) {
        this.tracer = tracer;
        this.vectorStore = vectorStore;
        this.chatModel = chatModel;
    }

    public String answer(String question) {
        Span ragSpan = tracer.nextSpan().name("rag.answer").start();
        try (Tracer.SpanInScope ws = tracer.withSpan(ragSpan)) {
            // Retrieval phase
            Span retrievalSpan = tracer.nextSpan().name("rag.retrieval").start();
            List<Document> context;
            try (Tracer.SpanInScope rs = tracer.withSpan(retrievalSpan)) {
                context = vectorStore.similaritySearch(question);
                retrievalSpan.tag("documents.retrieved", String.valueOf(context.size()));
            } finally {
                retrievalSpan.end();
            }

            // Generation phase
            Span generationSpan = tracer.nextSpan().name("rag.generation").start();
            String answer;
            try (Tracer.SpanInScope gs = tracer.withSpan(generationSpan)) {
                answer = chatModel.call(buildPrompt(question, context))
                        .getResult()
                        .getOutput()
                        .getContent();
            } finally {
                generationSpan.end();
            }
            return answer;
        } finally {
            ragSpan.end();
        }
    }
}
Example 3: Structured Logging with MDC
@Service
public class ConversationalService {

    private static final Logger log = LoggerFactory.getLogger(ConversationalService.class);

    private final ChatModel chatModel;

    public ConversationalService(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public String chat(String userId, String message) {
        // Add context to MDC so every log line in this request carries it
        MDC.put("user.id", userId);
        MDC.put("session.id", generateSessionId());
        try {
            log.info("Received message: {}", sanitize(message));
            String response = chatModel.call(new Prompt(message))
                    .getResult()
                    .getOutput()
                    .getContent();
            log.info("Generated response");
            return response;
        } finally {
            MDC.clear();
        }
    }
}
Example 4: Custom Health Indicator
@Component
public class LLMHealthIndicator implements HealthIndicator {

    private final ChatModel chatModel;

    public LLMHealthIndicator(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    @Override
    public Health health() {
        try {
            // Simple liveness probe; note this makes a billable model call
            // on every check, so keep the probe interval generous
            long start = System.currentTimeMillis();
            chatModel.call(new Prompt("test"));
            long latency = System.currentTimeMillis() - start;

            // Check recent error rate (getRecentErrorRate() is an app-specific helper)
            double errorRate = getRecentErrorRate();
            if (errorRate > 0.05) { // > 5% errors
                return Health.down()
                        .withDetail("error_rate", errorRate)
                        .build();
            }
            if (latency > 5000) { // > 5 seconds
                return Health.down()
                        .withDetail("latency", latency)
                        .build();
            }
            return Health.up()
                    .withDetail("latency", latency)
                    .withDetail("error_rate", errorRate)
                    .build();
        } catch (Exception e) {
            return Health.down(e).build();
        }
    }
}
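The `getRecentErrorRate()` helper is app-specific. One way to back it is a sliding window over recent call outcomes; a hypothetical `ErrorRateWindow` sketch:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sliding-window tracker behind getRecentErrorRate():
// keeps the last N outcomes and reports the fraction that failed.
class ErrorRateWindow {

    private final Deque<Boolean> outcomes = new ArrayDeque<>();
    private final int capacity;
    private int errors;

    ErrorRateWindow(int capacity) {
        this.capacity = capacity;
    }

    public synchronized void record(boolean success) {
        outcomes.addLast(success);
        if (!success) errors++;
        // Evict the oldest outcome once the window is full
        if (outcomes.size() > capacity && Boolean.FALSE.equals(outcomes.removeFirst())) {
            errors--;
        }
    }

    public synchronized double errorRate() {
        return outcomes.isEmpty() ? 0.0 : (double) errors / outcomes.size();
    }
}
```

The LLM observability aspect above could call `record(...)` from its success and error paths.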
Example 5: Prometheus Metrics Export
# application.yaml (Spring Boot 3 property layout)
management:
  endpoints:
    web:
      exposure:
        include: prometheus,health,metrics
  metrics:
    tags:
      application: ${spring.application.name}
      environment: ${ENVIRONMENT:dev}
  prometheus:
    metrics:
      export:
        enabled: true
@RestController
@RequestMapping("/actuator/llm-metrics")
public class LLMMetricsEndpoint {

    private final MeterRegistry meterRegistry;

    public LLMMetricsEndpoint(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    @GetMapping
    public Map<String, Object> getLLMMetrics() {
        // Aggregation helpers (getTotalTokens() etc.) query meterRegistry; not shown
        return Map.of(
            "total_tokens", getTotalTokens(),
            "total_cost", getTotalCost(),
            "avg_latency_ms", getAverageLatency(),
            "error_rate", getErrorRate()
        );
    }
}
Anti-Patterns
❌ No Token Tracking
// DON'T: Ignore token usage
chatModel.call(prompt); // No visibility into costs
Why: Cannot optimize costs or detect budget overruns.
✅ DO: Track every call
ChatResponse response = chatModel.call(prompt);
int tokens = response.getMetadata().getUsage().getTotalTokens();
meterRegistry.counter("llm.tokens").increment(tokens);
❌ Logging Full Prompts with PII
// DON'T: Log sensitive data
log.info("User query: {}", userMessage); // May contain SSN, credit card, etc.
Why: GDPR/CCPA violations, security risk.
✅ DO: Redact PII
log.info("User query: {}", piiRedactor.redact(userMessage));
❌ No Latency Attribution
// DON'T: Only track total time
long start = System.currentTimeMillis();
String answer = ragService.answer(question);
log.info("Total time: {}ms", System.currentTimeMillis() - start);
Why: Cannot identify bottlenecks (retrieval vs. generation).
✅ DO: Track per component
long retrievalTime = measureRetrieval();
long generationTime = measureGeneration();
log.info("Retrieval: {}ms, Generation: {}ms", retrievalTime, generationTime);
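Per-component measurement like this can be wrapped in a small reusable helper. A hypothetical `PhaseTimer` sketch that accumulates elapsed time per named phase:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper: runs a unit of work and accumulates its elapsed
// time under a phase name, so latency can be attributed per component.
class PhaseTimer {

    private final Map<String, Long> millisByPhase = new LinkedHashMap<>();

    public <T> T time(String phase, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            millisByPhase.merge(phase, (System.nanoTime() - start) / 1_000_000, Long::sum);
        }
    }

    public Map<String, Long> report() {
        return millisByPhase;
    }
}
```

Usage: `List<Document> docs = timer.time("retrieval", () -> vectorStore.similaritySearch(q));` then log or export `timer.report()`.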
References
Related Skills
- chat-models.md — LLM integration
- evaluation.md — Quality metrics
- failure-handling.md — Error tracking
- spring/actuator.md — Spring Boot monitoring