Observability
Overview
Observability in Spring AI applications requires tracking LLM-specific metrics beyond traditional application monitoring: token usage, prompt versions, model latency, cost attribution, and response quality. Effective observability enables cost optimization, SLO enforcement, and rapid incident response in production AI systems.
Key Concepts
The Three Pillars of AI Observability
┌─────────────────────────────────────────────────────────────┐
│                   AI Observability Stack                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  METRICS (Quantitative)                                     │
│  ──────────────────────                                     │
│  - Token usage (input/output)                               │
│  - Latency (P50, P95, P99)                                  │
│  - Cost per request                                         │
│  - Error rates (timeout, rate limit, server error)          │
│  - Cache hit rates                                          │
│                                                             │
│  LOGS (Contextual)                                          │
│  ─────────────────                                          │
│  - Prompt text (sampled or hashed)                          │
│  - Model responses                                          │
│  - Tool calls and arguments                                 │
│  - User IDs and session IDs                                 │
│  - Error messages and stack traces                          │
│                                                             │
│  TRACES (Distributed)                                       │
│  ────────────────────                                       │
│  - End-to-end request flow                                  │
│  - Retrieval → Generation → Post-processing                 │
│  - Multiple model calls in agentic workflows                │
│  - External API calls (VectorStore, databases)              │
│  - Latency attribution per component                        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Critical Metrics to Track
| Metric | Why It Matters | SLO Example |
|---|---|---|
| Input Tokens | Cost attribution | < 2000 tokens/request |
| Output Tokens | Cost, latency | < 500 tokens/response |
| Total Cost | Budget control | < $0.10/request |
| P95 Latency | User experience | < 2 seconds |
| Error Rate | Reliability | < 0.1% |
| Cache Hit Rate | Cost optimization | > 30% |
| Prompt Version | Regression tracking | N/A |
| Model Used | Cost/quality tradeoff | N/A |
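The per-request cost SLO in the table is easy to check mechanically once token counts are known. A minimal sketch (the `CostSlo` class and its per-token price map are illustrative, not official pricing):

```java
import java.util.Map;

// Hypothetical per-request cost check against the $0.10/request SLO above.
class CostSlo {

    // USD per input/output token (illustrative figures, not official pricing)
    static final Map<String, double[]> PRICES = Map.of(
            "gpt-4-turbo", new double[]{0.00001, 0.00003},
            "gpt-3.5-turbo", new double[]{0.0000005, 0.0000015});

    static final double COST_SLO_PER_REQUEST = 0.10;

    static double cost(String model, int inputTokens, int outputTokens) {
        double[] p = PRICES.getOrDefault(model, new double[]{0.0, 0.0});
        return inputTokens * p[0] + outputTokens * p[1];
    }

    static boolean withinSlo(String model, int inputTokens, int outputTokens) {
        return cost(model, inputTokens, outputTokens) <= COST_SLO_PER_REQUEST;
    }
}
```

In practice the token counts come from the response metadata, as shown in the aspect below.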
Best Practices
1. Instrument All LLM Calls
Use AOP to automatically track metrics without cluttering business logic.
@Component
@Aspect
public class LLMObservabilityAspect {

    private final MeterRegistry meterRegistry;
    private final Tracer tracer;

    public LLMObservabilityAspect(MeterRegistry meterRegistry, Tracer tracer) {
        this.meterRegistry = meterRegistry;
        this.tracer = tracer;
    }

    @Around("execution(* org.springframework.ai.chat.ChatModel.call(..))")
    public Object trackChatModelCall(ProceedingJoinPoint joinPoint) throws Throwable {
        Prompt prompt = (Prompt) joinPoint.getArgs()[0];

        // Extract metadata before the try block so it is visible in finally
        // (extractPromptVersion/extractModel are app-specific helpers, not shown)
        String promptVersion = extractPromptVersion(prompt);
        String model = extractModel(prompt);

        Span span = tracer.nextSpan().name("llm.call").start();
        Timer.Sample sample = Timer.start(meterRegistry);
        try (Tracer.SpanInScope ws = tracer.withSpan(span)) {
            // Add trace tags
            span.tag("prompt.version", promptVersion);
            span.tag("model", model);

            // Execute
            ChatResponse response = (ChatResponse) joinPoint.proceed();

            // Track token usage
            int inputTokens = response.getMetadata().getUsage().getPromptTokens();
            int outputTokens = response.getMetadata().getUsage().getGenerationTokens();

            meterRegistry.counter("llm.tokens.input",
                    "model", model,
                    "prompt.version", promptVersion
            ).increment(inputTokens);

            meterRegistry.counter("llm.tokens.output",
                    "model", model,
                    "prompt.version", promptVersion
            ).increment(outputTokens);

            // Track cost
            double cost = estimateCost(model, inputTokens, outputTokens);
            meterRegistry.summary("llm.cost",
                    "model", model
            ).record(cost);

            span.tag("tokens.input", String.valueOf(inputTokens));
            span.tag("tokens.output", String.valueOf(outputTokens));
            span.tag("cost", String.format("%.4f", cost));

            return response;
        } catch (Exception e) {
            span.tag("error", e.getClass().getSimpleName());
            meterRegistry.counter("llm.errors",
                    "error.type", e.getClass().getSimpleName()
            ).increment();
            throw e;
        } finally {
            sample.stop(meterRegistry.timer("llm.latency", "model", model));
            span.end();
        }
    }

    private double estimateCost(String model, int inputTokens, int outputTokens) {
        return switch (model) {
            case "gpt-4-turbo" -> (inputTokens * 0.00001) + (outputTokens * 0.00003);
            case "gpt-3.5-turbo" -> (inputTokens * 0.0000005) + (outputTokens * 0.0000015);
            case "claude-3-opus" -> (inputTokens * 0.000015) + (outputTokens * 0.000075);
            default -> 0.0;
        };
    }
}
2. Log Prompts and Responses (with PII Redaction)
Enable debugging without compromising user privacy.
@Component
@Aspect
public class LLMLoggingAspect {

    private static final Logger log = LoggerFactory.getLogger(LLMLoggingAspect.class);

    private final PIIRedactor piiRedactor;

    public LLMLoggingAspect(PIIRedactor piiRedactor) {
        this.piiRedactor = piiRedactor;
    }

    @AfterReturning(
        pointcut = "execution(* org.springframework.ai.chat.ChatModel.call(..))",
        returning = "response"
    )
    public void logLLMInteraction(JoinPoint joinPoint, ChatResponse response) {
        Prompt prompt = (Prompt) joinPoint.getArgs()[0];
        String promptText = prompt.getContents();
        String responseText = response.getResult().getOutput().getContent();

        // Redact PII
        String safePrompt = piiRedactor.redact(promptText);
        String safeResponse = piiRedactor.redact(responseText);

        // Structured logging (truncated to the first 100 chars)
        log.info("LLM Interaction: prompt={}, response={}, tokens={}, rateLimitRemaining={}",
                safePrompt.substring(0, Math.min(100, safePrompt.length())),
                safeResponse.substring(0, Math.min(100, safeResponse.length())),
                response.getMetadata().getUsage().getTotalTokens(),
                response.getMetadata().getRateLimit().getRequestsRemaining()
        );
    }
}
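The `PIIRedactor` used above is assumed rather than provided by Spring AI. A minimal regex-based sketch (the patterns are illustrative; production redaction usually calls for a dedicated library or NER model):

```java
import java.util.regex.Pattern;

// Hypothetical regex-based redactor; patterns are illustrative, not exhaustive.
class PIIRedactor {

    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern SSN =
            Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
    private static final Pattern CARD =
            Pattern.compile("\\b(?:\\d[ -]?){13,16}\\b");

    public String redact(String text) {
        String out = EMAIL.matcher(text).replaceAll("[EMAIL]");
        out = SSN.matcher(out).replaceAll("[SSN]");
        out = CARD.matcher(out).replaceAll("[CARD]");
        return out;
    }
}
```

Register it as a `@Component` so the aspect above can inject it.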
3. Create Custom Dashboards
Visualize AI-specific metrics in Grafana or similar.
# Example Grafana panels (Prometheus queries, YAML sketch).
# Metric names assume Micrometer's Prometheus conventions: dots become
# underscores, counters gain a _total suffix, timers are exported in seconds.
panels:
  - title: "Token Usage by Model"
    query: "sum by (model) (rate(llm_tokens_input_total[5m]) + rate(llm_tokens_output_total[5m]))"
  - title: "Cost Per Hour"
    query: "sum(rate(llm_cost_sum[5m])) * 3600"
  - title: "P95 Latency by Prompt Version"
    query: "histogram_quantile(0.95, sum by (prompt_version, le) (rate(llm_latency_seconds_bucket[5m])))"
  - title: "Error Rate"
    query: "sum(rate(llm_errors_total[5m])) / sum(rate(llm_latency_seconds_count[5m]))"
4. Implement Cost Alerting
Prevent runaway costs with automated alerts.
@Service
public class CostMonitoringService {

    private static final double HOURLY_BUDGET = 10.0; // $10/hour

    private final MeterRegistry meterRegistry;
    private final AlertService alertService;

    private double lastTotal = 0.0;

    public CostMonitoringService(MeterRegistry meterRegistry, AlertService alertService) {
        this.meterRegistry = meterRegistry;
        this.alertService = alertService;
    }

    @Scheduled(fixedDelay = 300000) // Every 5 minutes
    public void checkCostBudget() {
        DistributionSummary costSummary = meterRegistry.find("llm.cost").summary();
        double total = (costSummary != null) ? costSummary.totalAmount() : 0.0;

        // totalAmount() is cumulative, so diff against the previous reading
        // and extrapolate the 5-minute delta to an hourly rate (x12)
        double currentHourlyCost = (total - lastTotal) * 12;
        lastTotal = total;

        if (currentHourlyCost > HOURLY_BUDGET) {
            alertService.send(
                "CRITICAL: LLM cost exceeds budget",
                String.format("Current: $%.2f/hour, Budget: $%.2f/hour",
                        currentHourlyCost, HOURLY_BUDGET)
            );
        }
    }
}
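The extrapolation itself is plain arithmetic and worth isolating so it can be unit-tested. A hypothetical `CostExtrapolator` helper, assuming readings arrive every five minutes:

```java
// Hypothetical helper: turns cumulative cost readings taken every
// 5 minutes into an extrapolated hourly rate.
class CostExtrapolator {

    private double lastTotal = 0.0;

    /** @param total cumulative cost observed so far; call once per 5-minute tick */
    public double hourlyRate(double total) {
        double delta = total - lastTotal;
        lastTotal = total;
        return delta * 12; // 12 five-minute windows per hour
    }
}
```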
5. Track Tool Calls Separately
Understand which tools are most used and their success rates.
@Component
@Aspect
public class ToolCallObservability {

    private final MeterRegistry meterRegistry;

    public ToolCallObservability(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    @Around("@annotation(tool)")
    public Object trackToolCall(ProceedingJoinPoint joinPoint, Tool tool) throws Throwable {
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            Object result = joinPoint.proceed();
            meterRegistry.counter("llm.tool.calls.success",
                    "tool", tool.name()
            ).increment();
            return result;
        } catch (Exception e) {
            meterRegistry.counter("llm.tool.calls.failure",
                    "tool", tool.name(),
                    "error", e.getClass().getSimpleName()
            ).increment();
            throw e;
        } finally {
            sample.stop(meterRegistry.timer("llm.tool.duration",
                    "tool", tool.name()
            ));
        }
    }
}
Code Examples
Example 1: Custom Micrometer Metrics
@Configuration
public class LLMMetricsConfiguration {

    @Bean
    public MeterBinder llmMetrics(ChatModel chatModel) {
        return registry -> Gauge.builder("llm.model.info", chatModel, c -> 1.0)
                .description("LLM model information")
                .tag("model", chatModel.getClass().getSimpleName())
                .register(registry);
    }
}
Example 2: Distributed Tracing with Micrometer Tracing
@Service
public class RAGService {

    private final Tracer tracer;
    private final VectorStore vectorStore;
    private final ChatModel chatModel;

    public RAGService(Tracer tracer, VectorStore vectorStore, ChatModel chatModel) {
        this.tracer = tracer;
        this.vectorStore = vectorStore;
        this.chatModel = chatModel;
    }

    public String answer(String question) {
        Span ragSpan = tracer.nextSpan().name("rag.answer").start();
        try (Tracer.SpanInScope ws = tracer.withSpan(ragSpan)) {
            // Retrieval phase
            Span retrievalSpan = tracer.nextSpan().name("rag.retrieval").start();
            List<Document> context;
            try (Tracer.SpanInScope rs = tracer.withSpan(retrievalSpan)) {
                context = vectorStore.similaritySearch(question);
                retrievalSpan.tag("documents.retrieved", String.valueOf(context.size()));
            } finally {
                retrievalSpan.end();
            }

            // Generation phase
            Span generationSpan = tracer.nextSpan().name("rag.generation").start();
            String answer;
            try (Tracer.SpanInScope gs = tracer.withSpan(generationSpan)) {
                answer = chatModel.call(buildPrompt(question, context))
                        .getResult()
                        .getOutput()
                        .getContent();
            } finally {
                generationSpan.end();
            }
            return answer;
        } finally {
            ragSpan.end();
        }
    }
}
Example 3: Structured Logging with MDC
@Service
public class ConversationalService {

    private static final Logger log = LoggerFactory.getLogger(ConversationalService.class);

    private final ChatModel chatModel;

    public ConversationalService(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public String chat(String userId, String message) {
        // Add context to MDC so every log line in this request carries it
        MDC.put("user.id", userId);
        MDC.put("session.id", generateSessionId());
        try {
            log.info("Received message: {}", sanitize(message));
            String response = chatModel.call(new Prompt(message))
                    .getResult()
                    .getOutput()
                    .getContent();
            log.info("Generated response");
            return response;
        } finally {
            MDC.clear();
        }
    }
}
Example 4: Custom Health Indicator
@Component
public class LLMHealthIndicator implements HealthIndicator {

    private final ChatModel chatModel;

    public LLMHealthIndicator(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    @Override
    public Health health() {
        try {
            // Simple liveness probe; note this makes a billable model call
            // on every check, so keep the probe interval generous
            long start = System.currentTimeMillis();
            chatModel.call(new Prompt("test"));
            long latency = System.currentTimeMillis() - start;

            // Check recent error rate (getRecentErrorRate() is an app-specific helper)
            double errorRate = getRecentErrorRate();
            if (errorRate > 0.05) { // > 5% errors
                return Health.down()
                        .withDetail("error_rate", errorRate)
                        .build();
            }
            if (latency > 5000) { // > 5 seconds
                return Health.down()
                        .withDetail("latency", latency)
                        .build();
            }
            return Health.up()
                    .withDetail("latency", latency)
                    .withDetail("error_rate", errorRate)
                    .build();
        } catch (Exception e) {
            return Health.down(e).build();
        }
    }
}
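The `getRecentErrorRate()` helper is app-specific. One way to back it is a sliding window over recent call outcomes; a hypothetical `ErrorRateWindow` sketch:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sliding-window tracker behind getRecentErrorRate():
// keeps the last N outcomes and reports the fraction that failed.
class ErrorRateWindow {

    private final Deque<Boolean> outcomes = new ArrayDeque<>();
    private final int capacity;
    private int errors;

    ErrorRateWindow(int capacity) {
        this.capacity = capacity;
    }

    public synchronized void record(boolean success) {
        outcomes.addLast(success);
        if (!success) errors++;
        // Evict the oldest outcome once the window is full
        if (outcomes.size() > capacity && Boolean.FALSE.equals(outcomes.removeFirst())) {
            errors--;
        }
    }

    public synchronized double errorRate() {
        return outcomes.isEmpty() ? 0.0 : (double) errors / outcomes.size();
    }
}
```

The LLM observability aspect above could call `record(...)` from its success and error paths.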
Example 5: Prometheus Metrics Export
# application.yaml (Spring Boot 3 property layout)
management:
  endpoints:
    web:
      exposure:
        include: prometheus,health,metrics
  metrics:
    tags:
      application: ${spring.application.name}
      environment: ${ENVIRONMENT:dev}
  prometheus:
    metrics:
      export:
        enabled: true
@RestController
@RequestMapping("/actuator/llm-metrics")
public class LLMMetricsEndpoint {

    private final MeterRegistry meterRegistry;

    public LLMMetricsEndpoint(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    @GetMapping
    public Map<String, Object> getLLMMetrics() {
        // Aggregation helpers (getTotalTokens() etc.) query meterRegistry; not shown
        return Map.of(
            "total_tokens", getTotalTokens(),
            "total_cost", getTotalCost(),
            "avg_latency_ms", getAverageLatency(),
            "error_rate", getErrorRate()
        );
    }
}
Anti-Patterns
❌ No Token Tracking
// DON'T: Ignore token usage
chatModel.call(prompt); // No visibility into costs
Why: Cannot optimize costs or detect budget overruns.
✅ DO: Track every call
ChatResponse response = chatModel.call(prompt);
int tokens = response.getMetadata().getUsage().getTotalTokens();
meterRegistry.counter("llm.tokens").increment(tokens);
❌ Logging Full Prompts with PII
// DON'T: Log sensitive data
log.info("User query: {}", userMessage); // May contain SSN, credit card, etc.
Why: GDPR/CCPA violations, security risk.
✅ DO: Redact PII
log.info("User query: {}", piiRedactor.redact(userMessage));
❌ No Latency Attribution
// DON'T: Only track total time
long start = System.currentTimeMillis();
String answer = ragService.answer(question);
log.info("Total time: {}ms", System.currentTimeMillis() - start);
Why: Cannot identify bottlenecks (retrieval vs. generation).
✅ DO: Track per component
long retrievalTime = measureRetrieval();
long generationTime = measureGeneration();
log.info("Retrieval: {}ms, Generation: {}ms", retrievalTime, generationTime);
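Per-component measurement like this can be wrapped in a small reusable helper. A hypothetical `PhaseTimer` sketch that accumulates elapsed time per named phase:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper: runs a unit of work and accumulates its elapsed
// time under a phase name, so latency can be attributed per component.
class PhaseTimer {

    private final Map<String, Long> millisByPhase = new LinkedHashMap<>();

    public <T> T time(String phase, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            millisByPhase.merge(phase, (System.nanoTime() - start) / 1_000_000, Long::sum);
        }
    }

    public Map<String, Long> report() {
        return millisByPhase;
    }
}
```

Usage: `List<Document> docs = timer.time("retrieval", () -> vectorStore.similaritySearch(q));` then log or export `timer.report()`.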
References
Related Skills
- chat-models.md — LLM integration
- evaluation.md — Quality metrics
- failure-handling.md — Error tracking
- spring/actuator.md — Spring Boot monitoring