Failure Handling
Overview
LLM APIs are inherently unreliable: they experience rate limits, timeouts, transient errors, and degraded performance. Production Spring AI applications must implement comprehensive failure handling including timeouts, retries with exponential backoff, circuit breakers, and graceful degradation. Never let LLM failures cascade into application failures.
Key Concepts
Common LLM Failure Modes
┌─────────────────────────────────────────────────────────────┐
│ LLM API Failure Taxonomy │
├─────────────────────────────────────────────────────────────┤
│ │
│ TRANSIENT FAILURES (Retry-able) │
│ ──────────────────────────────── │
│ - 429 Rate Limit Exceeded → Retry with backoff │
│ - 500 Internal Server Error → Retry up to 3 times │
│ - 503 Service Unavailable → Retry with backoff │
│ - Network timeout → Retry with timeout │
│ - Socket connection reset → Retry │
│ │
│ PERMANENT FAILURES (Don't Retry) │
│ ────────────────────────────────── │
│ - 400 Bad Request (invalid input) → Fix prompt, don't retry │
│ - 401 Unauthorized (bad API key) → Alert, don't retry │
│ - 404 Model Not Found → Fix config, don't retry │
│ - Token limit exceeded → Reduce prompt, don't retry│
│ │
│ DEGRADED PERFORMANCE (Fallback) │
│ ───────────────────────────────── │
│ - Latency > P95 SLO → Switch to faster model │
│ - Cost > budget threshold → Use cheaper model │
│ - Quality degradation → Revert prompt version │
│ │
└─────────────────────────────────────────────────────────────┘
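The status-code branches of this taxonomy can be sketched as a small classifier. This is an illustrative helper (the class name `FailureClassifier` and its methods are ours, not a Spring AI API); network-level failures such as timeouts and connection resets would be classified by exception type instead.

```java
// Illustrative sketch: map HTTP status codes from the taxonomy above to a
// retry decision. Class and method names are ours, not part of Spring AI.
public class FailureClassifier {

    /** Transient failures: retry with backoff (429, 500, 503). */
    public static boolean isRetryable(int httpStatus) {
        return httpStatus == 429 || httpStatus == 500 || httpStatus == 503;
    }

    /** Permanent failures: fix the request or config instead of retrying. */
    public static boolean isPermanent(int httpStatus) {
        // All 4xx except 429 (rate limit) are treated as caller errors
        return httpStatus >= 400 && httpStatus < 500 && httpStatus != 429;
    }
}
```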
Resilience Pattern Stack
┌──────────────────────────────────────────────────┐
│ Resilience Pattern Layers │
├──────────────────────────────────────────────────┤
│ │
│ TIMEOUT (Outermost) │
│ ─────── │
│ Prevent indefinite hangs │
│ Typical: 10-30 seconds for LLM calls │
│ │
│ │ │
│ ▼ │
│ CIRCUIT BREAKER │
│ ──────────────── │
│ Stop calling failing service │
│ Open after 5 consecutive failures │
│ Half-open after 30 seconds │
│ │
│ │ │
│ ▼ │
│ RETRY WITH BACKOFF │
│ ─────────────────── │
│ Retry transient errors │
│ Exponential backoff: 1s, 2s, 4s, 8s │
│ Max 3 retries │
│ │
│ │ │
│ ▼ │
│ FALLBACK │
│ ──────── │
│ Return degraded response on total failure │
│ Use cached response or simpler model │
│ │
└──────────────────────────────────────────────────┘
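A minimal, dependency-free sketch of this layering, assuming illustrative names (`ResilienceStack`, `callWithResilience`): timeout outermost, then retry with exponential backoff, then fallback. The circuit-breaker layer is omitted here; in practice Resilience4j provides it via the configuration shown under Best Practices.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// JDK-only sketch of the resilience layers above (names are illustrative).
public class ResilienceStack {

    static <T> T callWithResilience(Supplier<T> llmCall, T fallback,
                                    long timeoutMs, int maxRetries, long baseBackoffMs) {
        long backoff = baseBackoffMs;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            CompletableFuture<T> future = CompletableFuture.supplyAsync(llmCall);
            try {
                return future.get(timeoutMs, TimeUnit.MILLISECONDS); // TIMEOUT layer
            } catch (Exception e) {
                future.cancel(true);
                if (attempt == maxRetries) {
                    break; // retries exhausted, fall through to fallback
                }
                try {
                    Thread.sleep(backoff); // RETRY layer: 1s, 2s, 4s, ...
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
                backoff *= 2;
            }
        }
        return fallback; // FALLBACK layer
    }
}
```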
Best Practices
1. Configure Timeouts for All LLM Calls
Never allow unbounded waits.
@Configuration
public class LLMTimeoutConfig {
@Bean
public RestTemplate llmRestTemplate(RestTemplateBuilder builder) {
return builder
.setConnectTimeout(Duration.ofSeconds(5)) // Connection timeout
.setReadTimeout(Duration.ofSeconds(30)) // Read timeout
.build();
}
@Bean
public ChatModel chatModelWithTimeout(OpenAiApi openAiApi) {
    // The request timeout is enforced by the HTTP client configured above;
    // OpenAiChatModel takes the API handle and default options, not a Duration
    return new OpenAiChatModel(
        openAiApi,
        OpenAiChatOptions.builder()
            .withModel("gpt-4-turbo")
            .build()
    );
}
}
2. Implement Circuit Breaker Pattern
Use Resilience4j to prevent cascading failures.
@Service
public class ResilientChatService {
private final ChatModel chatModel;
@CircuitBreaker(name = "llm", fallbackMethod = "fallbackResponse")
@Retry(name = "llm", fallbackMethod = "fallbackResponse")
// Note: @TimeLimiter is omitted here because it requires an async
// (CompletableFuture) return type; the HTTP client timeout covers this call
public String chat(String message) {
return chatModel.call(new Prompt(message))
.getResult()
.getOutput()
.getContent();
}
private String fallbackResponse(String message, Exception e) {
log.warn("LLM call failed, using fallback", e);
return "I'm experiencing technical difficulties. Please try again in a moment.";
}
}
Configuration:
# application.yaml
resilience4j:
circuitbreaker:
instances:
llm:
sliding-window-size: 10
failure-rate-threshold: 50 # Open after 50% failures
wait-duration-in-open-state: 30s
permitted-number-of-calls-in-half-open-state: 3
retry:
instances:
llm:
max-attempts: 3
wait-duration: 1s
enable-exponential-backoff: true
exponential-backoff-multiplier: 2
retry-exceptions:
- org.springframework.web.client.HttpServerErrorException
- java.net.SocketTimeoutException
timelimiter:
instances:
llm:
timeout-duration: 30s
3. Implement Smart Retry Logic
Only retry transient errors; fail fast on permanent errors.
@Service
public class SmartRetryService {
private final ChatModel chatModel;
public String chat(String message) {
int attempts = 0;
int maxAttempts = 3;
long backoffMs = 1000;
while (attempts < maxAttempts) {
    try {
        return chatModel.call(new Prompt(message))
            .getResult()
            .getOutput()
            .getContent();
    } catch (HttpStatusCodeException e) {
        // 429 is transient despite being a 4xx; all other 4xx errors are permanent
        boolean isTransient = e instanceof HttpServerErrorException
            || e instanceof HttpClientErrorException.TooManyRequests;
        if (!isTransient) {
            throw new PermanentLLMException("Invalid request", e);
        }
        attempts++;
        if (attempts >= maxAttempts) {
            throw new TransientLLMException("Max retries exceeded", e);
        }
        log.warn("LLM call failed (attempt {}/{}), retrying in {}ms",
            attempts, maxAttempts, backoffMs);
        try {
            Thread.sleep(backoffMs);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new TransientLLMException("Interrupted during backoff", ie);
        }
        backoffMs *= 2; // Exponential backoff
    }
}
throw new TransientLLMException("Unexpected failure");
}
}
4. Implement Multi-Model Fallback
Degrade to cheaper/faster model on primary failure.
@Service
public class MultiModelService {
private final ChatModel primaryModel; // GPT-4
private final ChatModel fallbackModel; // GPT-3.5
private final ChatModel localModel; // Ollama (offline)
@CircuitBreaker(name = "primary", fallbackMethod = "useFallbackModel")
public String chat(String message) {
    return call(primaryModel, message);
}

// Resilience4j invokes fallback methods directly (not through the Spring proxy),
// so a @CircuitBreaker annotation here would be ignored; chain manually instead
private String useFallbackModel(String message, Exception e) {
    log.warn("Primary model failed, using fallback model", e);
    try {
        return call(fallbackModel, message);
    } catch (Exception fallbackFailure) {
        return useLocalModel(message, fallbackFailure);
    }
}

private String useLocalModel(String message, Exception e) {
    log.error("Both primary and fallback models failed, using local model", e);
    return call(localModel, message);
}

private String call(ChatModel model, String message) {
    return model.call(new Prompt(message))
        .getResult()
        .getOutput()
        .getContent();
}
}
5. Cache Responses for Identical Queries
Avoid redundant LLM calls; improve reliability and cost.
@Service
public class CachedLLMService {
private final ChatModel chatModel;
private final Cache<String, String> responseCache = Caffeine.newBuilder()
    .maximumSize(10_000)
    .expireAfterWrite(Duration.ofHours(1))
    .recordStats()
    .build();

// Longer-lived copy used only when the LLM is down: Caffeine never returns
// expired entries, so a separate cache is required for "stale" fallback reads
private final Cache<String, String> staleCache = Caffeine.newBuilder()
    .maximumSize(10_000)
    .expireAfterWrite(Duration.ofHours(24))
    .build();

public String chat(String message) {
    String cached = responseCache.getIfPresent(message);
    if (cached != null) {
        log.info("Cache hit for message hash: {}", Integer.toHexString(message.hashCode()));
        return cached;
    }
    try {
        String response = chatModel.call(new Prompt(message))
            .getResult()
            .getOutput()
            .getContent();
        responseCache.put(message, response);
        staleCache.put(message, response);
        return response;
    } catch (Exception e) {
        // On failure, fall back to the longer-lived stale cache
        String stale = staleCache.getIfPresent(message);
        if (stale != null) {
            log.warn("LLM failed, returning stale cached response", e);
            return stale + " (Note: This is a cached response)";
        }
        throw e;
    }
}
}
Code Examples
Example 1: Basic Timeout Handling
@Service
public class TimeoutAwareChatService {
private final ChatModel chatModel;
private static final Duration TIMEOUT = Duration.ofSeconds(30);
public String chat(String message) {
CompletableFuture<String> future = CompletableFuture.supplyAsync(() ->
chatModel.call(new Prompt(message))
.getResult()
.getOutput()
.getContent()
);
try {
return future.get(TIMEOUT.toSeconds(), TimeUnit.SECONDS);
} catch (TimeoutException e) {
future.cancel(true);
throw new LLMTimeoutException("LLM call exceeded timeout", e);
} catch (InterruptedException | ExecutionException e) {
throw new LLMException("LLM call failed", e);
}
}
}
✅ Good for: Simple timeout enforcement
❌ Not good for: Complex retry logic (use Resilience4j)
Example 2: Rate Limit Handling
@Service
public class RateLimitAwareService {
private final ChatModel chatModel;
private final RateLimiter rateLimiter = RateLimiter.create(10.0); // 10 requests/second
public String chat(String message) throws InterruptedException {
    // Wait up to 5 seconds for a client-side rate limit permit
    if (!rateLimiter.tryAcquire(5, TimeUnit.SECONDS)) {
        throw new RateLimitException("Rate limit exceeded");
    }
    try {
        return call(message);
    } catch (HttpClientErrorException.TooManyRequests e) {
        // Provider-side rate limit: honor Retry-After when present
        String retryAfter = e.getResponseHeaders() != null
            ? e.getResponseHeaders().getFirst("Retry-After") : null;
        long waitSeconds = retryAfter != null ? Long.parseLong(retryAfter) : 60;
        log.warn("Provider rate limit hit, waiting {} seconds", waitSeconds);
        Thread.sleep(waitSeconds * 1000L);
        return call(message); // Retry once
    }
}

private String call(String message) {
    return chatModel.call(new Prompt(message))
        .getResult()
        .getOutput()
        .getContent();
}
}
✅ Good for: Preventing rate limit errors
❌ Not good for: High-throughput scenarios (use queue)
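One subtlety with the `Retry-After` header: per RFC 7231 it may carry either delta-seconds (`"120"`) or an HTTP-date, so parsing it as an integer alone can throw. A defensive parser might look like the following sketch (the helper name `parseRetryAfterSeconds` is ours):

```java
import java.time.Duration;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Defensive Retry-After parsing (helper name is illustrative).
// RFC 7231 allows either delta-seconds ("120") or an HTTP-date.
public class RetryAfterParser {

    public static long parseRetryAfterSeconds(String headerValue, long defaultSeconds) {
        if (headerValue == null || headerValue.isBlank()) {
            return defaultSeconds;
        }
        try {
            return Long.parseLong(headerValue.trim());          // delta-seconds form
        } catch (NumberFormatException notANumber) {
            try {
                ZonedDateTime when = ZonedDateTime.parse(
                    headerValue.trim(), DateTimeFormatter.RFC_1123_DATE_TIME);
                long seconds = Duration.between(ZonedDateTime.now(), when).getSeconds();
                return Math.max(0, seconds);                    // date already passed -> 0
            } catch (Exception unparseable) {
                return defaultSeconds;                          // malformed header
            }
        }
    }
}
```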
Example 3: Graceful Degradation
@Service
public class DegradedModeService {
private final ChatModel chatModel;
private final Cache<String, String> emergencyCache;
public String chat(String message) {
try {
return chatModel.call(new Prompt(message))
    .getResult()
    .getOutput()
    .getContent();
} catch (Exception e) {
log.error("LLM call failed, attempting graceful degradation", e);
// Fallback 1: Cached response
String cached = emergencyCache.getIfPresent(message);
if (cached != null) {
return "[CACHED] " + cached;
}
// Fallback 2: Rule-based response
String ruleBasedResponse = getRuleBasedResponse(message);
if (ruleBasedResponse != null) {
return ruleBasedResponse;
}
// Fallback 3: Generic error message
return "I'm currently experiencing technical difficulties. " +
"Please contact support@example.com for immediate assistance.";
}
}
private String getRuleBasedResponse(String message) {
String lowerMessage = message.toLowerCase();
if (lowerMessage.contains("refund") || lowerMessage.contains("return")) {
return "Our refund policy allows returns within 30 days. " +
"Visit example.com/refunds for details.";
}
if (lowerMessage.contains("shipping") || lowerMessage.contains("delivery")) {
return "Standard shipping takes 5-7 business days. " +
"Track your order at example.com/track";
}
return null;
}
}
✅ Good for: Always-available systems
❌ Not good for: When accuracy is critical (fail instead)
Example 4: Async Processing with Timeout
@Service
public class AsyncLLMService {
private final ChatModel chatModel;
private final ExecutorService executor = Executors.newFixedThreadPool(10);
public CompletableFuture<String> chatAsync(String message) {
return CompletableFuture.supplyAsync(() -> {
try {
return chatModel.call(new Prompt(message))
.getResult()
.getOutput()
.getContent();
} catch (Exception e) {
throw new CompletionException(e);
}
}, executor)
.orTimeout(30, TimeUnit.SECONDS)
.exceptionally(e -> {
// orTimeout's TimeoutException may arrive wrapped in a CompletionException
if (e instanceof TimeoutException || e.getCause() instanceof TimeoutException) {
log.error("LLM call timed out after 30 seconds");
return "Request timed out. Please try a simpler query.";
}
log.error("LLM call failed", e);
return "An error occurred. Please try again.";
});
}
}
✅ Good for: Non-blocking user interfaces
❌ Not good for: Synchronous workflows
Example 5: Health-Based Model Selection
@Service
public class HealthAwareModelRouter {
private final List<ModelWithHealth> models;
public String chat(String message) {
// Try models in order of health
List<ModelWithHealth> sorted = models.stream()
.sorted(Comparator.comparing(ModelWithHealth::getHealth).reversed())
.toList();
for (ModelWithHealth model : sorted) {
if (model.getHealth() < 0.3) {
log.warn("Skipping unhealthy model: {}", model.getName());
continue;
}
try {
String response = model.getChatModel().call(new Prompt(message))
.getResult()
.getOutput()
.getContent();
model.recordSuccess();
return response;
} catch (Exception e) {
model.recordFailure();
log.warn("Model {} failed, trying next", model.getName(), e);
}
}
throw new AllModelsFailedException("All models unhealthy");
}
static class ModelWithHealth {
    private final String name;
    private final ChatModel chatModel;
    private final AtomicInteger successCount = new AtomicInteger(0);
    private final AtomicInteger failureCount = new AtomicInteger(0);

    ModelWithHealth(String name, ChatModel chatModel) {
        this.name = name;
        this.chatModel = chatModel;
    }

    public String getName() { return name; }
    public ChatModel getChatModel() { return chatModel; }
public double getHealth() {
int total = successCount.get() + failureCount.get();
return total == 0 ? 1.0 : (double) successCount.get() / total;
}
public void recordSuccess() {
successCount.incrementAndGet();
// Decay old failures
if (successCount.get() % 10 == 0) {
failureCount.set(failureCount.get() / 2);
}
}
public void recordFailure() {
failureCount.incrementAndGet();
}
}
}
✅ Good for: Multi-provider setups
❌ Not good for: Single provider (no alternatives)
Anti-Patterns
❌ No Timeout
// DON'T: Can hang indefinitely
String response = chatModel.call(prompt);
Why: LLM APIs can hang; ties up threads.
✅ DO: Always set timeout
@TimeLimiter(name = "llm") // timeout-duration: 30s configured in application.yaml; requires an async return type
CompletableFuture<String> response = chatAsync(prompt);
❌ Infinite Retries
// DON'T: Retry forever
while (true) {
try {
return chatModel.call(prompt);
} catch (Exception e) {
// Retry
}
}
Why: Wastes resources; may never recover.
✅ DO: Limit retries
@Retry(name = "llm") // max-attempts: 3 configured in application.yaml
return chatModel.call(prompt);
❌ Retrying Permanent Errors
// DON'T: Retry 401 Unauthorized
@Retry(name = "llm")
return chatModel.call(prompt); // Retries all errors!
Why: API key won’t magically fix itself.
✅ DO: Only retry transient errors
resilience4j.retry.instances.llm:
  retry-exceptions:
    - org.springframework.web.client.HttpServerErrorException
    - java.net.SocketTimeoutException
  ignore-exceptions:
    - org.springframework.web.client.HttpClientErrorException
Testing Strategies
Chaos Testing
@SpringBootTest
class ChaosTest {
@Autowired
private ResilientChatService chatService;
@MockBean
private ChatModel chatModel;
@Test
void shouldHandleTransientFailures() {
// Fail first 2 calls, succeed on 3rd
ChatResponse okResponse = mock(ChatResponse.class, RETURNS_DEEP_STUBS);
when(okResponse.getResult().getOutput().getContent()).thenReturn("ok");
when(chatModel.call(any(Prompt.class)))
    .thenThrow(new HttpServerErrorException(HttpStatus.INTERNAL_SERVER_ERROR))
    .thenThrow(new HttpServerErrorException(HttpStatus.INTERNAL_SERVER_ERROR))
    .thenReturn(okResponse);
String result = chatService.chat("test");
assertNotNull(result);
verify(chatModel, times(3)).call(any(Prompt.class));
}
@Test
void shouldUseFallbackAfterMaxRetries() {
when(chatModel.call(any(Prompt.class)))
    .thenThrow(new HttpServerErrorException(HttpStatus.INTERNAL_SERVER_ERROR));
String result = chatService.chat("test");
assertTrue(result.contains("technical difficulties"));
}
}
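The Mockito stubs above require a Spring test context. The same fail-N-times-then-succeed behavior can also be exercised with a plain JDK stub, which is handy for unit-testing hand-rolled retry loops without any framework. All names here (`FlakyStubDemo`, `flaky`, `retry`) are illustrative:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Dependency-free chaos stub: fails the first N calls, then succeeds.
// Names are illustrative, not taken from the examples above.
public class FlakyStubDemo {

    static Supplier<String> flaky(int failuresBeforeSuccess, AtomicInteger calls) {
        return () -> {
            if (calls.incrementAndGet() <= failuresBeforeSuccess) {
                throw new RuntimeException("simulated 500");
            }
            return "ok";
        };
    }

    static String retry(Supplier<String> call, int maxAttempts) {
        RuntimeException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
            }
        }
        throw last;
    }
}
```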
References
Related Skills
- observability.md — Error tracking
- chat-models.md — LLM integration
- resilience/circuit-breaker.md — Resilience patterns