Failure Handling
Overview
LLM APIs are inherently unreliable: they experience rate limits, timeouts, transient errors, and degraded performance. Production Spring AI applications must implement comprehensive failure handling including timeouts, retries with exponential backoff, circuit breakers, and graceful degradation. Never let LLM failures cascade into application failures.
Key Concepts
Common LLM Failure Modes
┌─────────────────────────────────────────────────────────────┐
│ LLM API Failure Taxonomy │
├─────────────────────────────────────────────────────────────┤
│ │
│ TRANSIENT FAILURES (Retry-able) │
│ ──────────────────────────────── │
│ - 429 Rate Limit Exceeded → Retry with backoff │
│ - 500 Internal Server Error → Retry up to 3 times │
│ - 503 Service Unavailable → Retry with backoff │
│ - Network timeout → Retry with timeout │
│ - Socket connection reset → Retry │
│ │
│ PERMANENT FAILURES (Don't Retry) │
│ ────────────────────────────────── │
│ - 400 Bad Request (invalid input) → Fix prompt, don't retry │
│ - 401 Unauthorized (bad API key) → Alert, don't retry │
│ - 404 Model Not Found → Fix config, don't retry │
│ - Token limit exceeded → Reduce prompt, don't retry│
│ │
│ DEGRADED PERFORMANCE (Fallback) │
│ ───────────────────────────────── │
│ - Latency > P95 SLO → Switch to faster model │
│ - Cost > budget threshold → Use cheaper model │
│ - Quality degradation → Revert prompt version │
│ │
└─────────────────────────────────────────────────────────────┘
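The status-code branches of this taxonomy can be sketched as a small classifier. This is an illustrative helper (the class name `FailureClassifier` and its methods are ours, not a Spring AI API); network-level failures such as timeouts and connection resets would be classified by exception type instead.

```java
// Illustrative sketch: map HTTP status codes from the taxonomy above to a
// retry decision. Class and method names are ours, not part of Spring AI.
public class FailureClassifier {

    /** Transient failures: retry with backoff (429, 500, 503). */
    public static boolean isRetryable(int httpStatus) {
        return httpStatus == 429 || httpStatus == 500 || httpStatus == 503;
    }

    /** Permanent failures: fix the request or config instead of retrying. */
    public static boolean isPermanent(int httpStatus) {
        // All 4xx except 429 (rate limit) are treated as caller errors
        return httpStatus >= 400 && httpStatus < 500 && httpStatus != 429;
    }
}
```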
Resilience Pattern Stack
┌──────────────────────────────────────────────────┐
│ Resilience Pattern Layers │
├──────────────────────────────────────────────────┤
│ │
│ TIMEOUT (Outermost) │
│ ─────── │
│ Prevent indefinite hangs │
│ Typical: 10-30 seconds for LLM calls │
│ │
│ │ │
│ ▼ │
│ CIRCUIT BREAKER │
│ ──────────────── │
│ Stop calling failing service │
│ Open after 5 consecutive failures │
│ Half-open after 30 seconds │
│ │
│ │ │
│ ▼ │
│ RETRY WITH BACKOFF │
│ ─────────────────── │
│ Retry transient errors │
│ Exponential backoff: 1s, 2s, 4s, 8s │
│ Max 3 retries │
│ │
│ │ │
│ ▼ │
│ FALLBACK │
│ ──────── │
│ Return degraded response on total failure │
│ Use cached response or simpler model │
│ │
└──────────────────────────────────────────────────┘
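A minimal, dependency-free sketch of this layering, assuming illustrative names (`ResilienceStack`, `callWithResilience`): timeout outermost, then retry with exponential backoff, then fallback. The circuit-breaker layer is omitted here; in practice Resilience4j provides it via the configuration shown under Best Practices.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// JDK-only sketch of the resilience layers above (names are illustrative).
public class ResilienceStack {

    static <T> T callWithResilience(Supplier<T> llmCall, T fallback,
                                    long timeoutMs, int maxRetries, long baseBackoffMs) {
        long backoff = baseBackoffMs;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            CompletableFuture<T> future = CompletableFuture.supplyAsync(llmCall);
            try {
                return future.get(timeoutMs, TimeUnit.MILLISECONDS); // TIMEOUT layer
            } catch (Exception e) {
                future.cancel(true);
                if (attempt == maxRetries) {
                    break; // retries exhausted, fall through to fallback
                }
                try {
                    Thread.sleep(backoff); // RETRY layer: 1s, 2s, 4s, ...
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
                backoff *= 2;
            }
        }
        return fallback; // FALLBACK layer
    }
}
```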
Best Practices
1. Configure Timeouts for All LLM Calls
Never allow unbounded waits.
@Configuration
public class LLMTimeoutConfig {
@Bean
public RestTemplate llmRestTemplate(RestTemplateBuilder builder) {
return builder
.setConnectTimeout(Duration.ofSeconds(5)) // Connection timeout
.setReadTimeout(Duration.ofSeconds(30)) // Read timeout
.build();
}
@Bean
public ChatModel chatModelWithTimeout(OpenAiApi openAiApi) {
    // The request timeout is enforced by the HTTP client configured above;
    // OpenAiChatModel takes the API handle and default options, not a Duration
    return new OpenAiChatModel(
        openAiApi,
        OpenAiChatOptions.builder()
            .withModel("gpt-4-turbo")
            .build()
    );
}
}
2. Implement Circuit Breaker Pattern
Use Resilience4j to prevent cascading failures.
@Service
public class ResilientChatService {
private final ChatModel chatModel;
@CircuitBreaker(name = "llm", fallbackMethod = "fallbackResponse")
@Retry(name = "llm", fallbackMethod = "fallbackResponse")
// Note: @TimeLimiter is omitted here because it requires an async
// (CompletableFuture) return type; the HTTP client timeout covers this call
public String chat(String message) {
return chatModel.call(new Prompt(message))
.getResult()
.getOutput()
.getContent();
}
private String fallbackResponse(String message, Exception e) {
log.warn("LLM call failed, using fallback", e);
return "I'm experiencing technical difficulties. Please try again in a moment.";
}
}
Configuration:
# application.yaml
resilience4j:
circuitbreaker:
instances:
llm:
sliding-window-size: 10
failure-rate-threshold: 50 # Open after 50% failures
wait-duration-in-open-state: 30s
permitted-number-of-calls-in-half-open-state: 3
retry:
instances:
llm:
max-attempts: 3
wait-duration: 1s
enable-exponential-backoff: true
exponential-backoff-multiplier: 2
retry-exceptions:
- org.springframework.web.client.HttpServerErrorException
- java.net.SocketTimeoutException
timelimiter:
instances:
llm:
timeout-duration: 30s
3. Implement Smart Retry Logic
Only retry transient errors; fail fast on permanent errors.
@Service
public class SmartRetryService {
private final ChatModel chatModel;
public String chat(String message) {
int attempts = 0;
int maxAttempts = 3;
long backoffMs = 1000;
while (attempts < maxAttempts) {
    try {
        return chatModel.call(new Prompt(message))
            .getResult()
            .getOutput()
            .getContent();
    } catch (HttpStatusCodeException e) {
        // 429 is transient despite being a 4xx; all other 4xx errors are permanent
        boolean isTransient = e instanceof HttpServerErrorException
            || e instanceof HttpClientErrorException.TooManyRequests;
        if (!isTransient) {
            throw new PermanentLLMException("Invalid request", e);
        }
        attempts++;
        if (attempts >= maxAttempts) {
            throw new TransientLLMException("Max retries exceeded", e);
        }
        log.warn("LLM call failed (attempt {}/{}), retrying in {}ms",
            attempts, maxAttempts, backoffMs);
        try {
            Thread.sleep(backoffMs);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new TransientLLMException("Interrupted during backoff", ie);
        }
        backoffMs *= 2; // Exponential backoff
    }
}
throw new TransientLLMException("Unexpected failure");
}
}
4. Implement Multi-Model Fallback
Degrade to cheaper/faster model on primary failure.
@Service
public class MultiModelService {
private final ChatModel primaryModel; // GPT-4
private final ChatModel fallbackModel; // GPT-3.5
private final ChatModel localModel; // Ollama (offline)
@CircuitBreaker(name = "primary", fallbackMethod = "useFallbackModel")
public String chat(String message) {
    return call(primaryModel, message);
}

// Resilience4j invokes fallback methods directly (not through the Spring proxy),
// so a @CircuitBreaker annotation here would be ignored; chain manually instead
private String useFallbackModel(String message, Exception e) {
    log.warn("Primary model failed, using fallback model", e);
    try {
        return call(fallbackModel, message);
    } catch (Exception fallbackFailure) {
        return useLocalModel(message, fallbackFailure);
    }
}

private String useLocalModel(String message, Exception e) {
    log.error("Both primary and fallback models failed, using local model", e);
    return call(localModel, message);
}

private String call(ChatModel model, String message) {
    return model.call(new Prompt(message))
        .getResult()
        .getOutput()
        .getContent();
}
}
5. Cache Responses for Identical Queries
Avoid redundant LLM calls; improve reliability and cost.
@Service
public class CachedLLMService {
private final ChatModel chatModel;
private final Cache<String, String> responseCache = Caffeine.newBuilder()
    .maximumSize(10_000)
    .expireAfterWrite(Duration.ofHours(1))
    .recordStats()
    .build();

// Longer-lived copy used only when the LLM is down: Caffeine never returns
// expired entries, so a separate cache is required for "stale" fallback reads
private final Cache<String, String> staleCache = Caffeine.newBuilder()
    .maximumSize(10_000)
    .expireAfterWrite(Duration.ofHours(24))
    .build();

public String chat(String message) {
    String cached = responseCache.getIfPresent(message);
    if (cached != null) {
        log.info("Cache hit for message hash: {}", Integer.toHexString(message.hashCode()));
        return cached;
    }
    try {
        String response = chatModel.call(new Prompt(message))
            .getResult()
            .getOutput()
            .getContent();
        responseCache.put(message, response);
        staleCache.put(message, response);
        return response;
    } catch (Exception e) {
        // On failure, fall back to the longer-lived stale cache
        String stale = staleCache.getIfPresent(message);
        if (stale != null) {
            log.warn("LLM failed, returning stale cached response", e);
            return stale + " (Note: This is a cached response)";
        }
        throw e;
    }
}
}
Code Examples
Example 1: Basic Timeout Handling
@Service
public class TimeoutAwareChatService {
private final ChatModel chatModel;
private static final Duration TIMEOUT = Duration.ofSeconds(30);
public String chat(String message) {
CompletableFuture<String> future = CompletableFuture.supplyAsync(() ->
chatModel.call(new Prompt(message))
.getResult()
.getOutput()
.getContent()
);
try {
return future.get(TIMEOUT.toSeconds(), TimeUnit.SECONDS);
} catch (TimeoutException e) {
future.cancel(true);
throw new LLMTimeoutException("LLM call exceeded timeout", e);
} catch (InterruptedException | ExecutionException e) {
throw new LLMException("LLM call failed", e);
}
}
}
✅ Good for: Simple timeout enforcement
❌ Not good for: Complex retry logic (use Resilience4j)
Example 2: Rate Limit Handling
@Service
public class RateLimitAwareService {
private final ChatModel chatModel;
private final RateLimiter rateLimiter = RateLimiter.create(10.0); // 10 requests/second
public String chat(String message) throws InterruptedException {
    // Wait up to 5 seconds for a client-side rate limit permit
    if (!rateLimiter.tryAcquire(5, TimeUnit.SECONDS)) {
        throw new RateLimitException("Rate limit exceeded");
    }
    try {
        return call(message);
    } catch (HttpClientErrorException.TooManyRequests e) {
        // Provider-side rate limit: honor Retry-After when present
        String retryAfter = e.getResponseHeaders() != null
            ? e.getResponseHeaders().getFirst("Retry-After") : null;
        long waitSeconds = retryAfter != null ? Long.parseLong(retryAfter) : 60;
        log.warn("Provider rate limit hit, waiting {} seconds", waitSeconds);
        Thread.sleep(waitSeconds * 1000L);
        return call(message); // Retry once
    }
}

private String call(String message) {
    return chatModel.call(new Prompt(message))
        .getResult()
        .getOutput()
        .getContent();
}
}
✅ Good for: Preventing rate limit errors
❌ Not good for: High-throughput scenarios (use queue)
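One subtlety with the `Retry-After` header: per RFC 7231 it may carry either delta-seconds (`"120"`) or an HTTP-date, so parsing it as an integer alone can throw. A defensive parser might look like the following sketch (the helper name `parseRetryAfterSeconds` is ours):

```java
import java.time.Duration;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Defensive Retry-After parsing (helper name is illustrative).
// RFC 7231 allows either delta-seconds ("120") or an HTTP-date.
public class RetryAfterParser {

    public static long parseRetryAfterSeconds(String headerValue, long defaultSeconds) {
        if (headerValue == null || headerValue.isBlank()) {
            return defaultSeconds;
        }
        try {
            return Long.parseLong(headerValue.trim());          // delta-seconds form
        } catch (NumberFormatException notANumber) {
            try {
                ZonedDateTime when = ZonedDateTime.parse(
                    headerValue.trim(), DateTimeFormatter.RFC_1123_DATE_TIME);
                long seconds = Duration.between(ZonedDateTime.now(), when).getSeconds();
                return Math.max(0, seconds);                    // date already passed -> 0
            } catch (Exception unparseable) {
                return defaultSeconds;                          // malformed header
            }
        }
    }
}
```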
Example 3: Graceful Degradation
@Service
public class DegradedModeService {
private final ChatModel chatModel;
private final Cache<String, String> emergencyCache;
public String chat(String message) {
try {
return chatModel.call(new Prompt(message))
    .getResult()
    .getOutput()
    .getContent();
} catch (Exception e) {
log.error("LLM call failed, attempting graceful degradation", e);
// Fallback 1: Cached response
String cached = emergencyCache.getIfPresent(message);
if (cached != null) {
return "[CACHED] " + cached;
}
// Fallback 2: Rule-based response
String ruleBasedResponse = getRuleBasedResponse(message);
if (ruleBasedResponse != null) {
return ruleBasedResponse;
}
// Fallback 3: Generic error message
return "I'm currently experiencing technical difficulties. " +
"Please contact support@example.com for immediate assistance.";
}
}
private String getRuleBasedResponse(String message) {
String lowerMessage = message.toLowerCase();
if (lowerMessage.contains("refund") || lowerMessage.contains("return")) {
return "Our refund policy allows returns within 30 days. " +
"Visit example.com/refunds for details.";
}
if (lowerMessage.contains("shipping") || lowerMessage.contains("delivery")) {
return "Standard shipping takes 5-7 business days. " +
"Track your order at example.com/track";
}
return null;
}
}
✅ Good for: Always-available systems
❌ Not good for: When accuracy is critical (fail instead)
Example 4: Async Processing with Timeout
@Service
public class AsyncLLMService {
private final ChatModel chatModel;
private final ExecutorService executor = Executors.newFixedThreadPool(10);
public CompletableFuture<String> chatAsync(String message) {
return CompletableFuture.supplyAsync(() -> {
try {
return chatModel.call(new Prompt(message))
.getResult()
.getOutput()
.getContent();
} catch (Exception e) {
throw new CompletionException(e);
}
}, executor)
.orTimeout(30, TimeUnit.SECONDS)
.exceptionally(e -> {
// orTimeout's TimeoutException may arrive wrapped in a CompletionException
if (e instanceof TimeoutException || e.getCause() instanceof TimeoutException) {
log.error("LLM call timed out after 30 seconds");
return "Request timed out. Please try a simpler query.";
}
log.error("LLM call failed", e);
return "An error occurred. Please try again.";
});
}
}
✅ Good for: Non-blocking user interfaces
❌ Not good for: Synchronous workflows
Example 5: Health-Based Model Selection
@Service
public class HealthAwareModelRouter {
private final List<ModelWithHealth> models;
public String chat(String message) {
// Try models in order of health
List<ModelWithHealth> sorted = models.stream()
.sorted(Comparator.comparing(ModelWithHealth::getHealth).reversed())
.toList();
for (ModelWithHealth model : sorted) {
if (model.getHealth() < 0.3) {
log.warn("Skipping unhealthy model: {}", model.getName());
continue;
}
try {
String response = model.getChatModel().call(new Prompt(message))
.getResult()
.getOutput()
.getContent();
model.recordSuccess();
return response;
} catch (Exception e) {
model.recordFailure();
log.warn("Model {} failed, trying next", model.getName(), e);
}
}
throw new AllModelsFailedException("All models unhealthy");
}
static class ModelWithHealth {
    private final String name;
    private final ChatModel chatModel;
    private final AtomicInteger successCount = new AtomicInteger(0);
    private final AtomicInteger failureCount = new AtomicInteger(0);

    ModelWithHealth(String name, ChatModel chatModel) {
        this.name = name;
        this.chatModel = chatModel;
    }

    public String getName() { return name; }
    public ChatModel getChatModel() { return chatModel; }
public double getHealth() {
int total = successCount.get() + failureCount.get();
return total == 0 ? 1.0 : (double) successCount.get() / total;
}
public void recordSuccess() {
successCount.incrementAndGet();
// Decay old failures
if (successCount.get() % 10 == 0) {
failureCount.set(failureCount.get() / 2);
}
}
public void recordFailure() {
failureCount.incrementAndGet();
}
}
}
✅ Good for: Multi-provider setups
❌ Not good for: Single provider (no alternatives)
Anti-Patterns
❌ No Timeout
// DON'T: Can hang indefinitely
String response = chatModel.call(prompt);
Why: LLM APIs can hang; ties up threads.
✅ DO: Always set timeout
@TimeLimiter(name = "llm") // timeout-duration: 30s configured in application.yaml; requires an async return type
CompletableFuture<String> response = chatAsync(prompt);
❌ Infinite Retries
// DON'T: Retry forever
while (true) {
try {
return chatModel.call(prompt);
} catch (Exception e) {
// Retry
}
}
Why: Wastes resources; may never recover.
✅ DO: Limit retries
@Retry(name = "llm") // max-attempts: 3 configured in application.yaml
return chatModel.call(prompt);
❌ Retrying Permanent Errors
// DON'T: Retry 401 Unauthorized
@Retry(name = "llm")
return chatModel.call(prompt); // Retries all errors!
Why: API key won’t magically fix itself.
✅ DO: Only retry transient errors
resilience4j.retry.instances.llm:
  retry-exceptions:
    - org.springframework.web.client.HttpServerErrorException
    - java.net.SocketTimeoutException
  ignore-exceptions:
    - org.springframework.web.client.HttpClientErrorException
Testing Strategies
Chaos Testing
@SpringBootTest
class ChaosTest {
@Autowired
private ResilientChatService chatService;
@MockBean
private ChatModel chatModel;
@Test
void shouldHandleTransientFailures() {
// Fail first 2 calls, succeed on 3rd
ChatResponse okResponse = mock(ChatResponse.class, RETURNS_DEEP_STUBS);
when(okResponse.getResult().getOutput().getContent()).thenReturn("ok");
when(chatModel.call(any(Prompt.class)))
    .thenThrow(new HttpServerErrorException(HttpStatus.INTERNAL_SERVER_ERROR))
    .thenThrow(new HttpServerErrorException(HttpStatus.INTERNAL_SERVER_ERROR))
    .thenReturn(okResponse);
String result = chatService.chat("test");
assertNotNull(result);
verify(chatModel, times(3)).call(any(Prompt.class));
}
@Test
void shouldUseFallbackAfterMaxRetries() {
when(chatModel.call(any(Prompt.class)))
    .thenThrow(new HttpServerErrorException(HttpStatus.INTERNAL_SERVER_ERROR));
String result = chatService.chat("test");
assertTrue(result.contains("technical difficulties"));
}
}
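The Mockito stubs above require a Spring test context. The same fail-N-times-then-succeed behavior can also be exercised with a plain JDK stub, which is handy for unit-testing hand-rolled retry loops without any framework. All names here (`FlakyStubDemo`, `flaky`, `retry`) are illustrative:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Dependency-free chaos stub: fails the first N calls, then succeeds.
// Names are illustrative, not taken from the examples above.
public class FlakyStubDemo {

    static Supplier<String> flaky(int failuresBeforeSuccess, AtomicInteger calls) {
        return () -> {
            if (calls.incrementAndGet() <= failuresBeforeSuccess) {
                throw new RuntimeException("simulated 500");
            }
            return "ok";
        };
    }

    static String retry(Supplier<String> call, int maxAttempts) {
        RuntimeException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
            }
        }
        throw last;
    }
}
```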
References
Related Skills
- observability.md — Error tracking
- chat-models.md — LLM integration
- resilience/circuit-breaker.md — Resilience patterns