Chat Models
Overview
Spring AI’s ChatModel abstraction provides a unified interface for interacting with Large Language Models (LLMs) from multiple providers (OpenAI, Anthropic, Azure OpenAI, Bedrock, Vertex AI, Ollama). This abstraction enables provider-agnostic code, easier testing, and the flexibility to switch or route between models based on cost, latency, or capability requirements.
Key Concepts
ChatModel Interface
public interface ChatModel extends Model<Prompt, ChatResponse>, StreamingChatModel {

    ChatResponse call(Prompt prompt);
}

public interface StreamingChatModel extends StreamingModel<Prompt, ChatResponse> {

    Flux<ChatResponse> stream(Prompt prompt);
}
Model Selection Strategy
┌─────────────────────────────────────────────────────────────┐
│ Model Selection Matrix │
├─────────────────────────────────────────────────────────────┤
│ │
│ Use Case Model Rationale │
│ ──────────── ───── ───────── │
│ │
│ Simple classification GPT-3.5 Turbo Fast, cheap │
│ Customer support FAQ Claude Haiku Low latency │
│ Complex reasoning GPT-4 Turbo Accuracy │
│ Code generation Claude Sonnet Code quality │
│ Long documents Claude Opus Large context │
│ Local/private data Llama 3 (Ollama) No external API │
│ High throughput Cached GPT-4 Cost optimization │
│ │
└─────────────────────────────────────────────────────────────┘
Temperature and Determinism
- Temperature = 0: Greedy, effectively deterministic output (same input → same output in most cases; providers rarely guarantee bit-exact repeatability)
- Use for: Production workflows, testing, data extraction
- Temperature 0.3-0.5: Slightly creative but mostly consistent
- Use for: Customer support, FAQ answering
- Temperature 0.7-0.9: Creative and varied
- Use for: Content generation, brainstorming
- Temperature 1.0+: Maximum creativity/randomness
- Use for: Creative writing, exploration
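These bands can be centralized in a small policy helper so services don't scatter magic temperature values. The class, enum, and exact values below are illustrative examples, not Spring AI API:

```java
// Illustrative helper mapping the temperature bands above to use cases.
// The enum names and values are this document's examples, not Spring AI API.
class TemperaturePolicy {

    enum UseCase { DATA_EXTRACTION, CUSTOMER_SUPPORT, CONTENT_GENERATION, CREATIVE_WRITING }

    static double temperatureFor(UseCase useCase) {
        switch (useCase) {
            case DATA_EXTRACTION:    return 0.0; // deterministic-leaning
            case CUSTOMER_SUPPORT:   return 0.4; // mostly consistent
            case CONTENT_GENERATION: return 0.8; // creative and varied
            case CREATIVE_WRITING:   return 1.0; // maximum variation
            default: throw new IllegalArgumentException("Unknown use case");
        }
    }
}
```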
Best Practices
1. Use Provider-Agnostic Code
Always code against the ChatModel interface, not concrete implementations.
// ✅ GOOD: Provider-agnostic
@Service
public class SummaryService {

    private final ChatModel chatModel; // Interface, not OpenAiChatModel

    @Autowired
    public SummaryService(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public String summarize(String text) {
        return chatModel.call(new Prompt("Summarize: " + text))
                .getResult()
                .getOutput()
                .getContent();
    }
}
2. Configure Temperature Per Use Case
Production workflows should default to temperature = 0 for consistency.
# application.yaml
spring:
  ai:
    openai:
      chat:
        options:
          model: gpt-4-turbo
          temperature: 0.0 # Deterministic by default
3. Use Streaming for User-Facing Interactions
Stream responses to improve perceived performance whenever expected latency exceeds about 2 seconds.
@Service
public class ChatService {

    private final StreamingChatModel chatModel;

    public ChatService(StreamingChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public Flux<String> chatStream(String message) {
        return chatModel.stream(new Prompt(message))
                .map(response -> response.getResult().getOutput().getContent());
    }
}
4. Implement Model Routing for Cost Optimization
Route simple queries to cheap models, complex queries to expensive models.
@Service
public class SmartChatRouter {

    private final ChatModel cheapModel;     // e.g. GPT-3.5
    private final ChatModel expensiveModel; // e.g. GPT-4
    private final ComplexityClassifier classifier;

    // Qualifiers assume two ChatModel beans are registered under these names
    public SmartChatRouter(@Qualifier("cheapModel") ChatModel cheapModel,
                           @Qualifier("expensiveModel") ChatModel expensiveModel,
                           ComplexityClassifier classifier) {
        this.cheapModel = cheapModel;
        this.expensiveModel = expensiveModel;
        this.classifier = classifier;
    }

    public String answer(String query) {
        ChatModel selected = classifier.isComplex(query)
                ? expensiveModel
                : cheapModel;
        return selected.call(new Prompt(query))
                .getResult()
                .getOutput()
                .getContent();
    }
}
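The ComplexityClassifier the router depends on is not something Spring AI provides; a minimal keyword-and-length heuristic (purely illustrative — many systems instead ask a cheap model to grade difficulty) might look like:

```java
import java.util.Set;

// Hypothetical heuristic classifier for the router above. A real system
// might instead send the query to a cheap model and ask it to grade difficulty.
class ComplexityClassifier {

    private static final Set<String> COMPLEX_MARKERS = Set.of(
            "compare", "analyze", "explain why", "trade-off", "design", "prove");

    boolean isComplex(String query) {
        String q = query.toLowerCase();
        // Reasoning keywords or long queries go to the expensive model.
        boolean hasMarker = COMPLEX_MARKERS.stream().anyMatch(q::contains);
        return hasMarker || q.split("\\s+").length > 40;
    }
}
```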
5. Set Timeouts and Max Tokens
Prevent runaway costs and latency.
ChatResponse response = chatModel.call(
    new Prompt(
        "Explain quantum computing",
        ChatOptions.builder()
            .withModel("gpt-4-turbo")
            .withTemperature(0.0)
            .withMaxTokens(500) // Limit output length
            .build()
    )
);
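Note that withMaxTokens caps output length but not wall-clock time; request timeouts are typically configured on the underlying HTTP client. As a library-agnostic sketch (the TimeoutGuard name and approach are this document's example, not a Spring AI API), a hard timeout can be enforced around any blocking call:

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Illustrative wrapper enforcing a hard timeout on a blocking call; the
// Supplier would wrap chatModel.call(...). Not a Spring AI API.
class TimeoutGuard {

    static <T> T callWithTimeout(Supplier<T> call, Duration timeout) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(call);
        try {
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: abandon the in-flight call
            throw new IllegalStateException("LLM call timed out after " + timeout, e);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("LLM call failed", e);
        }
    }
}
```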
Code Examples
Example 1: Basic Synchronous Chat
@Service
public class QuestionAnswerService {

    private final ChatModel chatModel;

    public QuestionAnswerService(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public String answer(String question) {
        Prompt prompt = new Prompt(
            List.of(new UserMessage(question))
        );
        ChatResponse response = chatModel.call(prompt);
        return response.getResult().getOutput().getContent();
    }
}
✅ Good for: Simple Q&A, data extraction, classification
❌ Not good for: Long-running tasks without streaming
Example 2: Streaming Chat with Real-Time Updates
@RestController
public class StreamingChatController {

    private final StreamingChatModel chatModel;

    public StreamingChatController(StreamingChatModel chatModel) {
        this.chatModel = chatModel;
    }

    @GetMapping(value = "/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<ServerSentEvent<String>> streamChat(@RequestParam String message) {
        return chatModel.stream(new Prompt(message))
            .map(response -> response.getResult().getOutput().getContent())
            .map(content -> ServerSentEvent.builder(content).build());
    }
}
✅ Good for: Interactive UIs, long-form content generation
❌ Not good for: Batch processing, background jobs
Example 3: Multi-Turn Conversation
@Service
public class ConversationService {

    private final ChatModel chatModel;

    public ConversationService(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public String chat(List<Message> conversationHistory, String newMessage) {
        // Add the user's new message
        conversationHistory.add(new UserMessage(newMessage));

        // Create prompt with the full history
        Prompt prompt = new Prompt(conversationHistory);

        // Get response
        ChatResponse response = chatModel.call(prompt);
        String assistantReply = response.getResult().getOutput().getContent();

        // Add the assistant's response to history
        conversationHistory.add(new AssistantMessage(assistantReply));
        return assistantReply;
    }
}
✅ Good for: Chatbots, customer support
❌ Not good for: Unbounded conversations (will exceed context window)
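To keep the history bounded, trim old turns before each call. A minimal sliding-window sketch — the Message record here is a local stand-in, not Spring AI's Message interface — keeps the first (system) message plus the most recent N messages:

```java
import java.util.ArrayList;
import java.util.List;

// Sliding-window trimmer to keep conversation history bounded.
// The Message record is a local stand-in, not Spring AI's Message type.
class HistoryTrimmer {

    record Message(String role, String content) {}

    // Keep the first (system) message plus the most recent maxTurns messages.
    static List<Message> trim(List<Message> history, int maxTurns) {
        if (history.size() <= maxTurns + 1) {
            return history;
        }
        Message system = history.get(0);
        List<Message> result = new ArrayList<>();
        result.add(system);
        result.addAll(history.subList(history.size() - maxTurns, history.size()));
        return result;
    }
}
```

A token-based budget (rather than a message count) tracks the real context-window limit more closely; see memory.md for richer strategies such as summarizing old turns.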
Example 4: Model Fallback on Failure
@Service
public class ResilientChatService {

    private static final Logger log = LoggerFactory.getLogger(ResilientChatService.class);

    private final ChatModel primaryModel;
    private final ChatModel fallbackModel;

    public ResilientChatService(@Qualifier("primaryModel") ChatModel primaryModel,
                                @Qualifier("fallbackModel") ChatModel fallbackModel) {
        this.primaryModel = primaryModel;
        this.fallbackModel = fallbackModel;
    }

    @CircuitBreaker(name = "primaryLLM", fallbackMethod = "fallbackChat")
    public String chat(String message) {
        return primaryModel.call(new Prompt(message))
                .getResult()
                .getOutput()
                .getContent();
    }

    private String fallbackChat(String message, Exception e) {
        log.warn("Primary model failed, using fallback", e);
        return fallbackModel.call(new Prompt(message))
                .getResult()
                .getOutput()
                .getContent();
    }
}
✅ Good for: Production reliability
❌ Not good for: Cost-insensitive applications (double cost on retries)
Example 5: Multi-Provider Configuration
@Configuration
public class MultiModelConfig {

    @Bean
    @Primary
    public ChatModel defaultChatModel(
            @Value("${spring.ai.openai.api-key}") String openAiKey) {
        return new OpenAiChatModel(
            OpenAiApi.builder().apiKey(openAiKey).build(),
            OpenAiChatOptions.builder()
                .withModel("gpt-4-turbo")
                .withTemperature(0.0)
                .build()
        );
    }

    @Bean("cheapModel")
    public ChatModel cheapChatModel(
            @Value("${spring.ai.openai.api-key}") String openAiKey) {
        return new OpenAiChatModel(
            OpenAiApi.builder().apiKey(openAiKey).build(),
            OpenAiChatOptions.builder()
                .withModel("gpt-3.5-turbo")
                .withTemperature(0.0)
                .build()
        );
    }

    @Bean("localModel")
    public ChatModel localChatModel() {
        return new OllamaChatModel(
            OllamaApi.builder().baseUrl("http://localhost:11434").build(),
            OllamaOptions.builder()
                .withModel("llama3")
                .build()
        );
    }
}
Anti-Patterns
❌ Hardcoding Provider-Specific Logic
// DON'T: Tightly coupled to OpenAI
OpenAiChatModel openAi = new OpenAiChatModel(...);
String response = openAi.call(prompt);
Why: Cannot switch providers without code changes.
✅ DO: Use ChatModel interface
@Autowired
private ChatModel chatModel; // Can be OpenAI, Anthropic, etc.
❌ Using High Temperature in Production
// DON'T: Non-deterministic in production
ChatOptions.builder()
    .withTemperature(0.9) // Different output each time
    .build();
Why: Inconsistent behavior, harder to test, unpredictable UX.
✅ DO: Use temperature = 0 for production workflows
ChatOptions.builder()
    .withTemperature(0.0) // Same input → same output
    .build();
❌ Ignoring Token Limits
// DON'T: No max tokens limit
chatModel.call(new Prompt("Write a novel about..."));
Why: Runaway costs, unbounded latency, possible timeout.
✅ DO: Always set max tokens
chatModel.call(
    new Prompt(
        "Write a novel about...",
        ChatOptions.builder().withMaxTokens(1000).build()
    )
);
❌ Synchronous Calls in User-Facing APIs
// DON'T: Blocking call in controller
@GetMapping("/chat")
public String chat(@RequestParam String message) {
    return chatModel.call(new Prompt(message)) // Blocks for 2-10 seconds
        .getResult().getOutput().getContent();
}
Why: Poor UX, thread pool exhaustion, perceived latency.
✅ DO: Use streaming or async
@GetMapping(value = "/chat", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<String> chat(@RequestParam String message) {
    return streamingChatModel.stream(new Prompt(message))
        .map(r -> r.getResult().getOutput().getContent());
}
Testing Strategies
Unit Testing with Mocks
@ExtendWith(MockitoExtension.class)
class SummaryServiceTest {

    @Mock
    private ChatModel chatModel;

    @InjectMocks
    private SummaryService summaryService;

    @Test
    void shouldSummarizeText() {
        // Given
        when(chatModel.call(any(Prompt.class)))
            .thenReturn(new ChatResponse(
                List.of(new Generation(
                    new AssistantMessage("Summary of text")
                ))
            ));

        // When
        String summary = summaryService.summarize("Long text...");

        // Then
        assertEquals("Summary of text", summary);
    }
}
Integration Testing with Local Models
@SpringBootTest
@TestPropertySource(properties = {
    "spring.ai.ollama.base-url=http://localhost:11434",
    "spring.ai.ollama.chat.options.model=llama3"
})
class ChatServiceIntegrationTest {

    @Autowired
    private ChatModel chatModel;

    @Test
    void shouldAnswerQuestion() {
        String answer = chatModel.call(new Prompt("What is 2+2?"))
            .getResult()
            .getOutput()
            .getContent();
        assertTrue(answer.contains("4"));
    }
}
Golden Dataset Testing
@Test
void shouldProduceDeterministicOutputs() {
    List<String> questions = goldenDataset.getQuestions();
    List<String> expectedAnswers = goldenDataset.getExpectedAnswers();

    for (int i = 0; i < questions.size(); i++) {
        String actual = chatModel.call(
            new Prompt(
                questions.get(i),
                ChatOptions.builder().withTemperature(0.0).build()
            )
        ).getResult().getOutput().getContent();

        assertSimilar(expectedAnswers.get(i), actual, 0.9); // 90% similarity
    }
}
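The assertSimilar helper above is not a JUnit built-in. One minimal, self-contained implementation compares token-level Jaccard overlap; embedding-based cosine similarity (see embedding-models.md) is usually more robust:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for the golden-dataset test above: compares answers by
// token-level Jaccard similarity. Embedding cosine similarity is usually more
// robust but requires an EmbeddingModel.
class SimilarityAssert {

    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        return union.isEmpty() ? 1.0 : (double) intersection.size() / union.size();
    }

    static void assertSimilar(String expected, String actual, double threshold) {
        double score = jaccard(expected, actual);
        if (score < threshold) {
            throw new AssertionError("Similarity " + score + " below threshold " + threshold);
        }
    }

    private static Set<String> tokens(String s) {
        return new HashSet<>(Arrays.asList(s.toLowerCase().split("\\W+")));
    }
}
```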
Performance Considerations
| Concern | Strategy |
|---|---|
| Latency | Use streaming for > 2s responses; route simple queries to fast models |
| Cost | Route by complexity; cache responses; set max tokens |
| Throughput | Use async clients; batch requests; connection pooling |
| Context Window | Summarize old conversations; enforce message limits |
| Token Usage | Monitor input/output tokens; alert on anomalies |
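For the token-usage row, a rough pre-flight estimate can reject oversized prompts before they are sent. The ~4 characters per token ratio below is a common approximation for English text only; exact counts require the provider's tokenizer, so treat this sketch as a sanity check, not an accounting tool:

```java
// Rough token estimator for budgeting. The 4-chars-per-token ratio is a
// common English-text approximation; exact counts require the provider's
// tokenizer, so use this only as a pre-flight sanity check.
class TokenBudget {

    private static final double CHARS_PER_TOKEN = 4.0;

    static int estimateTokens(String text) {
        return (int) Math.ceil(text.length() / CHARS_PER_TOKEN);
    }

    // Fail fast before sending a prompt that cannot fit the model's window.
    static void checkFits(String prompt, int contextWindow, int reservedForOutput) {
        int estimate = estimateTokens(prompt);
        if (estimate > contextWindow - reservedForOutput) {
            throw new IllegalArgumentException(
                "Prompt ~" + estimate + " tokens exceeds budget of "
                + (contextWindow - reservedForOutput));
        }
    }
}
```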
Observability
Metrics to Track
@Component
@Aspect
public class ChatModelMetrics {

    private final MeterRegistry registry;

    public ChatModelMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    @Around("execution(* org.springframework.ai.chat.model.ChatModel.call(..))")
    public Object trackChatCall(ProceedingJoinPoint joinPoint) throws Throwable {
        Timer.Sample sample = Timer.start(registry);
        try {
            ChatResponse response = (ChatResponse) joinPoint.proceed();

            // Track token usage (extractModel reads the model name from
            // response metadata; implementation not shown)
            registry.counter("llm.tokens.input",
                    "model", extractModel(response))
                .increment(response.getMetadata().getUsage().getPromptTokens());
            registry.counter("llm.tokens.output",
                    "model", extractModel(response))
                .increment(response.getMetadata().getUsage().getGenerationTokens());

            return response;
        } finally {
            sample.stop(registry.timer("llm.call.duration"));
        }
    }
}
References
Related Skills
- embedding-models.md — Vector representations for semantic search
- prompt-templates.md — Structured, versioned prompts
- tool-calling.md — LLM-invoked functions
- memory.md — Conversation context management
- observability.md — Metrics and tracing
- failure-handling.md — Resilience patterns