Chat Models
Overview
Spring AI’s ChatModel abstraction provides a unified interface for interacting with Large Language Models (LLMs) from multiple providers (OpenAI, Anthropic, Azure OpenAI, Bedrock, Vertex AI, Ollama). This abstraction enables provider-agnostic code, easier testing, and the flexibility to switch or route between models based on cost, latency, or capability requirements.
Key Concepts
ChatModel Interface
public interface ChatModel extends Model<Prompt, ChatResponse>, StreamingChatModel {

    ChatResponse call(Prompt prompt);
}

public interface StreamingChatModel extends StreamingModel<Prompt, ChatResponse> {

    Flux<ChatResponse> stream(Prompt prompt);
}
Model Selection Strategy
┌─────────────────────────────────────────────────────────────┐
│ Model Selection Matrix │
├─────────────────────────────────────────────────────────────┤
│ │
│ Use Case Model Rationale │
│ ──────────── ───── ───────── │
│ │
│ Simple classification GPT-3.5 Turbo Fast, cheap │
│ Customer support FAQ Claude Haiku Low latency │
│ Complex reasoning GPT-4 Turbo Accuracy │
│ Code generation Claude Sonnet Code quality │
│ Long documents Claude Opus Large context │
│ Local/private data Llama 3 (Ollama) No external API │
│ High throughput Cached GPT-4 Cost optimization │
│ │
└─────────────────────────────────────────────────────────────┘
Temperature and Determinism
- Temperature = 0: Greedy, effectively deterministic output (same input → same output in most cases; providers rarely guarantee bit-exact repeatability)
- Use for: Production workflows, testing, data extraction
- Temperature 0.3-0.5: Slightly creative but mostly consistent
- Use for: Customer support, FAQ answering
- Temperature 0.7-0.9: Creative and varied
- Use for: Content generation, brainstorming
- Temperature 1.0+: Maximum creativity/randomness
- Use for: Creative writing, exploration
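These bands can be centralized in a small policy helper so services don't scatter magic temperature values. The class, enum, and exact values below are illustrative examples, not Spring AI API:

```java
// Illustrative helper mapping the temperature bands above to use cases.
// The enum names and values are this document's examples, not Spring AI API.
class TemperaturePolicy {

    enum UseCase { DATA_EXTRACTION, CUSTOMER_SUPPORT, CONTENT_GENERATION, CREATIVE_WRITING }

    static double temperatureFor(UseCase useCase) {
        switch (useCase) {
            case DATA_EXTRACTION:    return 0.0; // deterministic-leaning
            case CUSTOMER_SUPPORT:   return 0.4; // mostly consistent
            case CONTENT_GENERATION: return 0.8; // creative and varied
            case CREATIVE_WRITING:   return 1.0; // maximum variation
            default: throw new IllegalArgumentException("Unknown use case");
        }
    }
}
```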
Best Practices
1. Use Provider-Agnostic Code
Always code against the ChatModel interface, not concrete implementations.
// ✅ GOOD: Provider-agnostic
@Service
public class SummaryService {

    private final ChatModel chatModel; // Interface, not OpenAiChatModel

    @Autowired
    public SummaryService(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public String summarize(String text) {
        return chatModel.call(new Prompt("Summarize: " + text))
                .getResult()
                .getOutput()
                .getContent();
    }
}
2. Configure Temperature Per Use Case
Production workflows should default to temperature = 0 for consistency.
# application.yaml
spring:
  ai:
    openai:
      chat:
        options:
          model: gpt-4-turbo
          temperature: 0.0 # Deterministic by default
3. Use Streaming for User-Facing Interactions
Stream responses to improve perceived performance whenever expected latency exceeds about 2 seconds.
@Service
public class ChatService {

    private final StreamingChatModel chatModel;

    public ChatService(StreamingChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public Flux<String> chatStream(String message) {
        return chatModel.stream(new Prompt(message))
                .map(response -> response.getResult().getOutput().getContent());
    }
}
4. Implement Model Routing for Cost Optimization
Route simple queries to cheap models, complex queries to expensive models.
@Service
public class SmartChatRouter {

    private final ChatModel cheapModel;     // e.g. GPT-3.5
    private final ChatModel expensiveModel; // e.g. GPT-4
    private final ComplexityClassifier classifier;

    // Qualifiers assume two ChatModel beans are registered under these names
    public SmartChatRouter(@Qualifier("cheapModel") ChatModel cheapModel,
                           @Qualifier("expensiveModel") ChatModel expensiveModel,
                           ComplexityClassifier classifier) {
        this.cheapModel = cheapModel;
        this.expensiveModel = expensiveModel;
        this.classifier = classifier;
    }

    public String answer(String query) {
        ChatModel selected = classifier.isComplex(query)
                ? expensiveModel
                : cheapModel;
        return selected.call(new Prompt(query))
                .getResult()
                .getOutput()
                .getContent();
    }
}
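The ComplexityClassifier the router depends on is not something Spring AI provides; a minimal keyword-and-length heuristic (purely illustrative — many systems instead ask a cheap model to grade difficulty) might look like:

```java
import java.util.Set;

// Hypothetical heuristic classifier for the router above. A real system
// might instead send the query to a cheap model and ask it to grade difficulty.
class ComplexityClassifier {

    private static final Set<String> COMPLEX_MARKERS = Set.of(
            "compare", "analyze", "explain why", "trade-off", "design", "prove");

    boolean isComplex(String query) {
        String q = query.toLowerCase();
        // Reasoning keywords or long queries go to the expensive model.
        boolean hasMarker = COMPLEX_MARKERS.stream().anyMatch(q::contains);
        return hasMarker || q.split("\\s+").length > 40;
    }
}
```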
5. Set Timeouts and Max Tokens
Prevent runaway costs and latency.
ChatResponse response = chatModel.call(
    new Prompt(
        "Explain quantum computing",
        ChatOptions.builder()
            .withModel("gpt-4-turbo")
            .withTemperature(0.0)
            .withMaxTokens(500) // Limit output length
            .build()
    )
);
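Note that withMaxTokens caps output length but not wall-clock time; request timeouts are typically configured on the underlying HTTP client. As a library-agnostic sketch (the TimeoutGuard name and approach are this document's example, not a Spring AI API), a hard timeout can be enforced around any blocking call:

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Illustrative wrapper enforcing a hard timeout on a blocking call; the
// Supplier would wrap chatModel.call(...). Not a Spring AI API.
class TimeoutGuard {

    static <T> T callWithTimeout(Supplier<T> call, Duration timeout) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(call);
        try {
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: abandon the in-flight call
            throw new IllegalStateException("LLM call timed out after " + timeout, e);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("LLM call failed", e);
        }
    }
}
```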
Code Examples
Example 1: Basic Synchronous Chat
@Service
public class QuestionAnswerService {

    private final ChatModel chatModel;

    public QuestionAnswerService(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public String answer(String question) {
        Prompt prompt = new Prompt(
            List.of(new UserMessage(question))
        );
        ChatResponse response = chatModel.call(prompt);
        return response.getResult().getOutput().getContent();
    }
}
✅ Good for: Simple Q&A, data extraction, classification
❌ Not good for: Long-running tasks without streaming
Example 2: Streaming Chat with Real-Time Updates
@RestController
public class StreamingChatController {

    private final StreamingChatModel chatModel;

    public StreamingChatController(StreamingChatModel chatModel) {
        this.chatModel = chatModel;
    }

    @GetMapping(value = "/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<ServerSentEvent<String>> streamChat(@RequestParam String message) {
        return chatModel.stream(new Prompt(message))
            .map(response -> response.getResult().getOutput().getContent())
            .map(content -> ServerSentEvent.builder(content).build());
    }
}
✅ Good for: Interactive UIs, long-form content generation
❌ Not good for: Batch processing, background jobs
Example 3: Multi-Turn Conversation
@Service
public class ConversationService {

    private final ChatModel chatModel;

    public ConversationService(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public String chat(List<Message> conversationHistory, String newMessage) {
        // Add the user's new message
        conversationHistory.add(new UserMessage(newMessage));

        // Create prompt with the full history
        Prompt prompt = new Prompt(conversationHistory);

        // Get response
        ChatResponse response = chatModel.call(prompt);
        String assistantReply = response.getResult().getOutput().getContent();

        // Add the assistant's response to history
        conversationHistory.add(new AssistantMessage(assistantReply));
        return assistantReply;
    }
}
✅ Good for: Chatbots, customer support
❌ Not good for: Unbounded conversations (will exceed context window)
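To keep the history bounded, trim old turns before each call. A minimal sliding-window sketch — the Message record here is a local stand-in, not Spring AI's Message interface — keeps the first (system) message plus the most recent N messages:

```java
import java.util.ArrayList;
import java.util.List;

// Sliding-window trimmer to keep conversation history bounded.
// The Message record is a local stand-in, not Spring AI's Message type.
class HistoryTrimmer {

    record Message(String role, String content) {}

    // Keep the first (system) message plus the most recent maxTurns messages.
    static List<Message> trim(List<Message> history, int maxTurns) {
        if (history.size() <= maxTurns + 1) {
            return history;
        }
        Message system = history.get(0);
        List<Message> result = new ArrayList<>();
        result.add(system);
        result.addAll(history.subList(history.size() - maxTurns, history.size()));
        return result;
    }
}
```

A token-based budget (rather than a message count) tracks the real context-window limit more closely; see memory.md for richer strategies such as summarizing old turns.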
Example 4: Model Fallback on Failure
@Service
public class ResilientChatService {

    private static final Logger log = LoggerFactory.getLogger(ResilientChatService.class);

    private final ChatModel primaryModel;
    private final ChatModel fallbackModel;

    public ResilientChatService(@Qualifier("primaryModel") ChatModel primaryModel,
                                @Qualifier("fallbackModel") ChatModel fallbackModel) {
        this.primaryModel = primaryModel;
        this.fallbackModel = fallbackModel;
    }

    @CircuitBreaker(name = "primaryLLM", fallbackMethod = "fallbackChat")
    public String chat(String message) {
        return primaryModel.call(new Prompt(message))
                .getResult()
                .getOutput()
                .getContent();
    }

    private String fallbackChat(String message, Exception e) {
        log.warn("Primary model failed, using fallback", e);
        return fallbackModel.call(new Prompt(message))
                .getResult()
                .getOutput()
                .getContent();
    }
}
✅ Good for: Production reliability
❌ Not good for: Cost-insensitive applications (double cost on retries)
Example 5: Multi-Provider Configuration
@Configuration
public class MultiModelConfig {

    @Bean
    @Primary
    public ChatModel defaultChatModel(
            @Value("${spring.ai.openai.api-key}") String openAiKey) {
        return new OpenAiChatModel(
            OpenAiApi.builder().apiKey(openAiKey).build(),
            OpenAiChatOptions.builder()
                .withModel("gpt-4-turbo")
                .withTemperature(0.0)
                .build()
        );
    }

    @Bean("cheapModel")
    public ChatModel cheapChatModel(
            @Value("${spring.ai.openai.api-key}") String openAiKey) {
        return new OpenAiChatModel(
            OpenAiApi.builder().apiKey(openAiKey).build(),
            OpenAiChatOptions.builder()
                .withModel("gpt-3.5-turbo")
                .withTemperature(0.0)
                .build()
        );
    }

    @Bean("localModel")
    public ChatModel localChatModel() {
        return new OllamaChatModel(
            OllamaApi.builder().baseUrl("http://localhost:11434").build(),
            OllamaOptions.builder()
                .withModel("llama3")
                .build()
        );
    }
}
Anti-Patterns
❌ Hardcoding Provider-Specific Logic
// DON'T: Tightly coupled to OpenAI
OpenAiChatModel openAi = new OpenAiChatModel(...);
String response = openAi.call(prompt);
Why: Cannot switch providers without code changes.
✅ DO: Use ChatModel interface
@Autowired
private ChatModel chatModel; // Can be OpenAI, Anthropic, etc.
❌ Using High Temperature in Production
// DON'T: Non-deterministic in production
ChatOptions.builder()
    .withTemperature(0.9) // Different output each time
    .build();
Why: Inconsistent behavior, harder to test, unpredictable UX.
✅ DO: Use temperature = 0 for production workflows
ChatOptions.builder()
    .withTemperature(0.0) // Same input → same output
    .build();
❌ Ignoring Token Limits
// DON'T: No max tokens limit
chatModel.call(new Prompt("Write a novel about..."));
Why: Runaway costs, unbounded latency, possible timeout.
✅ DO: Always set max tokens
chatModel.call(
    new Prompt(
        "Write a novel about...",
        ChatOptions.builder().withMaxTokens(1000).build()
    )
);
❌ Synchronous Calls in User-Facing APIs
// DON'T: Blocking call in controller
@GetMapping("/chat")
public String chat(@RequestParam String message) {
    return chatModel.call(new Prompt(message)) // Blocks for 2-10 seconds
        .getResult().getOutput().getContent();
}
Why: Poor UX, thread pool exhaustion, perceived latency.
✅ DO: Use streaming or async
@GetMapping(value = "/chat", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<String> chat(@RequestParam String message) {
    return streamingChatModel.stream(new Prompt(message))
        .map(r -> r.getResult().getOutput().getContent());
}
Testing Strategies
Unit Testing with Mocks
@ExtendWith(MockitoExtension.class)
class SummaryServiceTest {

    @Mock
    private ChatModel chatModel;

    @InjectMocks
    private SummaryService summaryService;

    @Test
    void shouldSummarizeText() {
        // Given
        when(chatModel.call(any(Prompt.class)))
            .thenReturn(new ChatResponse(
                List.of(new Generation(
                    new AssistantMessage("Summary of text")
                ))
            ));

        // When
        String summary = summaryService.summarize("Long text...");

        // Then
        assertEquals("Summary of text", summary);
    }
}
Integration Testing with Local Models
@SpringBootTest
@TestPropertySource(properties = {
    "spring.ai.ollama.base-url=http://localhost:11434",
    "spring.ai.ollama.chat.options.model=llama3"
})
class ChatServiceIntegrationTest {

    @Autowired
    private ChatModel chatModel;

    @Test
    void shouldAnswerQuestion() {
        String answer = chatModel.call(new Prompt("What is 2+2?"))
            .getResult()
            .getOutput()
            .getContent();
        assertTrue(answer.contains("4"));
    }
}
Golden Dataset Testing
@Test
void shouldProduceDeterministicOutputs() {
    List<String> questions = goldenDataset.getQuestions();
    List<String> expectedAnswers = goldenDataset.getExpectedAnswers();

    for (int i = 0; i < questions.size(); i++) {
        String actual = chatModel.call(
            new Prompt(
                questions.get(i),
                ChatOptions.builder().withTemperature(0.0).build()
            )
        ).getResult().getOutput().getContent();

        assertSimilar(expectedAnswers.get(i), actual, 0.9); // 90% similarity
    }
}
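The assertSimilar helper above is not a JUnit built-in. One minimal, self-contained implementation compares token-level Jaccard overlap; embedding-based cosine similarity (see embedding-models.md) is usually more robust:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for the golden-dataset test above: compares answers by
// token-level Jaccard similarity. Embedding cosine similarity is usually more
// robust but requires an EmbeddingModel.
class SimilarityAssert {

    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        return union.isEmpty() ? 1.0 : (double) intersection.size() / union.size();
    }

    static void assertSimilar(String expected, String actual, double threshold) {
        double score = jaccard(expected, actual);
        if (score < threshold) {
            throw new AssertionError("Similarity " + score + " below threshold " + threshold);
        }
    }

    private static Set<String> tokens(String s) {
        return new HashSet<>(Arrays.asList(s.toLowerCase().split("\\W+")));
    }
}
```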
Performance Considerations
| Concern | Strategy |
|---|---|
| Latency | Use streaming for > 2s responses; route simple queries to fast models |
| Cost | Route by complexity; cache responses; set max tokens |
| Throughput | Use async clients; batch requests; connection pooling |
| Context Window | Summarize old conversations; enforce message limits |
| Token Usage | Monitor input/output tokens; alert on anomalies |
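For the token-usage row, a rough pre-flight estimate can reject oversized prompts before they are sent. The ~4 characters per token ratio below is a common approximation for English text only; exact counts require the provider's tokenizer, so treat this sketch as a sanity check, not an accounting tool:

```java
// Rough token estimator for budgeting. The 4-chars-per-token ratio is a
// common English-text approximation; exact counts require the provider's
// tokenizer, so use this only as a pre-flight sanity check.
class TokenBudget {

    private static final double CHARS_PER_TOKEN = 4.0;

    static int estimateTokens(String text) {
        return (int) Math.ceil(text.length() / CHARS_PER_TOKEN);
    }

    // Fail fast before sending a prompt that cannot fit the model's window.
    static void checkFits(String prompt, int contextWindow, int reservedForOutput) {
        int estimate = estimateTokens(prompt);
        if (estimate > contextWindow - reservedForOutput) {
            throw new IllegalArgumentException(
                "Prompt ~" + estimate + " tokens exceeds budget of "
                + (contextWindow - reservedForOutput));
        }
    }
}
```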
Observability
Metrics to Track
@Component
@Aspect
public class ChatModelMetrics {

    private final MeterRegistry registry;

    public ChatModelMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    @Around("execution(* org.springframework.ai.chat.model.ChatModel.call(..))")
    public Object trackChatCall(ProceedingJoinPoint joinPoint) throws Throwable {
        Timer.Sample sample = Timer.start(registry);
        try {
            ChatResponse response = (ChatResponse) joinPoint.proceed();

            // Track token usage (extractModel reads the model name from
            // response metadata; implementation not shown)
            registry.counter("llm.tokens.input",
                    "model", extractModel(response))
                .increment(response.getMetadata().getUsage().getPromptTokens());
            registry.counter("llm.tokens.output",
                    "model", extractModel(response))
                .increment(response.getMetadata().getUsage().getGenerationTokens());

            return response;
        } finally {
            sample.stop(registry.timer("llm.call.duration"));
        }
    }
}
References
Related Skills
- embedding-models.md — Vector representations for semantic search
- prompt-templates.md — Structured, versioned prompts
- tool-calling.md — LLM-invoked functions
- memory.md — Conversation context management
- observability.md — Metrics and tracing
- failure-handling.md — Resilience patterns