Memory
Overview
Memory management in LLM applications involves storing and retrieving conversation history to maintain context across multiple turns. Spring AI supports multiple memory strategies including window-based (keep last N messages), summary-based (compress old messages), and custom implementations. Effective memory management prevents context window overflow while maintaining conversational coherence.
Key Concepts
Memory Types
Memory Strategy Comparison

| Strategy | Approach | Pros | Cons | Use when |
|---|---|---|---|---|
| Window | Keep last N messages | Simple, fast, deterministic | Loses old context; fixed window | Short conversations (< 20 messages) |
| Summary | Summarize old messages, keep recent | Unbounded conversations; compressed context | Information loss; extra LLM calls | Long conversations (> 50 messages) |
| Entity | Extract and store key entities (names, dates) | Preserves critical facts | Complex extraction; may miss context | Customer support, CRM integration |
| Vector | Embed messages, retrieve relevant history | Semantic retrieval; scales well | Complex, slower; requires a vector store | Very long conversations, semantic recall needed |
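The strategies above become interchangeable when they share one abstraction, so an application can start with window memory and swap in summarization later. A minimal plain-Java sketch (the `ChatMemory` interface and `SimpleWindowMemory` class are illustrative names, not Spring AI APIs):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One interface that any of the four strategies can implement.
interface ChatMemory {
    void add(String conversationId, String message);
    List<String> get(String conversationId);
}

// The simplest strategy: keep only the last N messages per conversation.
class SimpleWindowMemory implements ChatMemory {
    private final int windowSize;
    private final Map<String, List<String>> store = new HashMap<>();

    SimpleWindowMemory(int windowSize) {
        this.windowSize = windowSize;
    }

    @Override
    public void add(String conversationId, String message) {
        List<String> history = store.computeIfAbsent(conversationId, k -> new ArrayList<>());
        history.add(message);
        // Drop the oldest messages once the window is exceeded
        while (history.size() > windowSize) {
            history.remove(0);
        }
    }

    @Override
    public List<String> get(String conversationId) {
        return store.getOrDefault(conversationId, List.of());
    }
}
```

Callers depend only on `ChatMemory`, so replacing the window strategy with a summary or vector implementation requires no changes at the call site.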
Context Window Budget
Context Window Allocation (total: 8192 tokens, e.g., GPT-4)

| Component | Tokens | Share |
|---|---|---|
| System prompt | 500 | 6% |
| Memory (history) | 3000 | 37% |
| Current query | 200 | 2% |
| Response buffer | 4000 | 49% |
| Safety margin | 492 | 6% |
Best Practices
1. Set Explicit Memory Limits
Never allow unbounded conversation history.
@Service
public class ConversationService {

    private final ChatModel chatModel;
    private final MemoryStore memoryStore; // application-provided history store

    private static final int MAX_MESSAGES = 20; // keep at most the last 20 messages
    private static final int MAX_TOKENS = 3000; // cap history at 3000 tokens

    public String chat(String userId, String message) {
        List<Message> history = memoryStore.get(userId);

        // Enforce message limit (copy the sublist so we can mutate it safely)
        if (history.size() > MAX_MESSAGES) {
            history = new ArrayList<>(history.subList(
                history.size() - MAX_MESSAGES,
                history.size()
            ));
        }

        // Enforce token limit
        while (countTokens(history) > MAX_TOKENS && history.size() > 1) {
            history.remove(0); // Remove oldest message
        }

        // Add new message
        history.add(new UserMessage(message));

        // Generate response
        String response = chatModel.call(new Prompt(history))
            .getResult()
            .getOutput()
            .getContent();

        history.add(new AssistantMessage(response));
        memoryStore.save(userId, history);
        return response;
    }
}
2. Separate System Prompt from History
System prompt should not count against memory budget.
public List<Message> buildMessages(String systemPrompt, List<Message> history, String userMessage) {
    List<Message> messages = new ArrayList<>();
    // System prompt (not part of history)
    messages.add(new SystemMessage(systemPrompt));
    // Conversation history
    messages.addAll(history);
    // Current user message
    messages.add(new UserMessage(userMessage));
    return messages;
}
3. Implement Sliding Window with Summarization
Compress old messages when window is full.
@Service
public class SlidingWindowMemoryService {

    private final ChatModel chatModel;
    private final MemoryStore memoryStore; // application-provided history store

    private static final int KEEP_RECENT = 10;

    public List<Message> getMemory(String userId) {
        List<Message> allHistory = memoryStore.get(userId);
        if (allHistory.size() <= KEEP_RECENT) {
            return allHistory;
        }

        // Split into old (to summarize) and recent (keep as-is)
        List<Message> oldMessages = allHistory.subList(0, allHistory.size() - KEEP_RECENT);
        List<Message> recentMessages = allHistory.subList(allHistory.size() - KEEP_RECENT, allHistory.size());

        // Summarize old messages
        String summary = summarizeMessages(oldMessages);

        // Build result: [summary] + recent messages
        List<Message> result = new ArrayList<>();
        result.add(new SystemMessage("Previous conversation summary: " + summary));
        result.addAll(recentMessages);
        return result;
    }

    private String summarizeMessages(List<Message> messages) {
        String prompt = String.format("""
            Summarize the following conversation in 3-5 sentences, preserving key facts:
            %s
            """,
            messages.stream()
                .map(m -> m.getMessageType() + ": " + m.getContent())
                .collect(Collectors.joining("\n"))
        );
        return chatModel.call(new Prompt(prompt))
            .getResult()
            .getOutput()
            .getContent();
    }
}
4. Extract and Store Entities for Long-Term Memory
Preserve critical facts beyond conversation window.
@Service
public class EntityMemoryService {

    private final ChatModel chatModel;
    private final EntityStore entityStore;

    public void extractAndStoreEntities(String userId, String message, String response) {
        String prompt = String.format("""
            Extract key entities (names, dates, preferences, IDs) from this conversation:
            User: %s
            Assistant: %s
            Output JSON:
            {"entities": [{"type": "...", "value": "...", "context": "..."}]}
            """, message, response);

        String json = chatModel.call(new Prompt(prompt))
            .getResult()
            .getOutput()
            .getContent();

        List<Entity> entities = parseEntities(json);
        entityStore.save(userId, entities);
    }

    public String getEntityContext(String userId) {
        List<Entity> entities = entityStore.get(userId);
        return entities.stream()
            .map(e -> e.getType() + ": " + e.getValue())
            .collect(Collectors.joining(", "));
    }
}
5. Use Vector Memory for Semantic Retrieval
Retrieve relevant history based on semantic similarity, not recency.
@Service
public class VectorMemoryService {

    private final VectorStore vectorStore; // embeds documents on add

    public void saveMessage(String userId, Message message) {
        Document doc = new Document(
            message.getContent(),
            Map.of(
                "userId", userId,
                "role", message.getMessageType().getValue(),
                "timestamp", System.currentTimeMillis()
            )
        );
        vectorStore.add(List.of(doc));
    }

    public List<Message> retrieveRelevantHistory(String userId, String currentQuery, int k) {
        List<Document> relevant = vectorStore.similaritySearch(
            SearchRequest.query(currentQuery)
                .withTopK(k)
                .withFilterExpression("userId == '" + userId + "'")
        );
        return relevant.stream()
            .map(doc -> new UserMessage(doc.getContent()))
            .collect(Collectors.toList());
    }
}
Code Examples
Example 1: Simple Window Memory
@Service
public class WindowMemoryService {

    private final Map<String, LinkedList<Message>> conversations = new ConcurrentHashMap<>();
    private static final int WINDOW_SIZE = 10;

    public void addMessage(String userId, Message message) {
        LinkedList<Message> history =
            conversations.computeIfAbsent(userId, k -> new LinkedList<>());
        synchronized (history) { // LinkedList itself is not thread-safe
            history.add(message);
            // Enforce window size
            while (history.size() > WINDOW_SIZE) {
                history.removeFirst();
            }
        }
    }

    public List<Message> getHistory(String userId) {
        LinkedList<Message> history = conversations.getOrDefault(userId, new LinkedList<>());
        synchronized (history) {
            return new ArrayList<>(history);
        }
    }
}
✅ Good for: Short conversations, simple use cases
❌ Not good for: Long-running conversations (loses context)
Example 2: Token-Budgeted Memory
@Service
public class TokenBudgetMemoryService {

    private final MemoryStore memoryStore; // application-provided history store
    private static final int MAX_TOKENS = 2000;

    public List<Message> getMemory(String userId) {
        List<Message> fullHistory = memoryStore.get(userId);
        List<Message> result = new ArrayList<>();
        int totalTokens = 0;

        // Add messages from most recent, up to the token budget
        for (int i = fullHistory.size() - 1; i >= 0; i--) {
            Message msg = fullHistory.get(i);
            int tokens = estimateTokens(msg.getContent());
            if (totalTokens + tokens > MAX_TOKENS) {
                break;
            }
            result.add(0, msg); // Prepend to maintain chronological order
            totalTokens += tokens;
        }
        return result;
    }

    private int estimateTokens(String text) {
        return (int) Math.ceil(text.length() / 4.0); // Rough estimate: ~4 characters per token
    }
}
✅ Good for: Fixed context window models
❌ Not good for: When message order matters more than token count
Example 3: Periodic Summarization
@Service
public class PeriodicSummarizationService {

    private final ChatModel chatModel;
    private final MemoryStore memoryStore; // application-provided history store

    @Scheduled(fixedDelay = 300000) // Every 5 minutes
    public void summarizeStaleConversations() {
        List<String> staleUserIds = findConversationsOlderThan(Duration.ofMinutes(10));
        for (String userId : staleUserIds) {
            List<Message> history = memoryStore.get(userId);
            if (history.size() > 20) {
                String summary = summarizeConversation(history);
                // Replace with summary + last 5 messages
                List<Message> compressed = new ArrayList<>();
                compressed.add(new SystemMessage("Summary: " + summary));
                compressed.addAll(history.subList(history.size() - 5, history.size()));
                memoryStore.save(userId, compressed);
            }
        }
    }
}
✅ Good for: Background compression, resource optimization
❌ Not good for: Real-time interactions (stale summary)
Example 4: Hybrid Memory (Window + Entities)
@Service
public class HybridMemoryService {

    private final WindowMemoryService windowMemory;
    private final EntityMemoryService entityMemory;

    public List<Message> buildContext(String userId, String currentQuery) {
        List<Message> messages = new ArrayList<>();

        // Add entity context as system message
        String entityContext = entityMemory.getEntityContext(userId);
        if (!entityContext.isEmpty()) {
            messages.add(new SystemMessage(
                "Known facts about user: " + entityContext
            ));
        }

        // Add recent conversation history
        messages.addAll(windowMemory.getHistory(userId));
        return messages;
    }
}
✅ Good for: Customer support, personalized chat
❌ Not good for: Anonymous users (no entity persistence)
Example 5: Memory Expiration and Cleanup
@Service
public class ExpiringMemoryService {

    private final Map<String, ConversationState> conversations = new ConcurrentHashMap<>();

    public void saveMessage(String userId, Message message) {
        ConversationState state = conversations.computeIfAbsent(
            userId,
            k -> new ConversationState()
        );
        state.addMessage(message);
        state.updateLastAccess();
    }

    @Scheduled(fixedDelay = 3600000) // Every hour
    public void cleanupStaleConversations() {
        Instant threshold = Instant.now().minus(Duration.ofHours(24));
        conversations.entrySet().removeIf(entry ->
            entry.getValue().getLastAccess().isBefore(threshold)
        );
    }

    static class ConversationState {
        private final List<Message> messages = new ArrayList<>();
        private Instant lastAccess = Instant.now();

        public void addMessage(Message msg) {
            messages.add(msg);
            lastAccess = Instant.now();
        }

        public void updateLastAccess() {
            lastAccess = Instant.now();
        }

        public Instant getLastAccess() {
            return lastAccess;
        }
    }
}
✅ Good for: Memory management, privacy compliance
❌ Not good for: Long-term user profiles (use database)
Anti-Patterns
❌ Unbounded Memory
// DON'T: Conversation history grows indefinitely
conversations.get(userId).add(message); // No limit!
Why: Will exceed context window, increase latency, and raise costs.
✅ DO: Enforce limits
while (history.size() > MAX_MESSAGES) {
    history.remove(0);
}
❌ Including System Prompt in History
// DON'T: System prompt counted against memory budget
history.add(new SystemMessage("You are a helpful assistant."));
history.addAll(conversationMessages);
Why: Wastes context window on static content.
✅ DO: Separate system prompt
List<Message> messages = new ArrayList<>();
messages.add(systemMessage); // Not stored in history
messages.addAll(memoryStore.get(userId)); // Actual history
❌ No Privacy Controls
// DON'T: Store sensitive data indefinitely
memory.save(userId, new UserMessage(creditCardInfo));
Why: GDPR/CCPA violations, data breach risk.
✅ DO: Redact sensitive data and expire
String sanitized = piiRedactor.redact(message);
memory.saveWithTTL(userId, sanitized, Duration.ofHours(24));
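The redaction step can be sketched with regexes, though real PII detection needs more than pattern matching (validation, named-entity recognition); the `piiRedactor` above is assumed to be such a component. A minimal illustrative sketch:

```java
import java.util.regex.Pattern;

// A minimal, illustrative regex-based redactor. Not production-grade:
// it only catches card-like number runs and simple email addresses.
class SimplePiiRedactor {

    // Matches 13-16 digit card-like numbers, optionally separated by spaces or dashes
    private static final Pattern CARD = Pattern.compile("\\b(?:\\d[ -]?){13,16}\\b");
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.]+");

    static String redact(String text) {
        String result = CARD.matcher(text).replaceAll("[REDACTED_CARD]");
        return EMAIL.matcher(result).replaceAll("[REDACTED_EMAIL]");
    }
}
```

Run redaction before persisting any message, not after, so sensitive values never reach the memory store in the first place.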
Testing Strategies
Unit Testing Memory Limits
@Test
void shouldEnforceMessageLimit() {
    WindowMemoryService memory = new WindowMemoryService();
    for (int i = 0; i < 50; i++) {
        memory.addMessage("user1", new UserMessage("Message " + i));
    }
    assertEquals(10, memory.getHistory("user1").size());
    assertTrue(memory.getHistory("user1").get(0).getContent().contains("Message 40"));
}
Integration Testing Summarization
@SpringBootTest
class SummarizationIntegrationTest {

    @Autowired
    private SlidingWindowMemoryService memoryService;

    @Autowired
    private MemoryStore memoryStore; // the store backing the service

    @Test
    void shouldSummarizeOldMessages() {
        String userId = "test-user";

        // Seed the backing store with 30 messages
        List<Message> history = new ArrayList<>();
        for (int i = 0; i < 30; i++) {
            history.add(new UserMessage("Message " + i));
        }
        memoryStore.save(userId, history);

        List<Message> memory = memoryService.getMemory(userId);

        // Should have summary + 10 recent messages
        assertEquals(11, memory.size());
        assertTrue(memory.get(0) instanceof SystemMessage);
        assertTrue(memory.get(0).getContent().contains("summary"));
    }
}
Performance Considerations
| Concern | Strategy |
|---|---|
| Memory Usage | Expire old conversations; summarize instead of storing all |
| Latency | Cache memory in application layer; avoid DB roundtrips |
| Cost | Minimize summarization calls; batch when possible |
| Privacy | Auto-expire conversations; redact PII; encrypt at rest |
References
Related Skills
chat-models.md — LLM integration
prompt-templates.md — System prompts
retrieval.md — Vector memory
observability.md — Tracking memory usage