Memory
Overview
Memory management in LLM applications involves storing and retrieving conversation history to maintain context across multiple turns. Spring AI supports multiple memory strategies including window-based (keep last N messages), summary-based (compress old messages), and custom implementations. Effective memory management prevents context window overflow while maintaining conversational coherence.
Key Concepts
Memory Types
Memory Strategy Comparison

| Strategy | Approach | Pros | Cons | Use when |
|---|---|---|---|---|
| Window | Keep last N messages | Simple, fast, deterministic | Loses old context; fixed window | Short conversations (< 20 messages) |
| Summary | Summarize old messages, keep recent | Unbounded conversations; compressed context | Information loss; extra LLM calls | Long conversations (> 50 messages) |
| Entity | Extract and store key entities (names, dates) | Preserves critical facts | Complex extraction; may miss context | Customer support, CRM integration |
| Vector | Embed messages, retrieve relevant history | Semantic retrieval; scales well | Complex, slower; requires a vector store | Very long conversations, semantic recall needed |
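The strategies above become interchangeable when they share one abstraction, so an application can start with window memory and swap in summarization later. A minimal plain-Java sketch (the `ChatMemory` interface and `SimpleWindowMemory` class are illustrative names, not Spring AI APIs):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One interface that any of the four strategies can implement.
interface ChatMemory {
    void add(String conversationId, String message);
    List<String> get(String conversationId);
}

// The simplest strategy: keep only the last N messages per conversation.
class SimpleWindowMemory implements ChatMemory {
    private final int windowSize;
    private final Map<String, List<String>> store = new HashMap<>();

    SimpleWindowMemory(int windowSize) {
        this.windowSize = windowSize;
    }

    @Override
    public void add(String conversationId, String message) {
        List<String> history = store.computeIfAbsent(conversationId, k -> new ArrayList<>());
        history.add(message);
        // Drop the oldest messages once the window is exceeded
        while (history.size() > windowSize) {
            history.remove(0);
        }
    }

    @Override
    public List<String> get(String conversationId) {
        return store.getOrDefault(conversationId, List.of());
    }
}
```

Callers depend only on `ChatMemory`, so replacing the window strategy with a summary or vector implementation requires no changes at the call site.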
Context Window Budget
Context Window Allocation (total: 8192 tokens, e.g., GPT-4)

| Component | Tokens | Share |
|---|---|---|
| System prompt | 500 | 6% |
| Memory (history) | 3000 | 37% |
| Current query | 200 | 2% |
| Response buffer | 4000 | 49% |
| Safety margin | 492 | 6% |
Best Practices
1. Set Explicit Memory Limits
Never allow unbounded conversation history.
@Service
public class ConversationService {

    private final ChatModel chatModel;
    private final MemoryStore memoryStore; // application-provided history store

    private static final int MAX_MESSAGES = 20; // keep at most the last 20 messages
    private static final int MAX_TOKENS = 3000; // cap history at 3000 tokens

    public String chat(String userId, String message) {
        List<Message> history = memoryStore.get(userId);

        // Enforce message limit (copy the sublist so we can mutate it safely)
        if (history.size() > MAX_MESSAGES) {
            history = new ArrayList<>(history.subList(
                history.size() - MAX_MESSAGES,
                history.size()
            ));
        }

        // Enforce token limit
        while (countTokens(history) > MAX_TOKENS && history.size() > 1) {
            history.remove(0); // Remove oldest message
        }

        // Add new message
        history.add(new UserMessage(message));

        // Generate response
        String response = chatModel.call(new Prompt(history))
            .getResult()
            .getOutput()
            .getContent();

        history.add(new AssistantMessage(response));
        memoryStore.save(userId, history);
        return response;
    }
}
2. Separate System Prompt from History
System prompt should not count against memory budget.
public List<Message> buildMessages(String systemPrompt, List<Message> history, String userMessage) {
    List<Message> messages = new ArrayList<>();
    // System prompt (not part of history)
    messages.add(new SystemMessage(systemPrompt));
    // Conversation history
    messages.addAll(history);
    // Current user message
    messages.add(new UserMessage(userMessage));
    return messages;
}
3. Implement Sliding Window with Summarization
Compress old messages when window is full.
@Service
public class SlidingWindowMemoryService {

    private final ChatModel chatModel;
    private final MemoryStore memoryStore; // application-provided history store

    private static final int KEEP_RECENT = 10;

    public List<Message> getMemory(String userId) {
        List<Message> allHistory = memoryStore.get(userId);
        if (allHistory.size() <= KEEP_RECENT) {
            return allHistory;
        }

        // Split into old (to summarize) and recent (keep as-is)
        List<Message> oldMessages = allHistory.subList(0, allHistory.size() - KEEP_RECENT);
        List<Message> recentMessages = allHistory.subList(allHistory.size() - KEEP_RECENT, allHistory.size());

        // Summarize old messages
        String summary = summarizeMessages(oldMessages);

        // Build result: [summary] + recent messages
        List<Message> result = new ArrayList<>();
        result.add(new SystemMessage("Previous conversation summary: " + summary));
        result.addAll(recentMessages);
        return result;
    }

    private String summarizeMessages(List<Message> messages) {
        String prompt = String.format("""
            Summarize the following conversation in 3-5 sentences, preserving key facts:
            %s
            """,
            messages.stream()
                .map(m -> m.getMessageType() + ": " + m.getContent())
                .collect(Collectors.joining("\n"))
        );
        return chatModel.call(new Prompt(prompt))
            .getResult()
            .getOutput()
            .getContent();
    }
}
4. Extract and Store Entities for Long-Term Memory
Preserve critical facts beyond conversation window.
@Service
public class EntityMemoryService {

    private final ChatModel chatModel;
    private final EntityStore entityStore;

    public void extractAndStoreEntities(String userId, String message, String response) {
        String prompt = String.format("""
            Extract key entities (names, dates, preferences, IDs) from this conversation:
            User: %s
            Assistant: %s
            Output JSON:
            {"entities": [{"type": "...", "value": "...", "context": "..."}]}
            """, message, response);

        String json = chatModel.call(new Prompt(prompt))
            .getResult()
            .getOutput()
            .getContent();

        List<Entity> entities = parseEntities(json);
        entityStore.save(userId, entities);
    }

    public String getEntityContext(String userId) {
        List<Entity> entities = entityStore.get(userId);
        return entities.stream()
            .map(e -> e.getType() + ": " + e.getValue())
            .collect(Collectors.joining(", "));
    }
}
5. Use Vector Memory for Semantic Retrieval
Retrieve relevant history based on semantic similarity, not recency.
@Service
public class VectorMemoryService {

    private final VectorStore vectorStore; // embeds documents on add

    public void saveMessage(String userId, Message message) {
        Document doc = new Document(
            message.getContent(),
            Map.of(
                "userId", userId,
                "role", message.getMessageType().getValue(),
                "timestamp", System.currentTimeMillis()
            )
        );
        vectorStore.add(List.of(doc));
    }

    public List<Message> retrieveRelevantHistory(String userId, String currentQuery, int k) {
        List<Document> relevant = vectorStore.similaritySearch(
            SearchRequest.query(currentQuery)
                .withTopK(k)
                .withFilterExpression("userId == '" + userId + "'")
        );
        return relevant.stream()
            .map(doc -> new UserMessage(doc.getContent()))
            .collect(Collectors.toList());
    }
}
Code Examples
Example 1: Simple Window Memory
@Service
public class WindowMemoryService {

    private final Map<String, LinkedList<Message>> conversations = new ConcurrentHashMap<>();
    private static final int WINDOW_SIZE = 10;

    public void addMessage(String userId, Message message) {
        LinkedList<Message> history =
            conversations.computeIfAbsent(userId, k -> new LinkedList<>());
        synchronized (history) { // LinkedList itself is not thread-safe
            history.add(message);
            // Enforce window size
            while (history.size() > WINDOW_SIZE) {
                history.removeFirst();
            }
        }
    }

    public List<Message> getHistory(String userId) {
        LinkedList<Message> history = conversations.getOrDefault(userId, new LinkedList<>());
        synchronized (history) {
            return new ArrayList<>(history);
        }
    }
}
✅ Good for: Short conversations, simple use cases
❌ Not good for: Long-running conversations (loses context)
Example 2: Token-Budgeted Memory
@Service
public class TokenBudgetMemoryService {

    private final MemoryStore memoryStore; // application-provided history store
    private static final int MAX_TOKENS = 2000;

    public List<Message> getMemory(String userId) {
        List<Message> fullHistory = memoryStore.get(userId);
        List<Message> result = new ArrayList<>();
        int totalTokens = 0;

        // Add messages from most recent, up to the token budget
        for (int i = fullHistory.size() - 1; i >= 0; i--) {
            Message msg = fullHistory.get(i);
            int tokens = estimateTokens(msg.getContent());
            if (totalTokens + tokens > MAX_TOKENS) {
                break;
            }
            result.add(0, msg); // Prepend to maintain chronological order
            totalTokens += tokens;
        }
        return result;
    }

    private int estimateTokens(String text) {
        return (int) Math.ceil(text.length() / 4.0); // Rough estimate: ~4 characters per token
    }
}
✅ Good for: Fixed context window models
❌ Not good for: When message order matters more than token count
Example 3: Periodic Summarization
@Service
public class PeriodicSummarizationService {

    private final ChatModel chatModel;
    private final MemoryStore memoryStore; // application-provided history store

    @Scheduled(fixedDelay = 300000) // Every 5 minutes
    public void summarizeStaleConversations() {
        List<String> staleUserIds = findConversationsOlderThan(Duration.ofMinutes(10));
        for (String userId : staleUserIds) {
            List<Message> history = memoryStore.get(userId);
            if (history.size() > 20) {
                String summary = summarizeConversation(history);
                // Replace with summary + last 5 messages
                List<Message> compressed = new ArrayList<>();
                compressed.add(new SystemMessage("Summary: " + summary));
                compressed.addAll(history.subList(history.size() - 5, history.size()));
                memoryStore.save(userId, compressed);
            }
        }
    }
}
✅ Good for: Background compression, resource optimization
❌ Not good for: Real-time interactions (stale summary)
Example 4: Hybrid Memory (Window + Entities)
@Service
public class HybridMemoryService {

    private final WindowMemoryService windowMemory;
    private final EntityMemoryService entityMemory;

    public List<Message> buildContext(String userId, String currentQuery) {
        List<Message> messages = new ArrayList<>();

        // Add entity context as system message
        String entityContext = entityMemory.getEntityContext(userId);
        if (!entityContext.isEmpty()) {
            messages.add(new SystemMessage(
                "Known facts about user: " + entityContext
            ));
        }

        // Add recent conversation history
        messages.addAll(windowMemory.getHistory(userId));
        return messages;
    }
}
✅ Good for: Customer support, personalized chat
❌ Not good for: Anonymous users (no entity persistence)
Example 5: Memory Expiration and Cleanup
@Service
public class ExpiringMemoryService {

    private final Map<String, ConversationState> conversations = new ConcurrentHashMap<>();

    public void saveMessage(String userId, Message message) {
        ConversationState state = conversations.computeIfAbsent(
            userId,
            k -> new ConversationState()
        );
        state.addMessage(message);
        state.updateLastAccess();
    }

    @Scheduled(fixedDelay = 3600000) // Every hour
    public void cleanupStaleConversations() {
        Instant threshold = Instant.now().minus(Duration.ofHours(24));
        conversations.entrySet().removeIf(entry ->
            entry.getValue().getLastAccess().isBefore(threshold)
        );
    }

    static class ConversationState {
        private final List<Message> messages = new ArrayList<>();
        private Instant lastAccess = Instant.now();

        public void addMessage(Message msg) {
            messages.add(msg);
            lastAccess = Instant.now();
        }

        public void updateLastAccess() {
            lastAccess = Instant.now();
        }

        public Instant getLastAccess() {
            return lastAccess;
        }
    }
}
✅ Good for: Memory management, privacy compliance
❌ Not good for: Long-term user profiles (use database)
Anti-Patterns
❌ Unbounded Memory
// DON'T: Conversation history grows indefinitely
conversations.get(userId).add(message); // No limit!
Why: Will exceed context window, increase latency, and raise costs.
✅ DO: Enforce limits
while (history.size() > MAX_MESSAGES) {
    history.remove(0);
}
❌ Including System Prompt in History
// DON'T: System prompt counted against memory budget
history.add(new SystemMessage("You are a helpful assistant."));
history.addAll(conversationMessages);
Why: Wastes context window on static content.
✅ DO: Separate system prompt
List<Message> messages = new ArrayList<>();
messages.add(systemMessage); // Not stored in history
messages.addAll(memoryStore.get(userId)); // Actual history
❌ No Privacy Controls
// DON'T: Store sensitive data indefinitely
memory.save(userId, new UserMessage(creditCardInfo));
Why: GDPR/CCPA violations, data breach risk.
✅ DO: Redact sensitive data and expire
String sanitized = piiRedactor.redact(message);
memory.saveWithTTL(userId, sanitized, Duration.ofHours(24));
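The redaction step can be sketched with regexes, though real PII detection needs more than pattern matching (validation, named-entity recognition); the `piiRedactor` above is assumed to be such a component. A minimal illustrative sketch:

```java
import java.util.regex.Pattern;

// A minimal, illustrative regex-based redactor. Not production-grade:
// it only catches card-like number runs and simple email addresses.
class SimplePiiRedactor {

    // Matches 13-16 digit card-like numbers, optionally separated by spaces or dashes
    private static final Pattern CARD = Pattern.compile("\\b(?:\\d[ -]?){13,16}\\b");
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.]+");

    static String redact(String text) {
        String result = CARD.matcher(text).replaceAll("[REDACTED_CARD]");
        return EMAIL.matcher(result).replaceAll("[REDACTED_EMAIL]");
    }
}
```

Run redaction before persisting any message, not after, so sensitive values never reach the memory store in the first place.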
Testing Strategies
Unit Testing Memory Limits
@Test
void shouldEnforceMessageLimit() {
    WindowMemoryService memory = new WindowMemoryService();
    for (int i = 0; i < 50; i++) {
        memory.addMessage("user1", new UserMessage("Message " + i));
    }
    assertEquals(10, memory.getHistory("user1").size());
    assertTrue(memory.getHistory("user1").get(0).getContent().contains("Message 40"));
}
Integration Testing Summarization
@SpringBootTest
class SummarizationIntegrationTest {

    @Autowired
    private SlidingWindowMemoryService memoryService;

    @Autowired
    private MemoryStore memoryStore; // the store backing the service

    @Test
    void shouldSummarizeOldMessages() {
        String userId = "test-user";

        // Seed the backing store with 30 messages
        List<Message> history = new ArrayList<>();
        for (int i = 0; i < 30; i++) {
            history.add(new UserMessage("Message " + i));
        }
        memoryStore.save(userId, history);

        List<Message> memory = memoryService.getMemory(userId);

        // Should have summary + 10 recent messages
        assertEquals(11, memory.size());
        assertTrue(memory.get(0) instanceof SystemMessage);
        assertTrue(memory.get(0).getContent().contains("summary"));
    }
}
Performance Considerations
| Concern | Strategy |
|---|---|
| Memory Usage | Expire old conversations; summarize instead of storing all |
| Latency | Cache memory in application layer; avoid DB roundtrips |
| Cost | Minimize summarization calls; batch when possible |
| Privacy | Auto-expire conversations; redact PII; encrypt at rest |
References
Related Skills
chat-models.md — LLM integration
prompt-templates.md — System prompts
retrieval.md — Vector memory
observability.md — Tracking memory usage