Retrieval (RAG)
Overview
Retrieval-Augmented Generation (RAG) grounds LLM responses in domain-specific knowledge by retrieving relevant context before generation. Spring AI's VectorStore abstraction enables semantic search across embeddings, supporting in-memory, PostgreSQL (pgvector), Pinecone, Weaviate, and other vector databases. Retrieval quality is the single biggest determinant of answer quality: the model cannot ground a response in context it never receives.
Key Concepts
RAG Architecture
┌─────────────────────────────────────────────────────────────┐
│                  RAG Pipeline Architecture                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  INGESTION (Offline)                                        │
│  ────────────────────                                       │
│  1. Load documents (PDF, HTML, Markdown, etc.)              │
│  2. Chunk into 512-token segments with overlap              │
│  3. Generate embeddings via EmbeddingModel                  │
│  4. Store in VectorStore with metadata                      │
│                                                             │
│  RETRIEVAL (Online)                                         │
│  ───────────────────                                        │
│  1. User query → embed query                                │
│  2. VectorStore.similaritySearch(query, k=5)                │
│  3. (Optional) Rerank results by relevance                  │
│  4. Assemble context from top-k documents                   │
│  5. Build prompt: system + context + query                  │
│  6. ChatModel.call(prompt) → Answer                         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
VectorStore Interface
public interface VectorStore {
    void add(List<Document> documents);
    Optional<Boolean> delete(List<String> idList);
    List<Document> similaritySearch(SearchRequest request);
}

public class SearchRequest {
    public static SearchRequest query(String query);
    public SearchRequest withTopK(int k);
    public SearchRequest withSimilarityThreshold(double threshold);
    public SearchRequest withFilterExpression(String filter);
}
Similarity Search vs. Hybrid Search
| Approach | Mechanism | Strengths | Weaknesses |
|---|---|---|---|
| Semantic (Vector) | Cosine similarity on embeddings | Handles synonyms, concepts | Misses exact keyword matches |
| Keyword (BM25) | Full-text search, TF-IDF | Fast, exact matches | No semantic understanding |
| Hybrid | Combine both + rerank | Best recall and precision | More complex, slower |
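The cosine-similarity mechanism named in the table can be shown in a few lines of plain Java. This is a minimal sketch with made-up three-dimensional vectors; production vector stores compute this score (or an approximate-nearest-neighbor estimate of it) over high-dimensional embeddings:

```java
// Cosine similarity: dot product of two vectors divided by the
// product of their magnitudes. Result is in [-1, 1]; higher = closer.
public class CosineSimilarity {
    public static double cosine(double[] a, double[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("dimension mismatch");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] query = {0.1, 0.8, 0.3}; // toy embeddings for illustration
        double[] doc   = {0.2, 0.7, 0.4};
        System.out.printf("similarity = %.4f%n", cosine(query, doc));
    }
}
```

Note that cosine similarity only compares directions, which is why exact keyword matches (IDs, product codes) can score poorly: they may be semantically "close" to many unrelated strings.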
Best Practices
1. Chunk Documents with Overlap
Preserve context across chunk boundaries.
@Service
public class DocumentChunker {

    private static final int CHUNK_SIZE = 512; // tokens
    private static final int OVERLAP = 50;     // tokens

    public List<String> chunk(String document) {
        List<String> chunks = new ArrayList<>();
        String[] sentences = document.split("\\. ");
        StringBuilder currentChunk = new StringBuilder();
        int currentTokens = 0;
        for (String sentence : sentences) {
            int sentenceTokens = estimateTokens(sentence);
            if (currentTokens + sentenceTokens > CHUNK_SIZE) {
                chunks.add(currentChunk.toString());
                // Keep last OVERLAP tokens for next chunk
                currentChunk = new StringBuilder(
                    lastNTokens(currentChunk.toString(), OVERLAP)
                );
                currentTokens = OVERLAP;
            }
            currentChunk.append(sentence).append(". ");
            currentTokens += sentenceTokens;
        }
        if (currentTokens > 0) {
            chunks.add(currentChunk.toString());
        }
        return chunks;
    }
}
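The chunker above relies on two helpers, estimateTokens and lastNTokens, that are not shown. One plausible, self-contained way to implement them (the four-characters-per-token rule and the words-as-tokens overlap are rough heuristics, not a real tokenizer; exact counts depend on the embedding model's tokenizer):

```java
import java.util.Arrays;

// Plausible implementations of the helpers used by DocumentChunker.
// Both are approximations; use the model's tokenizer for exact counts.
public class TokenHelpers {
    // Rough rule of thumb for English text: ~4 characters per token.
    public static int estimateTokens(String text) {
        return Math.max(1, text.length() / 4);
    }

    // Keep roughly the last n tokens by taking trailing words
    // (words ≈ tokens; close enough for a small overlap window).
    public static String lastNTokens(String text, int n) {
        String[] words = text.trim().split("\\s+");
        int from = Math.max(0, words.length - n);
        return String.join(" ", Arrays.copyOfRange(words, from, words.length));
    }
}
```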
2. Store Metadata for Filtering
Enable filtered retrieval (e.g., by date, category, author).
@Service
public class DocumentIngestionService {

    private final VectorStore vectorStore;
    private final EmbeddingModel embeddingModel;

    public void ingest(Document doc) {
        List<String> chunks = chunkDocument(doc);
        for (int i = 0; i < chunks.size(); i++) {
            Map<String, Object> metadata = Map.of(
                "source", doc.getSource(),
                "category", doc.getCategory(),
                "created_at", doc.getCreatedAt().toString(),
                "chunk_index", i,
                "total_chunks", chunks.size()
            );
            vectorStore.add(List.of(
                new org.springframework.ai.vectorstore.Document(
                    chunks.get(i),
                    metadata
                )
            ));
        }
    }
}
3. Use Hybrid Retrieval for Best Results
Combine semantic and keyword search, then rerank.
@Service
public class HybridRetrievalService {

    private final VectorStore vectorStore;
    private final FullTextSearchEngine fullTextSearch;
    private final RerankerModel reranker;

    public List<Document> retrieve(String query, int topK) {
        // 1. Semantic search (top 2·k candidates)
        List<Document> semanticResults = vectorStore.similaritySearch(
            SearchRequest.query(query).withTopK(topK * 2)
        );
        // 2. Keyword search (top 2·k candidates)
        List<Document> keywordResults = fullTextSearch.search(query, topK * 2);
        // 3. Merge and deduplicate
        Set<Document> merged = new LinkedHashSet<>();
        merged.addAll(semanticResults);
        merged.addAll(keywordResults);
        // 4. Rerank with cross-encoder
        List<Document> reranked = reranker.rerank(
            query,
            new ArrayList<>(merged),
            topK
        );
        return reranked;
    }
}
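If no cross-encoder is available, a common alternative for the merge-and-rerank step is reciprocal rank fusion (RRF), which combines ranked lists using only rank positions, so incomparable raw scores (cosine vs. BM25) never need to be normalized. A minimal sketch over document IDs; k = 60 is the conventional damping constant, and nothing here is Spring AI API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Reciprocal Rank Fusion: each item scores 1/(k + rank) in every list
// that contains it; scores are summed and items sorted descending.
public class Rrf {
    private static final int K = 60; // standard damping constant

    @SafeVarargs
    public static List<String> fuse(int topK, List<String>... rankedLists) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> list : rankedLists) {
            for (int rank = 0; rank < list.size(); rank++) {
                // rank is 0-based, so the first item scores 1/(K + 1)
                scores.merge(list.get(rank), 1.0 / (K + rank + 1), Double::sum);
            }
        }
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(topK)
                .map(Map.Entry::getKey)
                .toList();
    }
}
```

An item ranked moderately well in both lists typically beats an item ranked first in only one, which is exactly the recall/precision trade the hybrid row in the earlier table describes.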
4. Implement Reranking for Precision
Use a cross-encoder or LLM to reorder results by relevance.
@Service
public class LLMReranker {

    private final ChatModel chatModel;

    public List<Document> rerank(String query, List<Document> documents, int topK) {
        String prompt = String.format("""
            Rank the following documents by relevance to the query.
            Output ONLY a comma-separated list of document IDs in order.
            Query: %s
            Documents:
            %s
            Ranked IDs:
            """,
            query,
            formatDocuments(documents)
        );
        String response = chatModel.call(new Prompt(prompt))
            .getResult()
            .getOutput()
            .getContent();
        List<String> rankedIds = Arrays.asList(response.trim().split(","));
        return rankedIds.stream()
            .limit(topK)
            .map(id -> findDocumentById(id.trim(), documents))
            .filter(Objects::nonNull)
            .collect(Collectors.toList());
    }
}
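Despite the "Output ONLY" instruction, models sometimes wrap the list in prose ("Sure! Here are the IDs: …"), so splitting the raw response on commas can yield junk tokens. A defensive parsing sketch; the doc-&lt;number&gt; ID format is an assumption for illustration, and in practice the pattern should match whatever ID scheme formatDocuments emits:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extract document IDs by pattern rather than trusting the LLM to emit
// a clean comma-separated list. Preserves first-occurrence order and
// drops duplicates; findDocumentById's null-filter remains the backstop.
public class RankedIdParser {
    private static final Pattern ID = Pattern.compile("doc-\\d+"); // assumed ID format

    public static List<String> parse(String llmResponse) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID.matcher(llmResponse);
        while (m.find()) {
            if (!ids.contains(m.group())) ids.add(m.group());
        }
        return ids;
    }
}
```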
5. Monitor Retrieval Quality
Track precision, recall, and relevance scores.
@Component
public class RetrievalMetrics {

    private final MeterRegistry registry;

    public void recordRetrievalQuality(
            String query,
            List<Document> retrieved,
            List<Document> relevant
    ) {
        Set<String> retrievedIds = retrieved.stream()
            .map(Document::getId)
            .collect(Collectors.toSet());
        Set<String> relevantIds = relevant.stream()
            .map(Document::getId)
            .collect(Collectors.toSet());
        // Precision: relevant retrieved / total retrieved (0 if nothing retrieved)
        int truePositives = (int) retrievedIds.stream()
            .filter(relevantIds::contains)
            .count();
        double precision = retrieved.isEmpty() ? 0.0 : (double) truePositives / retrieved.size();
        // Recall: relevant retrieved / total relevant (0 if nothing is relevant)
        double recall = relevant.isEmpty() ? 0.0 : (double) truePositives / relevant.size();
        registry.gauge("retrieval.precision", precision);
        registry.gauge("retrieval.recall", recall);
    }
}
Code Examples
Example 1: Basic RAG Pipeline
@Service
public class QAService {

    private final ChatModel chatModel;
    private final VectorStore vectorStore;
    private final PromptTemplate qaTemplate;

    public String answer(String question) {
        // 1. Retrieve context
        List<Document> context = vectorStore.similaritySearch(
            SearchRequest.query(question).withTopK(3)
        );
        // 2. Build prompt
        Prompt prompt = qaTemplate.create(Map.of(
            "question", question,
            "context", context.stream()
                .map(Document::getContent)
                .collect(Collectors.joining("\n\n"))
        ));
        // 3. Generate answer
        return chatModel.call(prompt)
            .getResult()
            .getOutput()
            .getContent();
    }
}
Template: prompts/qa-with-context.st
Answer the question using ONLY the following context.
If the answer is not in the context, say "I don't have that information."
Context:
<context>
Question: <question>
Answer:
✅ Good for: Simple Q&A over documents
❌ Not good for: Multi-hop reasoning (requires agentic retrieval)
Example 2: Filtered Retrieval by Metadata
@Service
public class PolicySearchService {

    private final VectorStore vectorStore;

    public List<Document> searchPolicies(
            String query,
            String department,
            LocalDate after
    ) {
        String filter = String.format(
            "category == '%s' && created_at > '%s'",
            department,
            after.toString()
        );
        return vectorStore.similaritySearch(
            SearchRequest.query(query)
                .withTopK(5)
                .withFilterExpression(filter)
        );
    }
}
✅ Good for: Multi-tenant systems, temporal filtering
❌ Not good for: Complex boolean logic (use hybrid search)
Example 3: Document Ingestion with Deduplication
@Service
public class DocumentIngestionService {

    private final VectorStore vectorStore;
    private final EmbeddingModel embeddingModel;

    public void ingestWithDeduplication(List<Document> documents) {
        for (Document doc : documents) {
            // Check if similar document already exists
            List<Document> similar = vectorStore.similaritySearch(
                SearchRequest.query(doc.getContent())
                    .withTopK(1)
                    .withSimilarityThreshold(0.95) // 95% similar = duplicate
            );
            if (similar.isEmpty()) {
                vectorStore.add(List.of(doc));
                log.info("Ingested new document: {}", doc.getId());
            } else {
                log.info("Skipped duplicate document: {}", doc.getId());
            }
        }
    }
}
✅ Good for: Avoiding duplicate content
❌ Not good for: Large-scale ingestion (one similarity search per document; use content hashing or a Bloom filter)
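For exact duplicates, a content hash avoids the per-document similarity search entirely. This sketch keeps SHA-256 digests in an in-memory set (a Bloom filter would trade exactness for memory at larger scale; near-duplicates still need an embedding-similarity check or MinHash/SimHash):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Exact-duplicate detection via content hashing: O(1) per document
// instead of one vector search per document. Catches only byte-identical
// content, so it complements rather than replaces the similarity check.
public class ContentDeduplicator {
    private final Set<String> seenHashes = new HashSet<>();

    public boolean isDuplicate(String content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            // add() returns false when the hash was already present
            return !seenHashes.add(hex.toString());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

In a deployment with multiple ingestion workers, the hash set would live in a shared store (e.g. a database unique index on the digest) rather than in process memory.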
Example 4: Multi-Query Retrieval
@Service
public class MultiQueryRetrievalService {

    private final ChatModel chatModel;
    private final VectorStore vectorStore;

    public List<Document> retrieve(String query, int topK) {
        // 1. Generate multiple query variations
        List<String> queryVariations = generateQueryVariations(query);
        // 2. Retrieve for each variation
        Set<Document> allResults = new LinkedHashSet<>();
        for (String variation : queryVariations) {
            List<Document> results = vectorStore.similaritySearch(
                SearchRequest.query(variation).withTopK(topK)
            );
            allResults.addAll(results);
        }
        // 3. Return the first topK in insertion order (a reranking step could rescore here)
        return allResults.stream()
            .limit(topK)
            .collect(Collectors.toList());
    }

    private List<String> generateQueryVariations(String query) {
        String prompt = String.format("""
            Generate 3 alternative phrasings of this question:
            Original: %s
            Alternatives (one per line):
            """, query);
        String response = chatModel.call(new Prompt(prompt))
            .getResult()
            .getOutput()
            .getContent();
        return Arrays.asList(response.split("\n"));
    }
}
✅ Good for: Improving recall
❌ Not good for: Latency-sensitive applications (3x slower)
Example 5: Contextual Compression
@Service
public class CompressedRetrievalService {

    private final VectorStore vectorStore;
    private final ChatModel chatModel;

    public String retrieveAndCompress(String query, int topK) {
        // 1. Retrieve more documents than needed
        List<Document> candidates = vectorStore.similaritySearch(
            SearchRequest.query(query).withTopK(topK * 3)
        );
        // 2. Extract only relevant sentences from each document
        List<String> compressedChunks = new ArrayList<>();
        for (Document doc : candidates) {
            String compressed = extractRelevantSentences(query, doc.getContent());
            if (!compressed.isEmpty()) {
                compressedChunks.add(compressed);
            }
        }
        // 3. Return concatenated compressed context
        return compressedChunks.stream()
            .limit(topK)
            .collect(Collectors.joining("\n\n"));
    }

    private String extractRelevantSentences(String query, String document) {
        String prompt = String.format("""
            Extract ONLY the sentences from the document that are relevant to the query.
            Query: %s
            Document:
            %s
            Relevant sentences:
            """, query, document);
        return chatModel.call(new Prompt(prompt))
            .getResult()
            .getOutput()
            .getContent();
    }
}
✅ Good for: Maximizing context window usage
❌ Not good for: Cost-sensitive applications (extra LLM calls)
Anti-Patterns
❌ No Chunking (Embedding Entire Documents)
// DON'T: Embed 10,000-word document as one vector
String fullDoc = loadEntireDocument();
vectorStore.add(List.of(new Document(fullDoc)));
Why: Embedding models have token limits; a single vector for a long document truncates past the limit and blurs its distinct topics into one average.
✅ DO: Chunk into 512-token segments
List<String> chunks = chunkDocument(fullDoc, 512);
for (String chunk : chunks) {
    vectorStore.add(List.of(new Document(chunk, metadata)));
}
❌ Ignoring Metadata
// DON'T: Store only text
vectorStore.add(List.of(new Document(text)));
Why: Cannot filter by date, category, or source.
✅ DO: Include metadata
vectorStore.add(List.of(new Document(
    text,
    Map.of(
        "source", "policy-manual",
        "category", "hr",
        "updated_at", "2026-01-01"
    )
)));
❌ Using Only Semantic Search
// DON'T: Miss exact keyword matches
List<Document> results = vectorStore.similaritySearch(query);
Why: Semantic search may miss specific product names, IDs, codes.
✅ DO: Use hybrid search
List<Document> semanticResults = vectorStore.similaritySearch(query);
List<Document> keywordResults = fullTextSearch.search(query);
List<Document> merged = mergeAndRerank(semanticResults, keywordResults);
❌ No Reranking
// DON'T: Trust vector similarity scores directly
return vectorStore.similaritySearch(SearchRequest.query(query).withTopK(5));
Why: Cosine similarity is a weak proxy for relevance.
✅ DO: Rerank with cross-encoder or LLM
List<Document> candidates = vectorStore.similaritySearch(
    SearchRequest.query(query).withTopK(20)
);
return reranker.rerank(query, candidates, 5);
Testing Strategies
Unit Testing Chunking Logic
@Test
void shouldChunkDocumentWithOverlap() {
    String doc = "Sentence 1. Sentence 2. Sentence 3. Sentence 4.";
    // Assumes an overload taking (sentencesPerChunk, overlapSentences)
    List<String> chunks = chunker.chunk(doc, 2, 1);
    assertEquals(3, chunks.size());
    assertTrue(chunks.get(1).contains("Sentence 2")); // Overlap preserved
}
Integration Testing Retrieval
@SpringBootTest
class RetrievalIntegrationTest {

    @Autowired
    private VectorStore vectorStore;

    @BeforeEach
    void setup() {
        vectorStore.add(List.of(
            new Document("Java is a programming language"),
            new Document("Python is also a programming language"),
            new Document("Spring Boot simplifies Java development")
        ));
    }

    @Test
    void shouldRetrieveRelevantDocuments() {
        List<Document> results = vectorStore.similaritySearch(
            SearchRequest.query("What is Java?").withTopK(2)
        );
        assertEquals(2, results.size());
        assertTrue(results.get(0).getContent().contains("Java"));
    }
}
Golden Dataset Evaluation
@Test
void shouldAchieveTargetRecall() {
    List<TestCase> goldenDataset = loadGoldenDataset();
    int correctRetrievals = 0;
    for (TestCase test : goldenDataset) {
        List<Document> retrieved = vectorStore.similaritySearch(
            SearchRequest.query(test.getQuery()).withTopK(5)
        );
        if (retrieved.stream().anyMatch(d ->
                test.getRelevantDocIds().contains(d.getId()))) {
            correctRetrievals++;
        }
    }
    double recall = (double) correctRetrievals / goldenDataset.size();
    assertTrue(recall >= 0.90, "Recall below 90%: " + recall);
}
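The recall test above counts a query as correct if any relevant document appears in the top k. Mean reciprocal rank (MRR) additionally rewards ranking relevant documents higher, which matters because LLMs attend more reliably to context placed first. A small, dependency-free evaluator sketch:

```java
import java.util.List;
import java.util.Set;

// Mean Reciprocal Rank: for each query, score 1/(rank of the first
// relevant result), or 0 if no relevant result was retrieved; then
// average across all queries. 1.0 means a relevant doc was always first.
public class MrrEvaluator {
    public static double mrr(List<List<String>> retrievedPerQuery,
                             List<Set<String>> relevantPerQuery) {
        double sum = 0;
        for (int q = 0; q < retrievedPerQuery.size(); q++) {
            List<String> retrieved = retrievedPerQuery.get(q);
            Set<String> relevant = relevantPerQuery.get(q);
            for (int rank = 0; rank < retrieved.size(); rank++) {
                if (relevant.contains(retrieved.get(rank))) {
                    sum += 1.0 / (rank + 1); // ranks are 1-based in the formula
                    break;
                }
            }
        }
        return sum / retrievedPerQuery.size();
    }
}
```

Running this over the same golden dataset alongside recall gives a fuller picture: recall can stay flat while MRR drops, signaling that reranking (not retrieval) has regressed.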
Performance Considerations
| Concern | Strategy |
|---|---|
| Latency | Use approximate nearest neighbor (HNSW, IVF); cache frequent queries |
| Scale | Shard vector store; use dimensionality reduction if needed |
| Cost | Embed offline; batch queries; use smaller embedding models |
| Recall | Increase k and rerank; use hybrid search |
| Precision | Use reranking; tune similarity threshold |
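The "cache frequent queries" strategy in the Latency row can be sketched with an access-ordered LinkedHashMap acting as an LRU cache. Keying on the raw query string is an assumption for this sketch; a production system might normalize the query or key on its embedding, and would add a TTL so stale results expire after re-ingestion:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Tiny LRU cache for retrieval results. LinkedHashMap in access-order
// mode moves entries to the tail on get(), so removeEldestEntry evicts
// the least recently used entry once capacity is exceeded.
public class QueryCache<V> {
    private final Map<String, V> cache;

    public QueryCache(int capacity) {
        this.cache = new LinkedHashMap<>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > capacity;
            }
        };
    }

    public synchronized V get(String query) { return cache.get(query); }
    public synchronized void put(String query, V results) { cache.put(query, results); }
    public synchronized int size() { return cache.size(); }
}
```

A cache hit skips both the query-embedding call and the vector search, so even a small cache helps when a few queries dominate traffic.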
References
- Spring AI Documentation - VectorStore
- LangChain RAG Guide
- Pinecone: What is RAG?
- BEIR Benchmark — Retrieval evaluation
Related Skills
- embedding-models.md — Vector generation
- chat-models.md — Response generation
- prompt-templates.md — RAG prompt design
- evaluation.md — Measuring retrieval quality
- ai-ml/rag-patterns.md — Advanced RAG architectures