Retrieval (RAG)

Spring AI RAG v1.0.0


Overview

Retrieval-Augmented Generation (RAG) grounds LLM responses in domain-specific knowledge by retrieving relevant context before generation. Spring AI’s VectorStore abstraction enables semantic search across embeddings, supporting in-memory, PostgreSQL (pgvector), Pinecone, Weaviate, and other vector databases. Effective retrieval is the most critical component of RAG quality.


Key Concepts

RAG Architecture

┌─────────────────────────────────────────────────────────────┐
│                  RAG Pipeline Architecture                   │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  INGESTION (Offline)                                         │
│  ────────────────────                                        │
│  1. Load documents (PDF, HTML, Markdown, etc.)               │
│  2. Chunk into 512-token segments with overlap               │
│  3. Generate embeddings via EmbeddingModel                   │
│  4. Store in VectorStore with metadata                       │
│                                                              │
│  RETRIEVAL (Online)                                          │
│  ───────────────────                                         │
│  1. User query → embed query                                 │
│  2. VectorStore.similaritySearch(query, k=5)                 │
│  3. (Optional) Rerank results by relevance                   │
│  4. Assemble context from top-k documents                    │
│  5. Build prompt: system + context + query                   │
│  6. ChatModel.call(prompt) → Answer                          │
│                                                              │
└─────────────────────────────────────────────────────────────┘

VectorStore Interface

public interface VectorStore {
    void add(List<Document> documents);
    Optional<Boolean> delete(List<String> idList);
    List<Document> similaritySearch(SearchRequest request);
}

public class SearchRequest {
    public static SearchRequest query(String query);
    public SearchRequest withTopK(int k);
    public SearchRequest withSimilarityThreshold(double threshold);
    public SearchRequest withFilterExpression(String filter);
}

Retrieval Approaches

Approach          | Mechanism                       | Strengths                  | Weaknesses
Semantic (Vector) | Cosine similarity on embeddings | Handles synonyms, concepts | Misses exact keyword matches
Keyword (BM25)    | Full-text search, TF-IDF        | Fast, exact matches        | No semantic understanding
Hybrid            | Combine both + rerank           | Best recall and precision  | More complex, slower
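Because cosine-similarity scores and BM25 scores are not directly comparable, hybrid merging is usually done by rank rather than by raw score. Below is a minimal sketch of reciprocal rank fusion (RRF) over document IDs, using only the standard library; the class name, the ID lists, and the k = 60 damping constant are illustrative assumptions, not a Spring AI API.

```java
import java.util.*;

// Sketch: reciprocal rank fusion (RRF) merges two ranked result lists
// by rank position alone -- no score normalization needed.
public class RrfFusion {

    public static List<String> fuse(List<String> semantic, List<String> keyword, int topK) {
        final int k = 60; // conventional RRF damping constant
        Map<String, Double> scores = new HashMap<>();
        addRanks(scores, semantic, k);
        addRanks(scores, keyword, k);
        return scores.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .limit(topK)
            .map(Map.Entry::getKey)
            .toList();
    }

    private static void addRanks(Map<String, Double> scores, List<String> ids, int k) {
        for (int rank = 0; rank < ids.size(); rank++) {
            // Each list contributes 1 / (k + rank); a document appearing in
            // both lists accumulates both contributions.
            scores.merge(ids.get(rank), 1.0 / (k + rank + 1), Double::sum);
        }
    }
}
```

A document ranked moderately in both lists thus beats one ranked high in only one, which is exactly the behavior the hybrid row above relies on.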

Best Practices

1. Chunk Documents with Overlap

Preserve context across chunk boundaries.

@Service
public class DocumentChunker {
    private static final int CHUNK_SIZE = 512; // tokens
    private static final int OVERLAP = 50;     // tokens
    
    public List<String> chunk(String document) {
        List<String> chunks = new ArrayList<>();
        String[] sentences = document.split("\\. ");
        
        StringBuilder currentChunk = new StringBuilder();
        int currentTokens = 0;
        
        for (String sentence : sentences) {
            int sentenceTokens = estimateTokens(sentence);
            
            if (currentTokens + sentenceTokens > CHUNK_SIZE && currentTokens > 0) {
                chunks.add(currentChunk.toString());
                
                // Keep the last OVERLAP tokens so context spans chunk boundaries
                currentChunk = new StringBuilder(
                    lastNTokens(currentChunk.toString(), OVERLAP)
                );
                currentTokens = OVERLAP;
            }
            
            currentChunk.append(sentence).append(". ");
            currentTokens += sentenceTokens;
        }
        
        if (currentTokens > 0) {
            chunks.add(currentChunk.toString());
        }
        
        return chunks;
    }
    
    // Rough heuristic: ~1 token per whitespace-separated word.
    // Swap in a real tokenizer for accurate counts.
    private int estimateTokens(String text) {
        return text.isBlank() ? 0 : text.split("\\s+").length;
    }
    
    private String lastNTokens(String text, int n) {
        String[] words = text.split("\\s+");
        int from = Math.max(0, words.length - n);
        return String.join(" ", Arrays.copyOfRange(words, from, words.length));
    }
}

2. Store Metadata for Filtering

Enable filtered retrieval (e.g., by date, category, author).

@Service
public class DocumentIngestionService {
    private final VectorStore vectorStore;
    private final EmbeddingModel embeddingModel;
    
    public void ingest(Document doc) {
        List<String> chunks = chunkDocument(doc);
        
        for (int i = 0; i < chunks.size(); i++) {
            Map<String, Object> metadata = Map.of(
                "source", doc.getSource(),
                "category", doc.getCategory(),
                "created_at", doc.getCreatedAt().toString(),
                "chunk_index", i,
                "total_chunks", chunks.size()
            );
            
            vectorStore.add(List.of(
                new org.springframework.ai.vectorstore.Document(
                    chunks.get(i),
                    metadata
                )
            ));
        }
    }
}
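
Because each chunk carries chunk_index and total_chunks, a retrieved chunk's neighbors can be fetched and spliced back in before prompt assembly, restoring context lost at chunk boundaries. A stdlib-only sketch of that expansion; the Chunk record and the index-keyed map stand in for a metadata-filtered VectorStore query (assumptions, not Spring AI API):

```java
import java.util.*;

// Sketch: expand a retrieved chunk with its immediate neighbors from the
// same source document, using the chunk_index metadata stored at ingestion.
public class NeighborExpansion {
    public record Chunk(String source, int index, String text) {}

    public static String expand(Chunk hit, Map<Integer, Chunk> chunksBySameSource) {
        StringBuilder context = new StringBuilder();
        // Concatenate the previous, matched, and next chunk when present
        for (int i = hit.index() - 1; i <= hit.index() + 1; i++) {
            Chunk neighbor = chunksBySameSource.get(i);
            if (neighbor != null) {
                context.append(neighbor.text()).append("\n");
            }
        }
        return context.toString().strip();
    }
}
```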

3. Use Hybrid Retrieval for Best Results

Combine semantic and keyword search, then rerank.

@Service
public class HybridRetrievalService {
    private final VectorStore vectorStore;
    private final FullTextSearchEngine fullTextSearch;
    private final RerankerModel reranker;
    
    public List<Document> retrieve(String query, int topK) {
        // 1. Semantic search (top 20)
        List<Document> semanticResults = vectorStore.similaritySearch(
            SearchRequest.query(query).withTopK(topK * 2)
        );
        
        // 2. Keyword search (top 20)
        List<Document> keywordResults = fullTextSearch.search(query, topK * 2);
        
        // 3. Merge and deduplicate
        Set<Document> merged = new LinkedHashSet<>();
        merged.addAll(semanticResults);
        merged.addAll(keywordResults);
        
        // 4. Rerank with cross-encoder
        List<Document> reranked = reranker.rerank(
            query,
            new ArrayList<>(merged),
            topK
        );
        
        return reranked;
    }
}

4. Implement Reranking for Precision

Use a cross-encoder or LLM to reorder results by relevance.

@Service
public class LLMReranker {
    private final ChatModel chatModel;
    
    public List<Document> rerank(String query, List<Document> documents, int topK) {
        String prompt = String.format("""
            Rank the following documents by relevance to the query.
            Output ONLY a comma-separated list of document IDs in order.
            
            Query: %s
            
            Documents:
            %s
            
            Ranked IDs:
            """,
            query,
            formatDocuments(documents)
        );
        
        String response = chatModel.call(new Prompt(prompt))
            .getResult()
            .getOutput()
            .getContent();
        
        List<String> rankedIds = Arrays.asList(response.trim().split(","));
        
        return rankedIds.stream()
            .limit(topK)
            .map(id -> findDocumentById(id.trim(), documents))
            .filter(Objects::nonNull)
            .collect(Collectors.toList());
    }
}
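
The split(",") parse above is fragile: models often pad rankings with whitespace, prose, or IDs that don't exist. A defensive parsing sketch, stdlib only with hypothetical names, that keeps only known IDs in model order and appends anything the model dropped:

```java
import java.util.*;

// Sketch: defensively parse an LLM's "ranked IDs" response. Unknown tokens
// are discarded; IDs the model omitted are appended so no candidate is lost.
public class RankParser {

    public static List<String> parseRanking(String response, List<String> knownIds) {
        Set<String> known = new LinkedHashSet<>(knownIds);
        List<String> ranked = new ArrayList<>();
        // Accept comma-, space-, or newline-separated tokens
        for (String token : response.split("[,\\s]+")) {
            String id = token.trim();
            if (known.remove(id)) {   // keep only real, not-yet-seen IDs
                ranked.add(id);
            }
        }
        ranked.addAll(known); // anything the model omitted goes to the end
        return ranked;
    }
}
```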

5. Monitor Retrieval Quality

Track precision, recall, and relevance scores.

@Component
public class RetrievalMetrics {
    private final MeterRegistry registry;
    
    public void recordRetrievalQuality(
        String query,
        List<Document> retrieved,
        List<Document> relevant
    ) {
        if (retrieved.isEmpty() || relevant.isEmpty()) {
            return; // avoid division by zero on empty result sets
        }
        
        Set<String> retrievedIds = retrieved.stream()
            .map(Document::getId)
            .collect(Collectors.toSet());
        
        Set<String> relevantIds = relevant.stream()
            .map(Document::getId)
            .collect(Collectors.toSet());
        
        // Precision: relevant retrieved / total retrieved
        int truePositives = (int) retrievedIds.stream()
            .filter(relevantIds::contains)
            .count();
        
        double precision = (double) truePositives / retrieved.size();
        
        // Recall: relevant retrieved / total relevant
        double recall = (double) truePositives / relevant.size();
        
        // Record as distribution summaries, not gauge(): a gauge registered
        // per call only samples a value that is immediately collectable
        registry.summary("retrieval.precision").record(precision);
        registry.summary("retrieval.recall").record(recall);
    }
}
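
The precision/recall arithmetic can be isolated as a stdlib-only sketch with guards for empty inputs; the class and method names are illustrative:

```java
import java.util.*;

// Sketch: precision = hits / retrieved, recall = hits / relevant,
// computed over document ID sets.
public class RetrievalScores {

    public static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) return 0.0;
        long hits = retrieved.stream().filter(relevant::contains).count();
        return (double) hits / retrieved.size();
    }

    public static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        long hits = retrieved.stream().filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }
}
```

For example, retrieving {a, b, c} when {b, c, d} is relevant yields precision 2/3 and recall 2/3.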

Code Examples

Example 1: Basic RAG Pipeline

@Service
public class QAService {
    private final ChatModel chatModel;
    private final VectorStore vectorStore;
    private final PromptTemplate qaTemplate;
    
    public String answer(String question) {
        // 1. Retrieve context
        List<Document> context = vectorStore.similaritySearch(
            SearchRequest.query(question).withTopK(3)
        );
        
        // 2. Build prompt
        Prompt prompt = qaTemplate.create(Map.of(
            "question", question,
            "context", context.stream()
                .map(Document::getContent)
                .collect(Collectors.joining("\n\n"))
        ));
        
        // 3. Generate answer
        return chatModel.call(prompt)
            .getResult()
            .getOutput()
            .getContent();
    }
}

Template: prompts/qa-with-context.st

Answer the question using ONLY the following context.
If the answer is not in the context, say "I don't have that information."

Context:
<context>

Question: <question>

Answer:

✅ Good for: Simple Q&A over documents
❌ Not good for: Multi-hop reasoning (requires agentic retrieval)


Example 2: Filtered Retrieval by Metadata

@Service
public class PolicySearchService {
    private final VectorStore vectorStore;
    
    public List<Document> searchPolicies(
        String query,
        String department,
        LocalDate after
    ) {
        String filter = String.format(
            "category == '%s' && created_at > '%s'",
            department,
            after.toString()
        );
        
        return vectorStore.similaritySearch(
            SearchRequest.query(query)
                .withTopK(5)
                .withFilterExpression(filter)
        );
    }
}

✅ Good for: Multi-tenant systems, temporal filtering
❌ Not good for: Complex boolean logic (use hybrid search)


Example 3: Document Ingestion with Deduplication

@Service
public class DocumentIngestionService {
    private final VectorStore vectorStore;
    private final EmbeddingModel embeddingModel;
    
    public void ingestWithDeduplication(List<Document> documents) {
        for (Document doc : documents) {
            // Check if similar document already exists
            List<Document> similar = vectorStore.similaritySearch(
                SearchRequest.query(doc.getContent())
                    .withTopK(1)
                    .withSimilarityThreshold(0.95) // 95% similar = duplicate
            );
            
            if (similar.isEmpty()) {
                vectorStore.add(List.of(doc));
                log.info("Ingested new document: {}", doc.getId());
            } else {
                log.info("Skipped duplicate document: {}", doc.getId());
            }
        }
    }
}

✅ Good for: Avoiding duplicate content
❌ Not good for: Large-scale ingestion (O(n) checks; use bloom filter)
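
A cheap pre-filter for the O(n) problem noted above: hash normalized content and skip byte-identical re-ingestion in O(1), falling back to the similarity check only for unseen hashes. Stdlib-only sketch; the normalization rules (lowercase, collapsed whitespace) are an assumption:

```java
import java.security.MessageDigest;
import java.util.*;

// Sketch: exact-duplicate detection via a SHA-256 content hash. Catches
// re-ingestion of identical text only; near-duplicates still need the
// embedding-based similarity check.
public class ContentHasher {
    private final Set<String> seen = new HashSet<>();

    public boolean isNew(String content) {
        return seen.add(sha256(normalize(content)));
    }

    private static String normalize(String text) {
        // Collapse whitespace and lowercase so trivial reformatting
        // doesn't defeat the hash
        return text.trim().toLowerCase().replaceAll("\\s+", " ");
    }

    private static String sha256(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(text.getBytes(java.nio.charset.StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```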


Example 4: Multi-Query Retrieval

@Service
public class MultiQueryRetrievalService {
    private final ChatModel chatModel;
    private final VectorStore vectorStore;
    
    public List<Document> retrieve(String query, int topK) {
        // 1. Generate multiple query variations
        List<String> queryVariations = generateQueryVariations(query);
        
        // 2. Retrieve for each variation
        Set<Document> allResults = new LinkedHashSet<>();
        for (String variation : queryVariations) {
            List<Document> results = vectorStore.similaritySearch(
                SearchRequest.query(variation).withTopK(topK)
            );
            allResults.addAll(results);
        }
        
        // 3. Return the first k results (insertion order; no cross-query scoring)
        return allResults.stream()
            .limit(topK)
            .collect(Collectors.toList());
    }
    
    private List<String> generateQueryVariations(String query) {
        String prompt = String.format("""
            Generate 3 alternative phrasings of this question:
            
            Original: %s
            
            Alternatives (one per line):
            """, query);
        
        String response = chatModel.call(new Prompt(prompt))
            .getResult()
            .getOutput()
            .getContent();
        
        // Include the original query and drop any blank lines the model emits
        List<String> variations = new ArrayList<>();
        variations.add(query);
        Arrays.stream(response.split("\n"))
            .map(String::trim)
            .filter(s -> !s.isEmpty())
            .forEach(variations::add);
        return variations;
    }
}
}

✅ Good for: Improving recall
❌ Not good for: Latency-sensitive applications (3x slower)
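
Note that the LinkedHashSet merge above deduplicates only if Document implements equals/hashCode the way you expect. A safer sketch dedupes explicitly by document ID; the Doc record below stands in for Spring AI's Document type (an assumption):

```java
import java.util.*;

// Sketch: merge results from several query variations, deduplicating by
// document ID. The first occurrence wins, preserving its (best) rank.
public class MultiQueryMerge {
    public record Doc(String id, String content) {}

    public static List<Doc> mergeById(List<List<Doc>> resultLists, int topK) {
        Map<String, Doc> byId = new LinkedHashMap<>();
        for (List<Doc> results : resultLists) {
            for (Doc doc : results) {
                byId.putIfAbsent(doc.id(), doc); // first occurrence wins
            }
        }
        return byId.values().stream().limit(topK).toList();
    }
}
```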


Example 5: Contextual Compression

@Service
public class CompressedRetrievalService {
    private final VectorStore vectorStore;
    private final ChatModel chatModel;
    
    public String retrieveAndCompress(String query, int topK) {
        // 1. Retrieve more documents than needed
        List<Document> candidates = vectorStore.similaritySearch(
            SearchRequest.query(query).withTopK(topK * 3)
        );
        
        // 2. Extract only relevant sentences from each document
        List<String> compressedChunks = new ArrayList<>();
        for (Document doc : candidates) {
            String compressed = extractRelevantSentences(query, doc.getContent());
            if (!compressed.isEmpty()) {
                compressedChunks.add(compressed);
            }
        }
        
        // 3. Return concatenated compressed context
        return compressedChunks.stream()
            .limit(topK)
            .collect(Collectors.joining("\n\n"));
    }
    
    private String extractRelevantSentences(String query, String document) {
        String prompt = String.format("""
            Extract ONLY the sentences from the document that are relevant to the query.
            
            Query: %s
            
            Document:
            %s
            
            Relevant sentences:
            """, query, document);
        
        return chatModel.call(new Prompt(prompt))
            .getResult()
            .getOutput()
            .getContent();
    }
}

✅ Good for: Maximizing context window usage
❌ Not good for: Cost-sensitive applications (extra LLM calls)


Anti-Patterns

❌ No Chunking (Embedding Entire Documents)

// DON'T: Embed 10,000-word document as one vector
String fullDoc = loadEntireDocument();
vectorStore.add(List.of(new Document(fullDoc)));

Why: Embedding models have token limits; long docs lose information.

✅ DO: Chunk into 512-token segments

List<String> chunks = chunkDocument(fullDoc, 512);
for (String chunk : chunks) {
    vectorStore.add(List.of(new Document(chunk, metadata)));
}

❌ Ignoring Metadata

// DON'T: Store only text
vectorStore.add(List.of(new Document(text)));

Why: Cannot filter by date, category, or source.

✅ DO: Include metadata

vectorStore.add(List.of(new Document(
    text,
    Map.of(
        "source", "policy-manual",
        "category", "hr",
        "updated_at", "2026-01-01"
    )
)));

❌ Semantic-Only Search

// DON'T: Rely on vector similarity alone
List<Document> results = vectorStore.similaritySearch(query);

Why: Semantic search may miss specific product names, IDs, codes.

✅ DO: Use hybrid search

List<Document> semanticResults = vectorStore.similaritySearch(query);
List<Document> keywordResults = fullTextSearch.search(query);
List<Document> merged = mergeAndRerank(semanticResults, keywordResults);

❌ No Reranking

// DON'T: Trust vector similarity scores directly
return vectorStore.similaritySearch(SearchRequest.query(query).withTopK(5));

Why: Cosine similarity is a weak proxy for relevance.

✅ DO: Rerank with cross-encoder or LLM

List<Document> candidates = vectorStore.similaritySearch(
    SearchRequest.query(query).withTopK(20)
);
return reranker.rerank(query, candidates, 5);

Testing Strategies

Unit Testing Chunking Logic

@Test
void shouldChunkDocumentWithOverlap() {
    String doc = "Sentence 1. Sentence 2. Sentence 3. Sentence 4.";
    List<String> chunks = chunker.chunk(doc, 2, 1); // 2 sentences, 1 overlap
    
    assertEquals(3, chunks.size());
    assertTrue(chunks.get(1).contains("Sentence 2")); // Overlap preserved
}

Integration Testing Retrieval

@SpringBootTest
class RetrievalIntegrationTest {
    @Autowired
    private VectorStore vectorStore;
    
    @BeforeEach
    void setup() {
        vectorStore.add(List.of(
            new Document("Java is a programming language"),
            new Document("Python is also a programming language"),
            new Document("Spring Boot simplifies Java development")
        ));
    }
    
    @Test
    void shouldRetrieveRelevantDocuments() {
        List<Document> results = vectorStore.similaritySearch(
            SearchRequest.query("What is Java?").withTopK(2)
        );
        
        assertEquals(2, results.size());
        assertTrue(results.get(0).getContent().contains("Java"));
    }
}

Golden Dataset Evaluation

@Test
void shouldAchieveTargetRecall() {
    List<TestCase> goldenDataset = loadGoldenDataset();
    int correctRetrievals = 0;
    
    for (TestCase test : goldenDataset) {
        List<Document> retrieved = vectorStore.similaritySearch(
            SearchRequest.query(test.getQuery()).withTopK(5)
        );
        
        if (retrieved.stream().anyMatch(d -> 
            test.getRelevantDocIds().contains(d.getId()))) {
            correctRetrievals++;
        }
    }
    
    double recall = (double) correctRetrievals / goldenDataset.size();
    assertTrue(recall >= 0.90, "Recall below 90%: " + recall);
}

Performance Considerations

Concern   | Strategy
Latency   | Use approximate nearest neighbor (HNSW, IVF); cache frequent queries
Scale     | Shard the vector store; use dimensionality reduction if needed
Cost      | Embed offline; batch queries; use smaller embedding models
Recall    | Increase k and rerank; use hybrid search
Precision | Use reranking; tune the similarity threshold
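
The "cache frequent queries" strategy can be as simple as an access-ordered LinkedHashMap. A stdlib-only LRU sketch; no TTL and single-JVM only, both simplifying assumptions:

```java
import java.util.*;

// Sketch: bounded LRU cache for retrieval results, keyed by the raw query
// string. An access-ordered LinkedHashMap evicts the least recently used
// entry once the bound is exceeded.
public class QueryCache<V> {
    private final Map<String, V> cache;

    public QueryCache(int maxEntries) {
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public synchronized V getOrCompute(String query, java.util.function.Function<String, V> loader) {
        // Hits refresh recency; misses invoke the loader (e.g. the retriever)
        return cache.computeIfAbsent(query, loader);
    }
}
```

In production you would also normalize the query before using it as a key and invalidate on re-ingestion.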
