Retrieval (RAG)
Overview
Retrieval-Augmented Generation (RAG) grounds LLM responses in domain-specific knowledge by retrieving relevant context before generation. Spring AI's VectorStore abstraction enables semantic search across embeddings, supporting in-memory, PostgreSQL (pgvector), Pinecone, Weaviate, and other vector databases. Retrieval quality is the single biggest determinant of answer quality: the model cannot ground a response in context it never receives.
Key Concepts
RAG Architecture
┌─────────────────────────────────────────────────────────────┐
│                  RAG Pipeline Architecture                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  INGESTION (Offline)                                        │
│  ────────────────────                                       │
│  1. Load documents (PDF, HTML, Markdown, etc.)              │
│  2. Chunk into 512-token segments with overlap              │
│  3. Generate embeddings via EmbeddingModel                  │
│  4. Store in VectorStore with metadata                      │
│                                                             │
│  RETRIEVAL (Online)                                         │
│  ───────────────────                                        │
│  1. User query → embed query                                │
│  2. VectorStore.similaritySearch(query, k=5)                │
│  3. (Optional) Rerank results by relevance                  │
│  4. Assemble context from top-k documents                   │
│  5. Build prompt: system + context + query                  │
│  6. ChatModel.call(prompt) → Answer                         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
VectorStore Interface
public interface VectorStore {
    void add(List<Document> documents);
    Optional<Boolean> delete(List<String> idList);
    List<Document> similaritySearch(SearchRequest request);
}

public class SearchRequest {
    public static SearchRequest query(String query);
    public SearchRequest withTopK(int k);
    public SearchRequest withSimilarityThreshold(double threshold);
    public SearchRequest withFilterExpression(String filter);
}
Similarity Search vs. Hybrid Search
| Approach | Mechanism | Strengths | Weaknesses |
|---|---|---|---|
| Semantic (Vector) | Cosine similarity on embeddings | Handles synonyms, concepts | Misses exact keyword matches |
| Keyword (BM25) | Full-text search, TF-IDF | Fast, exact matches | No semantic understanding |
| Hybrid | Combine both + rerank | Best recall and precision | More complex, slower |
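The cosine-similarity mechanism named in the table can be shown in a few lines of plain Java. This is a minimal sketch with made-up three-dimensional vectors; production vector stores compute this score (or an approximate-nearest-neighbor estimate of it) over high-dimensional embeddings:

```java
// Cosine similarity: dot product of two vectors divided by the
// product of their magnitudes. Result is in [-1, 1]; higher = closer.
public class CosineSimilarity {
    public static double cosine(double[] a, double[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("dimension mismatch");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] query = {0.1, 0.8, 0.3}; // toy embeddings for illustration
        double[] doc   = {0.2, 0.7, 0.4};
        System.out.printf("similarity = %.4f%n", cosine(query, doc));
    }
}
```

Note that cosine similarity only compares directions, which is why exact keyword matches (IDs, product codes) can score poorly: they may be semantically "close" to many unrelated strings.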
Best Practices
1. Chunk Documents with Overlap
Preserve context across chunk boundaries.
@Service
public class DocumentChunker {

    private static final int CHUNK_SIZE = 512; // tokens
    private static final int OVERLAP = 50;     // tokens

    public List<String> chunk(String document) {
        List<String> chunks = new ArrayList<>();
        String[] sentences = document.split("\\. ");
        StringBuilder currentChunk = new StringBuilder();
        int currentTokens = 0;
        for (String sentence : sentences) {
            int sentenceTokens = estimateTokens(sentence);
            if (currentTokens + sentenceTokens > CHUNK_SIZE) {
                chunks.add(currentChunk.toString());
                // Keep last OVERLAP tokens for next chunk
                currentChunk = new StringBuilder(
                    lastNTokens(currentChunk.toString(), OVERLAP)
                );
                currentTokens = OVERLAP;
            }
            currentChunk.append(sentence).append(". ");
            currentTokens += sentenceTokens;
        }
        if (currentTokens > 0) {
            chunks.add(currentChunk.toString());
        }
        return chunks;
    }
}
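The chunker above relies on two helpers, estimateTokens and lastNTokens, that are not shown. One plausible, self-contained way to implement them (the four-characters-per-token rule and the words-as-tokens overlap are rough heuristics, not a real tokenizer; exact counts depend on the embedding model's tokenizer):

```java
import java.util.Arrays;

// Plausible implementations of the helpers used by DocumentChunker.
// Both are approximations; use the model's tokenizer for exact counts.
public class TokenHelpers {
    // Rough rule of thumb for English text: ~4 characters per token.
    public static int estimateTokens(String text) {
        return Math.max(1, text.length() / 4);
    }

    // Keep roughly the last n tokens by taking trailing words
    // (words ≈ tokens; close enough for a small overlap window).
    public static String lastNTokens(String text, int n) {
        String[] words = text.trim().split("\\s+");
        int from = Math.max(0, words.length - n);
        return String.join(" ", Arrays.copyOfRange(words, from, words.length));
    }
}
```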
2. Store Metadata for Filtering
Enable filtered retrieval (e.g., by date, category, author).
@Service
public class DocumentIngestionService {

    private final VectorStore vectorStore;
    private final EmbeddingModel embeddingModel;

    public void ingest(Document doc) {
        List<String> chunks = chunkDocument(doc);
        for (int i = 0; i < chunks.size(); i++) {
            Map<String, Object> metadata = Map.of(
                "source", doc.getSource(),
                "category", doc.getCategory(),
                "created_at", doc.getCreatedAt().toString(),
                "chunk_index", i,
                "total_chunks", chunks.size()
            );
            vectorStore.add(List.of(
                new org.springframework.ai.vectorstore.Document(
                    chunks.get(i),
                    metadata
                )
            ));
        }
    }
}
3. Use Hybrid Retrieval for Best Results
Combine semantic and keyword search, then rerank.
@Service
public class HybridRetrievalService {

    private final VectorStore vectorStore;
    private final FullTextSearchEngine fullTextSearch;
    private final RerankerModel reranker;

    public List<Document> retrieve(String query, int topK) {
        // 1. Semantic search (top 2·k candidates)
        List<Document> semanticResults = vectorStore.similaritySearch(
            SearchRequest.query(query).withTopK(topK * 2)
        );
        // 2. Keyword search (top 2·k candidates)
        List<Document> keywordResults = fullTextSearch.search(query, topK * 2);
        // 3. Merge and deduplicate
        Set<Document> merged = new LinkedHashSet<>();
        merged.addAll(semanticResults);
        merged.addAll(keywordResults);
        // 4. Rerank with cross-encoder
        List<Document> reranked = reranker.rerank(
            query,
            new ArrayList<>(merged),
            topK
        );
        return reranked;
    }
}
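If no cross-encoder is available, a common alternative for the merge-and-rerank step is reciprocal rank fusion (RRF), which combines ranked lists using only rank positions, so incomparable raw scores (cosine vs. BM25) never need to be normalized. A minimal sketch over document IDs; k = 60 is the conventional damping constant, and nothing here is Spring AI API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Reciprocal Rank Fusion: each item scores 1/(k + rank) in every list
// that contains it; scores are summed and items sorted descending.
public class Rrf {
    private static final int K = 60; // standard damping constant

    @SafeVarargs
    public static List<String> fuse(int topK, List<String>... rankedLists) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> list : rankedLists) {
            for (int rank = 0; rank < list.size(); rank++) {
                // rank is 0-based, so the first item scores 1/(K + 1)
                scores.merge(list.get(rank), 1.0 / (K + rank + 1), Double::sum);
            }
        }
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(topK)
                .map(Map.Entry::getKey)
                .toList();
    }
}
```

An item ranked moderately well in both lists typically beats an item ranked first in only one, which is exactly the recall/precision trade the hybrid row in the earlier table describes.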
4. Implement Reranking for Precision
Use a cross-encoder or LLM to reorder results by relevance.
@Service
public class LLMReranker {

    private final ChatModel chatModel;

    public List<Document> rerank(String query, List<Document> documents, int topK) {
        String prompt = String.format("""
            Rank the following documents by relevance to the query.
            Output ONLY a comma-separated list of document IDs in order.
            Query: %s
            Documents:
            %s
            Ranked IDs:
            """,
            query,
            formatDocuments(documents)
        );
        String response = chatModel.call(new Prompt(prompt))
            .getResult()
            .getOutput()
            .getContent();
        List<String> rankedIds = Arrays.asList(response.trim().split(","));
        return rankedIds.stream()
            .limit(topK)
            .map(id -> findDocumentById(id.trim(), documents))
            .filter(Objects::nonNull)
            .collect(Collectors.toList());
    }
}
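Despite the "Output ONLY" instruction, models sometimes wrap the list in prose ("Sure! Here are the IDs: …"), so splitting the raw response on commas can yield junk tokens. A defensive parsing sketch; the doc-&lt;number&gt; ID format is an assumption for illustration, and in practice the pattern should match whatever ID scheme formatDocuments emits:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extract document IDs by pattern rather than trusting the LLM to emit
// a clean comma-separated list. Preserves first-occurrence order and
// drops duplicates; findDocumentById's null-filter remains the backstop.
public class RankedIdParser {
    private static final Pattern ID = Pattern.compile("doc-\\d+"); // assumed ID format

    public static List<String> parse(String llmResponse) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID.matcher(llmResponse);
        while (m.find()) {
            if (!ids.contains(m.group())) ids.add(m.group());
        }
        return ids;
    }
}
```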
5. Monitor Retrieval Quality
Track precision, recall, and relevance scores.
@Component
public class RetrievalMetrics {

    private final MeterRegistry registry;

    public void recordRetrievalQuality(
            String query,
            List<Document> retrieved,
            List<Document> relevant
    ) {
        Set<String> retrievedIds = retrieved.stream()
            .map(Document::getId)
            .collect(Collectors.toSet());
        Set<String> relevantIds = relevant.stream()
            .map(Document::getId)
            .collect(Collectors.toSet());
        // Precision: relevant retrieved / total retrieved (0 if nothing retrieved)
        int truePositives = (int) retrievedIds.stream()
            .filter(relevantIds::contains)
            .count();
        double precision = retrieved.isEmpty() ? 0.0 : (double) truePositives / retrieved.size();
        // Recall: relevant retrieved / total relevant (0 if nothing is relevant)
        double recall = relevant.isEmpty() ? 0.0 : (double) truePositives / relevant.size();
        registry.gauge("retrieval.precision", precision);
        registry.gauge("retrieval.recall", recall);
    }
}
Code Examples
Example 1: Basic RAG Pipeline
@Service
public class QAService {

    private final ChatModel chatModel;
    private final VectorStore vectorStore;
    private final PromptTemplate qaTemplate;

    public String answer(String question) {
        // 1. Retrieve context
        List<Document> context = vectorStore.similaritySearch(
            SearchRequest.query(question).withTopK(3)
        );
        // 2. Build prompt
        Prompt prompt = qaTemplate.create(Map.of(
            "question", question,
            "context", context.stream()
                .map(Document::getContent)
                .collect(Collectors.joining("\n\n"))
        ));
        // 3. Generate answer
        return chatModel.call(prompt)
            .getResult()
            .getOutput()
            .getContent();
    }
}
Template: prompts/qa-with-context.st
Answer the question using ONLY the following context.
If the answer is not in the context, say "I don't have that information."
Context:
<context>
Question: <question>
Answer:
✅ Good for: Simple Q&A over documents
❌ Not good for: Multi-hop reasoning (requires agentic retrieval)
Example 2: Filtered Retrieval by Metadata
@Service
public class PolicySearchService {

    private final VectorStore vectorStore;

    public List<Document> searchPolicies(
            String query,
            String department,
            LocalDate after
    ) {
        String filter = String.format(
            "category == '%s' && created_at > '%s'",
            department,
            after.toString()
        );
        return vectorStore.similaritySearch(
            SearchRequest.query(query)
                .withTopK(5)
                .withFilterExpression(filter)
        );
    }
}
✅ Good for: Multi-tenant systems, temporal filtering
❌ Not good for: Complex boolean logic (use hybrid search)
Example 3: Document Ingestion with Deduplication
@Service
public class DocumentIngestionService {

    private final VectorStore vectorStore;
    private final EmbeddingModel embeddingModel;

    public void ingestWithDeduplication(List<Document> documents) {
        for (Document doc : documents) {
            // Check if similar document already exists
            List<Document> similar = vectorStore.similaritySearch(
                SearchRequest.query(doc.getContent())
                    .withTopK(1)
                    .withSimilarityThreshold(0.95) // 95% similar = duplicate
            );
            if (similar.isEmpty()) {
                vectorStore.add(List.of(doc));
                log.info("Ingested new document: {}", doc.getId());
            } else {
                log.info("Skipped duplicate document: {}", doc.getId());
            }
        }
    }
}
✅ Good for: Avoiding duplicate content
❌ Not good for: Large-scale ingestion (one similarity search per document; use content hashing or a Bloom filter)
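For exact duplicates, a content hash avoids the per-document similarity search entirely. This sketch keeps SHA-256 digests in an in-memory set (a Bloom filter would trade exactness for memory at larger scale; near-duplicates still need an embedding-similarity check or MinHash/SimHash):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Exact-duplicate detection via content hashing: O(1) per document
// instead of one vector search per document. Catches only byte-identical
// content, so it complements rather than replaces the similarity check.
public class ContentDeduplicator {
    private final Set<String> seenHashes = new HashSet<>();

    public boolean isDuplicate(String content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            // add() returns false when the hash was already present
            return !seenHashes.add(hex.toString());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

In a deployment with multiple ingestion workers, the hash set would live in a shared store (e.g. a database unique index on the digest) rather than in process memory.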
Example 4: Multi-Query Retrieval
@Service
public class MultiQueryRetrievalService {

    private final ChatModel chatModel;
    private final VectorStore vectorStore;

    public List<Document> retrieve(String query, int topK) {
        // 1. Generate multiple query variations
        List<String> queryVariations = generateQueryVariations(query);
        // 2. Retrieve for each variation
        Set<Document> allResults = new LinkedHashSet<>();
        for (String variation : queryVariations) {
            List<Document> results = vectorStore.similaritySearch(
                SearchRequest.query(variation).withTopK(topK)
            );
            allResults.addAll(results);
        }
        // 3. Return the first topK in insertion order (a reranking step could rescore here)
        return allResults.stream()
            .limit(topK)
            .collect(Collectors.toList());
    }

    private List<String> generateQueryVariations(String query) {
        String prompt = String.format("""
            Generate 3 alternative phrasings of this question:
            Original: %s
            Alternatives (one per line):
            """, query);
        String response = chatModel.call(new Prompt(prompt))
            .getResult()
            .getOutput()
            .getContent();
        return Arrays.asList(response.split("\n"));
    }
}
✅ Good for: Improving recall
❌ Not good for: Latency-sensitive applications (3x slower)
Example 5: Contextual Compression
@Service
public class CompressedRetrievalService {

    private final VectorStore vectorStore;
    private final ChatModel chatModel;

    public String retrieveAndCompress(String query, int topK) {
        // 1. Retrieve more documents than needed
        List<Document> candidates = vectorStore.similaritySearch(
            SearchRequest.query(query).withTopK(topK * 3)
        );
        // 2. Extract only relevant sentences from each document
        List<String> compressedChunks = new ArrayList<>();
        for (Document doc : candidates) {
            String compressed = extractRelevantSentences(query, doc.getContent());
            if (!compressed.isEmpty()) {
                compressedChunks.add(compressed);
            }
        }
        // 3. Return concatenated compressed context
        return compressedChunks.stream()
            .limit(topK)
            .collect(Collectors.joining("\n\n"));
    }

    private String extractRelevantSentences(String query, String document) {
        String prompt = String.format("""
            Extract ONLY the sentences from the document that are relevant to the query.
            Query: %s
            Document:
            %s
            Relevant sentences:
            """, query, document);
        return chatModel.call(new Prompt(prompt))
            .getResult()
            .getOutput()
            .getContent();
    }
}
✅ Good for: Maximizing context window usage
❌ Not good for: Cost-sensitive applications (extra LLM calls)
Anti-Patterns
❌ No Chunking (Embedding Entire Documents)
// DON'T: Embed 10,000-word document as one vector
String fullDoc = loadEntireDocument();
vectorStore.add(List.of(new Document(fullDoc)));
Why: Embedding models have token limits; a single vector for a long document truncates past the limit and blurs its distinct topics into one average.
✅ DO: Chunk into 512-token segments
List<String> chunks = chunkDocument(fullDoc, 512);
for (String chunk : chunks) {
    vectorStore.add(List.of(new Document(chunk, metadata)));
}
❌ Ignoring Metadata
// DON'T: Store only text
vectorStore.add(List.of(new Document(text)));
Why: Cannot filter by date, category, or source.
✅ DO: Include metadata
vectorStore.add(List.of(new Document(
    text,
    Map.of(
        "source", "policy-manual",
        "category", "hr",
        "updated_at", "2026-01-01"
    )
)));
❌ Using Only Semantic Search
// DON'T: Miss exact keyword matches
List<Document> results = vectorStore.similaritySearch(query);
Why: Semantic search may miss specific product names, IDs, codes.
✅ DO: Use hybrid search
List<Document> semanticResults = vectorStore.similaritySearch(query);
List<Document> keywordResults = fullTextSearch.search(query);
List<Document> merged = mergeAndRerank(semanticResults, keywordResults);
❌ No Reranking
// DON'T: Trust vector similarity scores directly
return vectorStore.similaritySearch(SearchRequest.query(query).withTopK(5));
Why: Cosine similarity is a weak proxy for relevance.
✅ DO: Rerank with cross-encoder or LLM
List<Document> candidates = vectorStore.similaritySearch(
    SearchRequest.query(query).withTopK(20)
);
return reranker.rerank(query, candidates, 5);
Testing Strategies
Unit Testing Chunking Logic
@Test
void shouldChunkDocumentWithOverlap() {
    String doc = "Sentence 1. Sentence 2. Sentence 3. Sentence 4.";
    // Assumes an overload taking (sentencesPerChunk, overlapSentences)
    List<String> chunks = chunker.chunk(doc, 2, 1);
    assertEquals(3, chunks.size());
    assertTrue(chunks.get(1).contains("Sentence 2")); // Overlap preserved
}
Integration Testing Retrieval
@SpringBootTest
class RetrievalIntegrationTest {

    @Autowired
    private VectorStore vectorStore;

    @BeforeEach
    void setup() {
        vectorStore.add(List.of(
            new Document("Java is a programming language"),
            new Document("Python is also a programming language"),
            new Document("Spring Boot simplifies Java development")
        ));
    }

    @Test
    void shouldRetrieveRelevantDocuments() {
        List<Document> results = vectorStore.similaritySearch(
            SearchRequest.query("What is Java?").withTopK(2)
        );
        assertEquals(2, results.size());
        assertTrue(results.get(0).getContent().contains("Java"));
    }
}
Golden Dataset Evaluation
@Test
void shouldAchieveTargetRecall() {
    List<TestCase> goldenDataset = loadGoldenDataset();
    int correctRetrievals = 0;
    for (TestCase test : goldenDataset) {
        List<Document> retrieved = vectorStore.similaritySearch(
            SearchRequest.query(test.getQuery()).withTopK(5)
        );
        if (retrieved.stream().anyMatch(d ->
                test.getRelevantDocIds().contains(d.getId()))) {
            correctRetrievals++;
        }
    }
    double recall = (double) correctRetrievals / goldenDataset.size();
    assertTrue(recall >= 0.90, "Recall below 90%: " + recall);
}
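The recall test above counts a query as correct if any relevant document appears in the top k. Mean reciprocal rank (MRR) additionally rewards ranking relevant documents higher, which matters because LLMs attend more reliably to context placed first. A small, dependency-free evaluator sketch:

```java
import java.util.List;
import java.util.Set;

// Mean Reciprocal Rank: for each query, score 1/(rank of the first
// relevant result), or 0 if no relevant result was retrieved; then
// average across all queries. 1.0 means a relevant doc was always first.
public class MrrEvaluator {
    public static double mrr(List<List<String>> retrievedPerQuery,
                             List<Set<String>> relevantPerQuery) {
        double sum = 0;
        for (int q = 0; q < retrievedPerQuery.size(); q++) {
            List<String> retrieved = retrievedPerQuery.get(q);
            Set<String> relevant = relevantPerQuery.get(q);
            for (int rank = 0; rank < retrieved.size(); rank++) {
                if (relevant.contains(retrieved.get(rank))) {
                    sum += 1.0 / (rank + 1); // ranks are 1-based in the formula
                    break;
                }
            }
        }
        return sum / retrievedPerQuery.size();
    }
}
```

Running this over the same golden dataset alongside recall gives a fuller picture: recall can stay flat while MRR drops, signaling that reranking (not retrieval) has regressed.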
Performance Considerations
| Concern | Strategy |
|---|---|
| Latency | Use approximate nearest neighbor (HNSW, IVF); cache frequent queries |
| Scale | Shard vector store; use dimensionality reduction if needed |
| Cost | Embed offline; batch queries; use smaller embedding models |
| Recall | Increase k and rerank; use hybrid search |
| Precision | Use reranking; tune similarity threshold |
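The "cache frequent queries" strategy in the Latency row can be sketched with an access-ordered LinkedHashMap acting as an LRU cache. Keying on the raw query string is an assumption for this sketch; a production system might normalize the query or key on its embedding, and would add a TTL so stale results expire after re-ingestion:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Tiny LRU cache for retrieval results. LinkedHashMap in access-order
// mode moves entries to the tail on get(), so removeEldestEntry evicts
// the least recently used entry once capacity is exceeded.
public class QueryCache<V> {
    private final Map<String, V> cache;

    public QueryCache(int capacity) {
        this.cache = new LinkedHashMap<>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > capacity;
            }
        };
    }

    public synchronized V get(String query) { return cache.get(query); }
    public synchronized void put(String query, V results) { cache.put(query, results); }
    public synchronized int size() { return cache.size(); }
}
```

A cache hit skips both the query-embedding call and the vector search, so even a small cache helps when a few queries dominate traffic.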
References
- Spring AI Documentation - VectorStore
- LangChain RAG Guide
- Pinecone: What is RAG?
- BEIR Benchmark — Retrieval evaluation
Related Skills
- embedding-models.md — Vector generation
- chat-models.md — Response generation
- prompt-templates.md — RAG prompt design
- evaluation.md — Measuring retrieval quality
- ai-ml/rag-patterns.md — Advanced RAG architectures