Embedding Models
Overview
Embedding models convert text into dense vector representations (e.g., float[1536]) that capture semantic meaning. Spring AI’s EmbeddingModel abstraction enables vector-based similarity search, clustering, classification, and retrieval-augmented generation (RAG) pipelines. Embeddings are the foundation of semantic search and are essential for grounding LLM responses in domain-specific knowledge.
Key Concepts
EmbeddingModel Interface
public interface EmbeddingModel extends Model<EmbeddingRequest, EmbeddingResponse> {
EmbeddingResponse call(EmbeddingRequest request);
// Convenience method
default List<Double> embed(String text) {
return call(new EmbeddingRequest(List.of(text), null))
.getResults()
.get(0)
.getOutput();
}
}
Embedding Dimensionality Tradeoffs
┌─────────────────────────────────────────────────────────────┐
│ Embedding Model Comparison │
├─────────────────────────────────────────────────────────────┤
│ │
│ Model Dims Cost/1M Recall Use Case │
│ ───── ──── ──────── ────── ──────── │
│ │
│ text-embedding-3-small 1536 $0.02 Good High volume │
│ text-embedding-3-large 3072 $0.13 Better High accuracy │
│ text-embedding-ada-002 1536 $0.10 Good Legacy │
│ voyage-2 1024 $0.10 Good Specialized │
│ e5-mistral-7b-instruct 4096 Free Best Local/private │
│ │
└─────────────────────────────────────────────────────────────┘
Dimensionality Impact:
- Higher dimensions → Better recall, higher storage cost, slower search
- Lower dimensions → Lower storage, faster search, potentially lower recall
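As a back-of-the-envelope check on the storage tradeoff, raw float32 vectors cost 4 bytes per dimension (index overhead excluded); the class and numbers below are illustrative, not from any library:

```java
public class EmbeddingStorageEstimate {

    /** Approximate raw storage in bytes for float32 vectors (4 bytes per component). */
    public static long storageBytes(long vectorCount, int dims) {
        return vectorCount * dims * 4L;
    }

    public static void main(String[] args) {
        long small = storageBytes(1_000_000, 1536); // e.g. text-embedding-3-small
        long large = storageBytes(1_000_000, 3072); // e.g. text-embedding-3-large
        System.out.printf("1536 dims: %.1f GB, 3072 dims: %.1f GB%n",
            small / 1e9, large / 1e9);
        // 1M vectors: ~6.1 GB at 1536 dims vs ~12.3 GB at 3072 dims
    }
}
```

Doubling dimensions doubles storage and roughly doubles per-comparison search work, which is why the "high volume" column above points at the smaller model.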
Similarity Metrics
┌──────────────────────────────────────────────────────┐
│ Vector Similarity Metrics │
├──────────────────────────────────────────────────────┤
│ │
│ COSINE SIMILARITY (Most Common) │
│ ───────────────── │
│ Range: [-1, 1] (1 = identical, 0 = orthogonal) │
│ Use: Text, semantic search │
│ Formula: A · B / (||A|| * ||B||) │
│ │
│ EUCLIDEAN DISTANCE │
│ ────────────────── │
│ Range: [0, ∞] (0 = identical, ∞ = very different) │
│ Use: Image embeddings, when magnitude matters │
│ Formula: sqrt(Σ(A_i - B_i)²) │
│ │
│ DOT PRODUCT │
│ ─────────── │
│ Range: [-∞, ∞] │
│ Use: When vectors are normalized │
│ Formula: Σ(A_i * B_i) │
│ │
└──────────────────────────────────────────────────────┘
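The three formulas above translate directly into code; a minimal plain-Java sketch (the class name is illustrative, not a Spring AI type):

```java
public final class VectorMath {

    /** Cosine similarity: A · B / (||A|| * ||B||), in [-1, 1]. */
    public static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Euclidean distance: sqrt(Σ(A_i - B_i)²), in [0, ∞). */
    public static double euclideanDistance(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Dot product: Σ(A_i * B_i). Equals cosine similarity when both vectors are unit length. */
    public static double dotProduct(float[] a, float[] b) {
        double dot = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
        }
        return dot;
    }
}
```

For pre-normalized vectors, dot product and cosine similarity return the same value, which is why many vector stores normalize at write time and then use the cheaper dot product at query time.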
Best Practices
1. Choose Embedding Model Based on Use Case
Match model dimensions and cost to retrieval requirements.
@Configuration
public class EmbeddingConfig {
@Bean
@ConditionalOnProperty(name = "embedding.mode", havingValue = "high-volume")
public EmbeddingModel smallEmbeddingModel() {
return new OpenAiEmbeddingModel(
    openAiApi,
    MetadataMode.EMBED,
    OpenAiEmbeddingOptions.builder()
        .withModel("text-embedding-3-small") // 1536 dims, cheap
        .build()
);
}
@Bean
@ConditionalOnProperty(name = "embedding.mode", havingValue = "high-accuracy")
public EmbeddingModel largeEmbeddingModel() {
return new OpenAiEmbeddingModel(
    openAiApi,
    MetadataMode.EMBED,
    OpenAiEmbeddingOptions.builder()
        .withModel("text-embedding-3-large") // 3072 dims, better recall
        .build()
);
}
}
2. Batch Embeddings for Efficiency
Embed multiple documents in a single API call to reduce latency and cost.
@Service
public class DocumentEmbeddingService {
private final EmbeddingModel embeddingModel;
public List<float[]> embedBatch(List<String> documents) {
// Single API call for up to 2048 documents (OpenAI limit)
EmbeddingResponse response = embeddingModel.call(
new EmbeddingRequest(documents, null)
);
return response.getResults().stream()
.map(result -> toFloatArray(result.getOutput()))
.collect(Collectors.toList());
}
}
3. Normalize Vectors for Cosine Similarity
Most vector databases expect normalized vectors for cosine similarity.
public float[] normalizeVector(List<Double> vector) {
    double magnitude = Math.sqrt(
        vector.stream().mapToDouble(v -> v * v).sum()
    );
    float[] normalized = new float[vector.size()];
    for (int i = 0; i < vector.size(); i++) {
        normalized[i] = (float) (vector.get(i) / magnitude);
    }
    return normalized;
}
4. Cache Embeddings to Avoid Recomputation
Embeddings for static content should be computed once and stored.
@Service
public class CachedEmbeddingService {
private final EmbeddingModel embeddingModel;
private final EmbeddingCache cache;
public float[] embed(String text) {
return cache.computeIfAbsent(text, key -> {
List<Double> embedding = embeddingModel.embed(text);
return toFloatArray(embedding);
});
}
}
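EmbeddingCache in the snippet above is not a Spring AI type; a minimal in-memory sketch backed by ConcurrentHashMap, keyed by a content hash so long texts don't bloat the key set, might look like this (a production setup would more likely use Caffeine or Redis for eviction and persistence):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class EmbeddingCache {

    private final ConcurrentHashMap<String, float[]> store = new ConcurrentHashMap<>();

    /** Returns the cached embedding, invoking the loader at most once per distinct text. */
    public float[] computeIfAbsent(String text, Function<String, float[]> loader) {
        return store.computeIfAbsent(hash(text), key -> loader.apply(text));
    }

    private static String hash(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(text.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```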
5. Monitor Embedding Drift for Document Updates
When documents change, re-embed and update the vector store.
@Scheduled(cron = "0 0 2 * * ?") // Daily at 2 AM
public void updateStaleEmbeddings() {
List<Document> updatedDocs = documentRepository.findUpdatedSince(
LocalDateTime.now().minusDays(1)
);
for (Document doc : updatedDocs) {
float[] embedding = embeddingService.embed(doc.getContent());
vectorStore.upsert(doc.getId(), embedding, doc.getMetadata());
}
log.info("Updated {} stale embeddings", updatedDocs.size());
}
Code Examples
Example 1: Basic Text Embedding
@Service
public class BasicEmbeddingService {
private final EmbeddingModel embeddingModel;
public float[] embedText(String text) {
List<Double> embedding = embeddingModel.embed(text);
return toFloatArray(embedding);
}
public double cosineSimilarity(String text1, String text2) {
float[] vec1 = embedText(text1);
float[] vec2 = embedText(text2);
return dotProduct(vec1, vec2); // Assumes normalized vectors
}
private float[] toFloatArray(List<Double> list) {
float[] array = new float[list.size()];
for (int i = 0; i < list.size(); i++) {
array[i] = list.get(i).floatValue();
}
return array;
}
}
✅ Good for: Simple similarity checks, one-off embeddings
❌ Not good for: High-volume production (no batching)
Example 2: Batch Embedding for RAG Ingestion
@Service
public class DocumentIngestionService {
private final EmbeddingModel embeddingModel;
private final VectorStore vectorStore;
public void ingestDocuments(List<Document> documents) {
// Chunk documents
List<String> chunks = documents.stream()
.flatMap(doc -> chunkDocument(doc, 512))
.collect(Collectors.toList());
// Batch embed (up to 2048 at a time for OpenAI)
for (int i = 0; i < chunks.size(); i += 2048) {
List<String> batch = chunks.subList(
i, Math.min(i + 2048, chunks.size())
);
EmbeddingResponse response = embeddingModel.call(
new EmbeddingRequest(batch, null)
);
// Store in vector database (one add() call per embedded batch)
List<org.springframework.ai.vectorstore.Document> stored = new ArrayList<>();
for (int j = 0; j < batch.size(); j++) {
    org.springframework.ai.vectorstore.Document doc =
        new org.springframework.ai.vectorstore.Document(
            batch.get(j),
            Map.of("chunk_index", i + j)
        );
    doc.setEmbedding(response.getResults().get(j).getOutput());
    stored.add(doc);
}
vectorStore.add(stored);
}
}
}
✅ Good for: RAG pipelines, bulk ingestion
❌ Not good for: Real-time embedding (use pre-computed vectors)
Example 3: Semantic Deduplication
@Service
public class DeduplicationService {
private final EmbeddingModel embeddingModel;
private static final double SIMILARITY_THRESHOLD = 0.95;
public List<Document> deduplicate(List<Document> documents) {
    List<Document> unique = new ArrayList<>();
    List<float[]> uniqueEmbeddings = new ArrayList<>();
    for (Document doc : documents) {
        float[] embedding = embedText(doc.getContent());
        // Compare against cached embeddings; re-embedding each kept
        // document on every pass would multiply API calls and cost
        boolean isDuplicate = uniqueEmbeddings.stream()
            .anyMatch(existing ->
                cosineSimilarity(embedding, existing) > SIMILARITY_THRESHOLD);
        if (!isDuplicate) {
            unique.add(doc);
            uniqueEmbeddings.add(embedding);
        }
    }
    return unique;
}
}
✅ Good for: Content moderation, duplicate detection
❌ Not good for: Large datasets (O(n²) complexity; use vector DB)
Example 4: Multi-Model Fallback
@Service
public class ResilientEmbeddingService {
private final EmbeddingModel primaryModel;
private final EmbeddingModel fallbackModel;
@CircuitBreaker(name = "embedding", fallbackMethod = "fallbackEmbed")
public List<Double> embed(String text) {
return primaryModel.embed(text);
}
private List<Double> fallbackEmbed(String text, Exception e) {
log.warn("Primary embedding model failed, using fallback", e);
return fallbackModel.embed(text);
}
}
✅ Good for: Production reliability
❌ Not good for: Models with mismatched dimensions — fallback vectors are incompatible with the primary index unless you re-index everything
Example 5: Hybrid Search (Keyword + Semantic)
@Service
public class HybridSearchService {
private final EmbeddingModel embeddingModel;
private final VectorStore vectorStore;
private final FullTextSearchEngine fullTextSearch;
public List<Document> search(String query, int topK) {
// 1. Semantic search (the vector store embeds the query text internally)
List<Document> semanticResults = vectorStore.similaritySearch(
    SearchRequest.query(query).withTopK(topK)
);
// 2. Keyword search via full-text index
List<Document> keywordResults = fullTextSearch.search(query, topK);
// 3. Merge and rerank
return mergeAndRerank(semanticResults, keywordResults, query, topK);
}
private List<Document> mergeAndRerank(
List<Document> semantic,
List<Document> keyword,
String query,
int topK
) {
// Reciprocal Rank Fusion (RRF)
Map<String, Double> scores = new HashMap<>();
for (int i = 0; i < semantic.size(); i++) {
String id = semantic.get(i).getId();
scores.merge(id, 1.0 / (60 + i), Double::sum);
}
for (int i = 0; i < keyword.size(); i++) {
String id = keyword.get(i).getId();
scores.merge(id, 1.0 / (60 + i), Double::sum);
}
return scores.entrySet().stream()
.sorted(Map.Entry.<String, Double>comparingByValue().reversed())
.limit(topK)
.map(entry -> findDocumentById(entry.getKey()))
.collect(Collectors.toList());
}
}
✅ Good for: Best of both worlds (semantic + keyword)
❌ Not good for: Real-time (needs caching/pre-computation)
Anti-Patterns
❌ Re-Embedding Static Content on Every Query
// DON'T: Recompute embeddings for static documents
public List<Document> search(String query) {
float[] queryEmbedding = embedText(query);
List<Document> allDocs = documentRepository.findAll();
for (Document doc : allDocs) {
doc.setEmbedding(embedText(doc.getContent())); // Wasteful!
}
return findTopK(queryEmbedding, allDocs);
}
Why: Embeddings are deterministic; recomputing wastes time and money.
✅ DO: Pre-compute and store embeddings
// At ingestion time (the store computes and persists the embedding)
vectorStore.add(List.of(new Document(content, metadata)));
// At query time
return vectorStore.similaritySearch(query);
❌ Using Wrong Similarity Metric
// DON'T: Use Euclidean distance on non-normalized vectors
double distance = euclideanDistance(vec1, vec2);
Why: Euclidean distance is sensitive to vector magnitude; cosine similarity is better for text.
✅ DO: Use cosine similarity for text embeddings
double similarity = cosineSimilarity(vec1, vec2);
❌ Embedding Full Documents Without Chunking
// DON'T: Embed entire 50-page document
String fullDoc = readEntireDocument(); // 50,000 tokens
float[] embedding = embedText(fullDoc); // Truncated at 8192 tokens!
Why: Embedding models have token limits (e.g., 8192 for OpenAI); long texts lose information.
✅ DO: Chunk documents before embedding
List<String> chunks = chunkDocument(fullDoc, 512); // 512 tokens each
for (String chunk : chunks) {
vectorStore.add(new Document(chunk, metadata, embedText(chunk)));
}
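The chunkDocument helper is assumed throughout this page but never defined. A minimal whitespace-based sketch follows (String → List<String>; the stream variant used in Example 2 would wrap it). Note that word counts only approximate token counts — the 512-token limits above would need a real tokenizer such as jtokkit:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Chunker {

    /**
     * Splits text into chunks of at most maxWords words, repeating
     * `overlap` words between neighbors so context spans chunk borders.
     */
    public static List<String> chunkDocument(String text, int maxWords, int overlap) {
        if (overlap >= maxWords) {
            throw new IllegalArgumentException("overlap must be smaller than maxWords");
        }
        String[] words = text.split("\\s+");
        List<String> chunks = new ArrayList<>();
        int step = maxWords - overlap;
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + maxWords, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) break; // last chunk reached
        }
        return chunks;
    }
}
```

Sentence- or paragraph-aware splitters generally retrieve better than fixed word windows; this sketch only illustrates the windowing mechanics.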
❌ Ignoring Embedding Model Version Changes
// DON'T: Switch embedding models without re-indexing
// Old: text-embedding-ada-002 (1536 dims)
// New: text-embedding-3-large (3072 dims)
// Vectors are incompatible!
Why: Different models produce incomparable embeddings; similarity search fails.
✅ DO: Version your vector stores
@Bean("embeddingModelV2")
public EmbeddingModel newModel() { ... }
@Bean("vectorStoreV2")
public VectorStore newVectorStore() { ... }
// Migrate data with re-embedding
migrationService.reindexWithNewModel(embeddingModelV2, vectorStoreV2);
Testing Strategies
Unit Testing with Fixed Vectors
@Test
void shouldComputeCosineSimilarity() {
float[] vec1 = {1.0f, 0.0f, 0.0f};
float[] vec2 = {1.0f, 0.0f, 0.0f};
double similarity = cosineSimilarity(vec1, vec2);
assertEquals(1.0, similarity, 0.001); // Identical vectors
}
Integration Testing with Local Models
@SpringBootTest
@TestPropertySource(properties = {
"spring.ai.ollama.base-url=http://localhost:11434",
"spring.ai.ollama.embedding.model=nomic-embed-text"
})
class EmbeddingServiceIntegrationTest {
@Autowired
private EmbeddingModel embeddingModel;
@Test
void shouldEmbedText() {
List<Double> embedding = embeddingModel.embed("Hello world");
assertNotNull(embedding);
assertEquals(768, embedding.size()); // nomic-embed-text dims
}
}
Similarity Threshold Tuning
@Test
void shouldTuneSimilarityThreshold() {
List<Pair<String, String>> goldenPairs = loadGoldenDataset();
for (double threshold = 0.7; threshold <= 0.99; threshold += 0.01) {
int truePositives = 0;
int falsePositives = 0;
for (Pair<String, String> pair : goldenPairs) {
double similarity = embeddingService.similarity(
pair.getFirst(), pair.getSecond()
);
if (similarity >= threshold) {
if (pair.isMatch()) truePositives++;
else falsePositives++;
}
}
double precision = (double) truePositives / (truePositives + falsePositives);
log.info("Threshold: {}, Precision: {}", threshold, precision);
}
}
Performance Considerations
| Concern | Strategy |
|---|---|
| Latency | Batch embeddings; use async calls; cache results |
| Cost | Choose smaller models (1536 vs 3072 dims); batch requests |
| Storage | Use dimensionality reduction (PCA) if recall permits |
| Recall | Use larger models or hybrid search (keyword + semantic) |
| Index Size | Shard vector store; use approximate NN (HNSW, IVF) |
Observability
Metrics to Track
@Component
@Aspect
public class EmbeddingMetrics {
private final MeterRegistry registry;
@Around("execution(* org.springframework.ai.embedding.EmbeddingModel.call(..))")
public Object trackEmbedding(ProceedingJoinPoint joinPoint) throws Throwable {
Timer.Sample sample = Timer.start(registry);
try {
EmbeddingResponse response = (EmbeddingResponse) joinPoint.proceed();
// Track batch size
registry.counter("embedding.batch.size")
.increment(response.getResults().size());
// Track total dimensions
int dims = response.getResults().get(0).getOutput().size();
registry.gauge("embedding.dimensions", dims);
return response;
} finally {
sample.stop(registry.timer("embedding.duration"));
}
}
}
References
- Spring AI Documentation - Embedding Models
- OpenAI Embeddings Guide
- Pinecone: Understanding Embeddings
- MTEB Leaderboard — Embedding model benchmarks
Related Skills
- chat-models.md — LLM text generation
- retrieval.md — VectorStore and similarity search
- prompt-templates.md — Prompt engineering for RAG
- observability.md — Metrics and tracing