Embedding Models
Overview
Embedding models convert text into dense vector representations (e.g., float[1536]) that capture semantic meaning. Spring AI’s EmbeddingModel abstraction enables vector-based similarity search, clustering, classification, and retrieval-augmented generation (RAG) pipelines. Embeddings are the foundation of semantic search and are essential for grounding LLM responses in domain-specific knowledge.
Key Concepts
EmbeddingModel Interface
public interface EmbeddingModel extends Model<EmbeddingRequest, EmbeddingResponse> {
EmbeddingResponse call(EmbeddingRequest request);
// Convenience method
default List<Double> embed(String text) {
return call(new EmbeddingRequest(List.of(text), null))
.getResults()
.get(0)
.getOutput();
}
}
Embedding Dimensionality Tradeoffs
┌─────────────────────────────────────────────────────────────┐
│ Embedding Model Comparison │
├─────────────────────────────────────────────────────────────┤
│ │
│ Model Dims Cost/1M Recall Use Case │
│ ───── ──── ──────── ────── ──────── │
│ │
│ text-embedding-3-small 1536 $0.02 Good High volume │
│ text-embedding-3-large 3072 $0.13 Better High accuracy │
│ text-embedding-ada-002 1536 $0.10 Good Legacy │
│ voyage-2 1024 $0.10 Good Specialized │
│ e5-mistral-7b-instruct 4096 Free Best Local/private │
│ │
└─────────────────────────────────────────────────────────────┘
Dimensionality Impact:
- Higher dimensions → Better recall, higher storage cost, slower search
- Lower dimensions → Lower storage, faster search, potentially lower recall
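As a back-of-the-envelope check on the storage tradeoff, raw float32 vectors cost 4 bytes per dimension (index overhead excluded); the class and numbers below are illustrative, not from any library:

```java
public class EmbeddingStorageEstimate {

    /** Approximate raw storage in bytes for float32 vectors (4 bytes per component). */
    public static long storageBytes(long vectorCount, int dims) {
        return vectorCount * dims * 4L;
    }

    public static void main(String[] args) {
        long small = storageBytes(1_000_000, 1536); // e.g. text-embedding-3-small
        long large = storageBytes(1_000_000, 3072); // e.g. text-embedding-3-large
        System.out.printf("1536 dims: %.1f GB, 3072 dims: %.1f GB%n",
            small / 1e9, large / 1e9);
        // 1M vectors: ~6.1 GB at 1536 dims vs ~12.3 GB at 3072 dims
    }
}
```

Doubling dimensions doubles storage and roughly doubles per-comparison search work, which is why the "high volume" column above points at the smaller model.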
Similarity Metrics
┌──────────────────────────────────────────────────────┐
│ Vector Similarity Metrics │
├──────────────────────────────────────────────────────┤
│ │
│ COSINE SIMILARITY (Most Common) │
│ ───────────────── │
│ Range: [-1, 1] (1 = identical, 0 = orthogonal) │
│ Use: Text, semantic search │
│ Formula: A · B / (||A|| * ||B||) │
│ │
│ EUCLIDEAN DISTANCE │
│ ────────────────── │
│ Range: [0, ∞] (0 = identical, ∞ = very different) │
│ Use: Image embeddings, when magnitude matters │
│ Formula: sqrt(Σ(A_i - B_i)²) │
│ │
│ DOT PRODUCT │
│ ─────────── │
│ Range: [-∞, ∞] │
│ Use: When vectors are normalized │
│ Formula: Σ(A_i * B_i) │
│ │
└──────────────────────────────────────────────────────┘
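The three formulas above translate directly into code; a minimal plain-Java sketch (the class name is illustrative, not a Spring AI type):

```java
public final class VectorMath {

    /** Cosine similarity: A · B / (||A|| * ||B||), in [-1, 1]. */
    public static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Euclidean distance: sqrt(Σ(A_i - B_i)²), in [0, ∞). */
    public static double euclideanDistance(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Dot product: Σ(A_i * B_i). Equals cosine similarity when both vectors are unit length. */
    public static double dotProduct(float[] a, float[] b) {
        double dot = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
        }
        return dot;
    }
}
```

For pre-normalized vectors, dot product and cosine similarity return the same value, which is why many vector stores normalize at write time and then use the cheaper dot product at query time.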
Best Practices
1. Choose Embedding Model Based on Use Case
Match model dimensions and cost to retrieval requirements.
@Configuration
public class EmbeddingConfig {
@Bean
@ConditionalOnProperty(name = "embedding.mode", havingValue = "high-volume")
public EmbeddingModel smallEmbeddingModel() {
return new OpenAiEmbeddingModel(
    openAiApi,
    MetadataMode.EMBED,
    OpenAiEmbeddingOptions.builder()
        .withModel("text-embedding-3-small") // 1536 dims, cheap
        .build()
);
}
@Bean
@ConditionalOnProperty(name = "embedding.mode", havingValue = "high-accuracy")
public EmbeddingModel largeEmbeddingModel() {
return new OpenAiEmbeddingModel(
    openAiApi,
    MetadataMode.EMBED,
    OpenAiEmbeddingOptions.builder()
        .withModel("text-embedding-3-large") // 3072 dims, better recall
        .build()
);
}
}
2. Batch Embeddings for Efficiency
Embed multiple documents in a single API call to reduce latency and cost.
@Service
public class DocumentEmbeddingService {
private final EmbeddingModel embeddingModel;
public List<float[]> embedBatch(List<String> documents) {
// Single API call for up to 2048 documents (OpenAI limit)
EmbeddingResponse response = embeddingModel.call(
new EmbeddingRequest(documents, null)
);
return response.getResults().stream()
.map(result -> toFloatArray(result.getOutput()))
.collect(Collectors.toList());
}
}
3. Normalize Vectors for Cosine Similarity
Most vector databases expect normalized vectors for cosine similarity.
public float[] normalizeVector(List<Double> vector) {
    double magnitude = Math.sqrt(
        vector.stream().mapToDouble(v -> v * v).sum()
    );
    float[] normalized = new float[vector.size()];
    for (int i = 0; i < vector.size(); i++) {
        normalized[i] = (float) (vector.get(i) / magnitude);
    }
    return normalized;
}
4. Cache Embeddings to Avoid Recomputation
Embeddings for static content should be computed once and stored.
@Service
public class CachedEmbeddingService {
private final EmbeddingModel embeddingModel;
private final EmbeddingCache cache;
public float[] embed(String text) {
return cache.computeIfAbsent(text, key -> {
List<Double> embedding = embeddingModel.embed(text);
return toFloatArray(embedding);
});
}
}
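EmbeddingCache in the snippet above is not a Spring AI type; a minimal in-memory sketch backed by ConcurrentHashMap, keyed by a content hash so long texts don't bloat the key set, might look like this (a production setup would more likely use Caffeine or Redis for eviction and persistence):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class EmbeddingCache {

    private final ConcurrentHashMap<String, float[]> store = new ConcurrentHashMap<>();

    /** Returns the cached embedding, invoking the loader at most once per distinct text. */
    public float[] computeIfAbsent(String text, Function<String, float[]> loader) {
        return store.computeIfAbsent(hash(text), key -> loader.apply(text));
    }

    private static String hash(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(text.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```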
5. Monitor Embedding Drift for Document Updates
When documents change, re-embed and update the vector store.
@Scheduled(cron = "0 0 2 * * ?") // Daily at 2 AM
public void updateStaleEmbeddings() {
List<Document> updatedDocs = documentRepository.findUpdatedSince(
LocalDateTime.now().minusDays(1)
);
for (Document doc : updatedDocs) {
float[] embedding = embeddingService.embed(doc.getContent());
vectorStore.upsert(doc.getId(), embedding, doc.getMetadata());
}
log.info("Updated {} stale embeddings", updatedDocs.size());
}
Code Examples
Example 1: Basic Text Embedding
@Service
public class BasicEmbeddingService {
private final EmbeddingModel embeddingModel;
public float[] embedText(String text) {
List<Double> embedding = embeddingModel.embed(text);
return toFloatArray(embedding);
}
public double cosineSimilarity(String text1, String text2) {
float[] vec1 = embedText(text1);
float[] vec2 = embedText(text2);
return dotProduct(vec1, vec2); // Assumes normalized vectors
}
private float[] toFloatArray(List<Double> list) {
float[] array = new float[list.size()];
for (int i = 0; i < list.size(); i++) {
array[i] = list.get(i).floatValue();
}
return array;
}
}
✅ Good for: Simple similarity checks, one-off embeddings
❌ Not good for: High-volume production (no batching)
Example 2: Batch Embedding for RAG Ingestion
@Service
public class DocumentIngestionService {
private final EmbeddingModel embeddingModel;
private final VectorStore vectorStore;
public void ingestDocuments(List<Document> documents) {
// Chunk documents
List<String> chunks = documents.stream()
.flatMap(doc -> chunkDocument(doc, 512))
.collect(Collectors.toList());
// Batch embed (up to 2048 at a time for OpenAI)
for (int i = 0; i < chunks.size(); i += 2048) {
List<String> batch = chunks.subList(
i, Math.min(i + 2048, chunks.size())
);
EmbeddingResponse response = embeddingModel.call(
new EmbeddingRequest(batch, null)
);
// Store in vector database (one add() call per embedded batch)
List<org.springframework.ai.vectorstore.Document> stored = new ArrayList<>();
for (int j = 0; j < batch.size(); j++) {
    org.springframework.ai.vectorstore.Document doc =
        new org.springframework.ai.vectorstore.Document(
            batch.get(j),
            Map.of("chunk_index", i + j)
        );
    doc.setEmbedding(response.getResults().get(j).getOutput());
    stored.add(doc);
}
vectorStore.add(stored);
}
}
}
✅ Good for: RAG pipelines, bulk ingestion
❌ Not good for: Real-time embedding (use pre-computed vectors)
Example 3: Semantic Deduplication
@Service
public class DeduplicationService {
private final EmbeddingModel embeddingModel;
private static final double SIMILARITY_THRESHOLD = 0.95;
public List<Document> deduplicate(List<Document> documents) {
    List<Document> unique = new ArrayList<>();
    List<float[]> uniqueEmbeddings = new ArrayList<>();
    for (Document doc : documents) {
        float[] embedding = embedText(doc.getContent());
        // Compare against cached embeddings; re-embedding each kept
        // document on every pass would multiply API calls and cost
        boolean isDuplicate = uniqueEmbeddings.stream()
            .anyMatch(existing ->
                cosineSimilarity(embedding, existing) > SIMILARITY_THRESHOLD);
        if (!isDuplicate) {
            unique.add(doc);
            uniqueEmbeddings.add(embedding);
        }
    }
    return unique;
}
}
✅ Good for: Content moderation, duplicate detection
❌ Not good for: Large datasets (O(n²) complexity; use vector DB)
Example 4: Multi-Model Fallback
@Service
public class ResilientEmbeddingService {
private final EmbeddingModel primaryModel;
private final EmbeddingModel fallbackModel;
@CircuitBreaker(name = "embedding", fallbackMethod = "fallbackEmbed")
public List<Double> embed(String text) {
return primaryModel.embed(text);
}
private List<Double> fallbackEmbed(String text, Exception e) {
log.warn("Primary embedding model failed, using fallback", e);
return fallbackModel.embed(text);
}
}
✅ Good for: Production reliability
❌ Not good for: Models with mismatched dimensions — fallback vectors are incompatible with the primary index unless you re-index everything
Example 5: Hybrid Search (Keyword + Semantic)
@Service
public class HybridSearchService {
private final EmbeddingModel embeddingModel;
private final VectorStore vectorStore;
private final FullTextSearchEngine fullTextSearch;
public List<Document> search(String query, int topK) {
// 1. Semantic search (the vector store embeds the query text internally)
List<Document> semanticResults = vectorStore.similaritySearch(
    SearchRequest.query(query).withTopK(topK)
);
// 2. Keyword search via full-text index
List<Document> keywordResults = fullTextSearch.search(query, topK);
// 3. Merge and rerank
return mergeAndRerank(semanticResults, keywordResults, query, topK);
}
private List<Document> mergeAndRerank(
List<Document> semantic,
List<Document> keyword,
String query,
int topK
) {
// Reciprocal Rank Fusion (RRF)
Map<String, Double> scores = new HashMap<>();
for (int i = 0; i < semantic.size(); i++) {
String id = semantic.get(i).getId();
scores.merge(id, 1.0 / (60 + i), Double::sum);
}
for (int i = 0; i < keyword.size(); i++) {
String id = keyword.get(i).getId();
scores.merge(id, 1.0 / (60 + i), Double::sum);
}
return scores.entrySet().stream()
.sorted(Map.Entry.<String, Double>comparingByValue().reversed())
.limit(topK)
.map(entry -> findDocumentById(entry.getKey()))
.collect(Collectors.toList());
}
}
✅ Good for: Best of both worlds (semantic + keyword)
❌ Not good for: Real-time (needs caching/pre-computation)
Anti-Patterns
❌ Re-Embedding Static Content on Every Query
// DON'T: Recompute embeddings for static documents
public List<Document> search(String query) {
float[] queryEmbedding = embedText(query);
List<Document> allDocs = documentRepository.findAll();
for (Document doc : allDocs) {
doc.setEmbedding(embedText(doc.getContent())); // Wasteful!
}
return findTopK(queryEmbedding, allDocs);
}
Why: Embeddings are deterministic; recomputing wastes time and money.
✅ DO: Pre-compute and store embeddings
// At ingestion time (the store computes and persists the embedding)
vectorStore.add(List.of(new Document(content, metadata)));
// At query time
return vectorStore.similaritySearch(query);
❌ Using Wrong Similarity Metric
// DON'T: Use Euclidean distance on non-normalized vectors
double distance = euclideanDistance(vec1, vec2);
Why: Euclidean distance is sensitive to vector magnitude; cosine similarity is better for text.
✅ DO: Use cosine similarity for text embeddings
double similarity = cosineSimilarity(vec1, vec2);
❌ Embedding Full Documents Without Chunking
// DON'T: Embed entire 50-page document
String fullDoc = readEntireDocument(); // 50,000 tokens
float[] embedding = embedText(fullDoc); // Truncated at 8192 tokens!
Why: Embedding models have token limits (e.g., 8192 for OpenAI); long texts lose information.
✅ DO: Chunk documents before embedding
List<String> chunks = chunkDocument(fullDoc, 512); // 512 tokens each
for (String chunk : chunks) {
vectorStore.add(new Document(chunk, metadata, embedText(chunk)));
}
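The chunkDocument helper is assumed throughout this page but never defined. A minimal whitespace-based sketch follows (String → List<String>; the stream variant used in Example 2 would wrap it). Note that word counts only approximate token counts — the 512-token limits above would need a real tokenizer such as jtokkit:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Chunker {

    /**
     * Splits text into chunks of at most maxWords words, repeating
     * `overlap` words between neighbors so context spans chunk borders.
     */
    public static List<String> chunkDocument(String text, int maxWords, int overlap) {
        if (overlap >= maxWords) {
            throw new IllegalArgumentException("overlap must be smaller than maxWords");
        }
        String[] words = text.split("\\s+");
        List<String> chunks = new ArrayList<>();
        int step = maxWords - overlap;
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + maxWords, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) break; // last chunk reached
        }
        return chunks;
    }
}
```

Sentence- or paragraph-aware splitters generally retrieve better than fixed word windows; this sketch only illustrates the windowing mechanics.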
❌ Ignoring Embedding Model Version Changes
// DON'T: Switch embedding models without re-indexing
// Old: text-embedding-ada-002 (1536 dims)
// New: text-embedding-3-large (3072 dims)
// Vectors are incompatible!
Why: Different models produce incomparable embeddings; similarity search fails.
✅ DO: Version your vector stores
@Bean("embeddingModelV2")
public EmbeddingModel newModel() { ... }
@Bean("vectorStoreV2")
public VectorStore newVectorStore() { ... }
// Migrate data with re-embedding
migrationService.reindexWithNewModel(embeddingModelV2, vectorStoreV2);
Testing Strategies
Unit Testing with Fixed Vectors
@Test
void shouldComputeCosineSimilarity() {
float[] vec1 = {1.0f, 0.0f, 0.0f};
float[] vec2 = {1.0f, 0.0f, 0.0f};
double similarity = cosineSimilarity(vec1, vec2);
assertEquals(1.0, similarity, 0.001); // Identical vectors
}
Integration Testing with Local Models
@SpringBootTest
@TestPropertySource(properties = {
"spring.ai.ollama.base-url=http://localhost:11434",
"spring.ai.ollama.embedding.model=nomic-embed-text"
})
class EmbeddingServiceIntegrationTest {
@Autowired
private EmbeddingModel embeddingModel;
@Test
void shouldEmbedText() {
List<Double> embedding = embeddingModel.embed("Hello world");
assertNotNull(embedding);
assertEquals(768, embedding.size()); // nomic-embed-text dims
}
}
Similarity Threshold Tuning
@Test
void shouldTuneSimilarityThreshold() {
List<Pair<String, String>> goldenPairs = loadGoldenDataset();
for (double threshold = 0.7; threshold <= 0.99; threshold += 0.01) {
int truePositives = 0;
int falsePositives = 0;
for (Pair<String, String> pair : goldenPairs) {
double similarity = embeddingService.similarity(
pair.getFirst(), pair.getSecond()
);
if (similarity >= threshold) {
if (pair.isMatch()) truePositives++;
else falsePositives++;
}
}
double precision = (double) truePositives / (truePositives + falsePositives);
log.info("Threshold: {}, Precision: {}", threshold, precision);
}
}
Performance Considerations
| Concern | Strategy |
|---|---|
| Latency | Batch embeddings; use async calls; cache results |
| Cost | Choose smaller models (1536 vs 3072 dims); batch requests |
| Storage | Use dimensionality reduction (PCA) if recall permits |
| Recall | Use larger models or hybrid search (keyword + semantic) |
| Index Size | Shard vector store; use approximate NN (HNSW, IVF) |
Observability
Metrics to Track
@Component
@Aspect
public class EmbeddingMetrics {
private final MeterRegistry registry;
@Around("execution(* org.springframework.ai.embedding.EmbeddingModel.call(..))")
public Object trackEmbedding(ProceedingJoinPoint joinPoint) throws Throwable {
Timer.Sample sample = Timer.start(registry);
try {
EmbeddingResponse response = (EmbeddingResponse) joinPoint.proceed();
// Track batch size
registry.counter("embedding.batch.size")
.increment(response.getResults().size());
// Track total dimensions
int dims = response.getResults().get(0).getOutput().size();
registry.gauge("embedding.dimensions", dims);
return response;
} finally {
sample.stop(registry.timer("embedding.duration"));
}
}
}
References
- Spring AI Documentation - Embedding Models
- OpenAI Embeddings Guide
- Pinecone: Understanding Embeddings
- MTEB Leaderboard — Embedding model benchmarks
Related Skills
- chat-models.md — LLM text generation
- retrieval.md — VectorStore and similarity search
- prompt-templates.md — Prompt engineering for RAG
- observability.md — Metrics and tracing