Caching
Multi-layer caching architecture that reduces latency, cuts provider costs, and accelerates repeated and semantically similar queries.
Cache Architecture
Aurora implements a two-layer caching system. L1 provides sub-millisecond exact-match lookups, while L2 extends coverage to semantically similar queries using vector search. Both layers are independently configurable per workflow.
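The layering described above can be pictured as a fall-through lookup. The following is only a minimal sketch under stated assumptions: the `resolve` function, the `Lookup` type, and the idea that L1 is consulted before L2 and the provider last are illustrative, not Aurora's documented internals.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signature: a cache layer maps a request to a cached response or None.
Lookup = Callable[[dict], Optional[str]]


def resolve(
    request: dict,
    l1_lookup: Lookup,
    l2_lookup: Lookup,
    call_provider: Callable[[dict], str],
) -> Tuple[str, str]:
    """Return (response, source), trying exact match, then semantic match, then the provider."""
    cached = l1_lookup(request)
    if cached is not None:
        return cached, "l1"
    cached = l2_lookup(request)
    if cached is not None:
        return cached, "l2"
    # Neither layer had an answer, so the provider is called (assumed ordering).
    return call_provider(request), "provider"
```

Passing the layers in as plain callables keeps the sketch independent of any particular Redis or vector-store client.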
L1: Exact-Match Cache
The L1 cache uses Redis to perform sub-millisecond key-value lookups. The cache key is a hash of the request parameters (model, messages, temperature, etc.). When a cache hit occurs, the response is returned immediately without calling the provider.
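One way to derive such a key is to serialise the request parameters canonically and hash the result. This is a sketch, not Aurora's actual key scheme: the `cache_key` name, the choice of SHA-256, and the JSON canonicalisation are all assumptions made for illustration.

```python
import hashlib
import json


def cache_key(params: dict) -> str:
    """Derive a deterministic key from request parameters (hypothetical scheme)."""
    # sort_keys and fixed separators make the serialisation independent of dict ordering.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the serialisation is canonical, two requests with the same parameters in a different order map to the same key, while any change to a parameter such as `temperature` produces a different key and therefore a miss.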
- Backend: Redis. Sub-millisecond lookup times.
- Bypass: Clients can skip the cache by sending a Cache-Control: no-cache header.
- Response Header: L1 hits return X-Cache: HIT (exact). Misses return X-Cache: MISS.
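The two headers above can be handled with a couple of small helpers. The function names below are hypothetical, and the sketch assumes headers arrive as a plain string mapping; only the header names and values come from this page.

```python
from typing import Mapping


def wants_bypass(request_headers: Mapping[str, str]) -> bool:
    """True if the client sent Cache-Control: no-cache (header names are case-insensitive)."""
    for name, value in request_headers.items():
        if name.lower() == "cache-control":
            directives = [d.strip().lower() for d in value.split(",")]
            if "no-cache" in directives:
                return True
    return False


def is_l1_hit(response_headers: Mapping[str, str]) -> bool:
    """True if X-Cache reports an exact-match hit."""
    for name, value in response_headers.items():
        if name.lower() == "x-cache":
            return value.strip() == "HIT (exact)"
    return False
```

A client could use `is_l1_hit` to count how many of its responses were served from the exact-match layer.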
L2: Semantic Cache
The L2 semantic cache stores embeddings of previous requests and performs a k-nearest-neighbour (KNN) search at query time. If a semantically similar request has been cached, the stored response is returned, even if the exact text differs.
- Similarity Threshold: 0.92 (default). Minimum cosine similarity score for a KNN match. Higher values require closer semantic matches.
- TTL: 3600s (default). Time-to-live for cached entries. Configurable per workflow.
- Vector Store: Qdrant, pgvector, Pinecone, Weaviate. Pluggable vector database backends. Choose based on your infrastructure.
- Embedding Model: configurable. Which embedding model to use for generating request vectors.
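To make the similarity threshold concrete, the sketch below scores a query embedding against stored embeddings and returns a response only when the best cosine similarity reaches the threshold. It is an illustration under assumptions: a brute-force scan stands in for the vector store's KNN search, `best_match` and its in-memory entry list are hypothetical, and TTL expiry is left out.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(
    query: Sequence[float],
    entries: List[Tuple[Sequence[float], str]],
    threshold: float = 0.92,
) -> Optional[str]:
    """Return the response of the most similar entry, or None if none reaches the threshold."""
    best_score = threshold
    best_response = None
    for vector, response in entries:
        score = cosine_similarity(query, vector)
        if score >= best_score:
            best_score, best_response = score, response
    return best_response
```

Raising the threshold makes the `score >= best_score` check harder to satisfy, which trades hit rate for a lower risk of returning a response to a question that only looks similar.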
cache:
  l2:
    enabled: true
    vector_store: "qdrant"
    similarity_threshold: 0.92
    ttl: 3600
    embedding_model: "text-embedding-3-small"
Model Registry Cache
In addition to request caching, Aurora maintains a model registry cache that stores provider model metadata (available models, pricing, capabilities). This reduces startup time and minimises requests to provider /models endpoints. The registry cache TTL is independently configurable.
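The effect of a registry TTL can be sketched as a small wrapper that refetches metadata only after the previous copy expires. The `RegistryCache` class, its `fetch` callback, and the injectable clock are hypothetical names for illustration; Aurora's own implementation is not shown on this page.

```python
import time
from typing import Callable, Optional


class RegistryCache:
    """Caches the result of a metadata fetch for ttl_seconds (illustrative only)."""

    def __init__(
        self,
        fetch: Callable[[], dict],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[dict] = None
        self._expires_at = 0.0

    def get(self) -> dict:
        """Return cached metadata, calling fetch only when nothing is cached or the TTL has elapsed."""
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

Here `fetch` would wrap whatever call retrieves model metadata from a provider, so repeated `get()` calls within the TTL do not reach the provider at all.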
Cache Analytics
Aurora exposes a dedicated /cache/analytics endpoint that provides hit-rate, miss-rate, eviction-count, and average-lookup-time metrics for both L1 and L2 caches. These metrics help you tune thresholds and TTLs for optimal performance and cost savings.
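As a reminder of how the rate metrics relate to raw counts, the sketch below derives hit rate and miss rate from hit and miss counters. The `CacheCounters` type and its fields are hypothetical; the response schema of /cache/analytics is not documented here, so this does not describe the endpoint's payload.

```python
from dataclasses import dataclass


@dataclass
class CacheCounters:
    """Hypothetical raw counters for one cache layer."""

    hits: int
    misses: int

    @property
    def hit_rate(self) -> float:
        """Fraction of lookups served from the cache; 0.0 when there were no lookups."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def miss_rate(self) -> float:
        """Fraction of lookups that fell through; 0.0 when there were no lookups."""
        total = self.hits + self.misses
        return self.misses / total if total else 0.0
```

Comparing the hit rate before and after a threshold or TTL change is one simple way to judge whether the change helped.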