Caching

A multi-layer caching architecture that reduces latency and provider costs for repeated and semantically similar queries.

Cache Architecture

Aurora implements a two-layer caching system. L1 provides sub-millisecond exact-match lookups, while L2 extends coverage to semantically similar queries using vector search. Both layers are independently configurable per workflow.

L1: Exact-Match Cache

The L1 cache uses Redis to perform sub-millisecond key-value lookups. The cache key is a hash of the request parameters, such as model, messages, and temperature, so only requests with identical parameters share a key. When a cache hit occurs, the response is returned immediately without calling the provider.

Backend: Redis. Sub-millisecond lookup times.
Bypass: Clients can skip the cache by sending a Cache-Control: no-cache header.
Response Header: L1 hits return X-Cache: HIT (exact). Misses return X-Cache: MISS.
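
As a rough illustration of how an exact-match key could be derived, the sketch below serialises the request parameters canonically and hashes them. The SHA-256 choice, the "aurora:l1:" key prefix, and the function names are assumptions made for this example, not Aurora's documented internals, and a plain dict stands in for Redis.

```python
import hashlib
import json


def make_cache_key(params: dict) -> str:
    """Derive a deterministic key from request parameters (hypothetical scheme)."""
    # sort_keys makes the serialisation independent of dict insertion order,
    # so logically identical requests map to the same key.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "aurora:l1:" + digest  # prefix is an assumption for illustration


def lookup(store: dict, params: dict):
    """Return (response, X-Cache value). `store` stands in for Redis."""
    key = make_cache_key(params)
    if key in store:
        return store[key], "HIT (exact)"
    return None, "MISS"
```

Because the key covers every parameter, changing only the temperature produces a different key and therefore a miss.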

L2: Semantic Cache

The L2 semantic cache stores embeddings of previous requests and performs a k-nearest-neighbour (KNN) search at query time. If a semantically similar request has been cached, the stored response is returned, even if the exact text differs.

Similarity Threshold: 0.92 (default). Minimum cosine similarity score for a KNN match. Higher values require closer semantic matches.
TTL: 3600s (default). Time-to-live for cached entries. Configurable per workflow.
Vector Store: Qdrant, pgvector, Pinecone, or Weaviate. Pluggable vector database backends. Choose based on your infrastructure.
Embedding Model: Configurable. The embedding model used to generate request vectors.

cache:
  l2:
    enabled: true
    vector_store: "qdrant"
    similarity_threshold: 0.92
    ttl: 3600
    embedding_model: "text-embedding-3-small"
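
To make the threshold and TTL settings concrete, here is a minimal sketch of a semantic lookup. It assumes embeddings are plain lists of floats and replaces the vector store with a brute-force scan over an in-memory list; the function names and the tuple layout are hypothetical, and a real deployment would delegate the nearest-neighbour search to the configured vector database.

```python
import math
import time


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(entries, query_vec, threshold=0.92, ttl=3600, now=None):
    """Return the best cached response, or None on a miss.

    entries: list of (vector, response, stored_at) tuples (assumed layout).
    """
    now = time.time() if now is None else now
    best_score, best_response = -1.0, None
    for vector, response, stored_at in entries:
        if now - stored_at > ttl:
            continue  # entry is older than the TTL, so it is ignored
        score = cosine_similarity(query_vec, vector)
        if score > best_score:
            best_score, best_response = score, response
    # Only the nearest neighbour is considered, and only if it clears the threshold.
    return best_response if best_score >= threshold else None
```

Raising the threshold towards 1.0 trades hit rate for precision: fewer queries match, but those that do are closer in meaning to the cached request.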

Model Registry Cache

In addition to request caching, Aurora maintains a model registry cache that stores provider model metadata (available models, pricing, capabilities). This reduces startup time and minimises requests to provider /models endpoints. The registry cache TTL is independently configurable.
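
The idea of a TTL-bound registry cache can be sketched as follows. The class name, the fetch callback, and the injectable clock are all hypothetical; the sketch only shows the general pattern of refetching provider metadata once the TTL has elapsed, not Aurora's actual implementation.

```python
import time


class RegistryCache:
    """Caches the result of `fetch` and refreshes it after `ttl_seconds`."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # callable that queries the provider
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the TTL can be tested
        self._value = None
        self._expires_at = None      # None means nothing has been cached yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache is empty or stale: call the provider once and store the result.
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

With this pattern, repeated reads inside the TTL window never reach the provider, which is where the reduction in /models requests comes from.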

Cache Analytics

Aurora exposes a dedicated /cache/analytics endpoint that provides hit-rate, miss-rate, eviction-count, and average-lookup-time metrics for both L1 and L2 caches. These metrics help you tune thresholds and TTLs for optimal performance and cost savings.
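
For readers unsure how these metrics relate to one another, the sketch below derives them from raw counters. The function name, the input counters, and the output field names are assumptions for illustration; the actual response schema of /cache/analytics is not described here.

```python
def summarise(hits, misses, evictions, total_lookup_ms):
    """Derive rate and latency metrics from raw counters (hypothetical field names)."""
    lookups = hits + misses
    if lookups == 0:
        # No traffic yet: report zeros instead of dividing by zero.
        return {
            "hit_rate": 0.0,
            "miss_rate": 0.0,
            "eviction_count": evictions,
            "average_lookup_time_ms": 0.0,
        }
    return {
        "hit_rate": hits / lookups,
        "miss_rate": misses / lookups,
        "eviction_count": evictions,
        "average_lookup_time_ms": total_lookup_ms / lookups,
    }
```

A low L2 hit rate alongside a high similarity threshold suggests the threshold may be too strict, while a high eviction count can indicate that entries are being dropped before they are reused.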