Browse docs
--- title: "Provider Prompt Caching" description: "Reduce token costs by sending cache_control directives to upstream AI providers. Anthropic, OpenAI-compatible, and Google Gemini caching strategies explained." icon: "cog" ---
Overview
Provider prompt caching reduces token spend by telling the upstream LLM provider which parts of your prompt are reusable across requests. When an upstream provider caches prompt segments (system instructions, tool definitions, conversation history), subsequent requests skip re-encoding those tokens and are billed at a lower cached-input rate.
Aurora translates a unified prompt_cache configuration into each provider's native caching mechanism before forwarding the request. This is separate from gateway-level response caching — response caching avoids upstream calls entirely, while prompt caching reduces the cost of the upstream call.
Supported providers
Configuration
Enable provider prompt caching via YAML:
prompt_cache:
mode: "auto" # "auto", "manual", or "off"
system_prompt_cache: true # Mark system prompts for caching (auto mode)
first_message_cache: true # Mark first user message for caching (auto mode)
tools_cache: false # Mark tool definitions for caching (auto mode)
min_tokens_before_cache: 1024 # Min cumulative tokens before breakpointOr via environment variables:
PROMPT_CACHE_MODE=auto
PROMPT_CACHE_SYSTEM=true
PROMPT_CACHE_FIRST_MESSAGE=true
PROMPT_CACHE_TOOLS=false
PROMPT_CACHE_MIN_TOKENS=1024Modes
off: No cache directives are sent to providers. Provider-level caching is
entirely disabled.
auto(default): The gateway automatically insertscache_control
breakpoints at strategic positions based on the flags above:
- System prompt content gets marked for caching.
- The first user message gets marked for caching.
- Tool definitions get marked for caching (if
tools_cache: true).
manual: The gateway forwards onlycache_controldirectives that were
explicitly set on the request body (via UnknownJSONFields). No automatic breakpoints are injected.
How it works per provider
Anthropic
Anthropic's content-block-level cache_control is the most granular. Aurora sets "cache_control": {"type": "ephemeral"} on:
- Each system-prompt content block (when
system_prompt_cache: true) - The first user message content block (when
first_message_cache: true) - A request-level hint for tool definitions (when
tools_cache: true)
The Anthropic response includes cache_read_input_tokens and cache_creation_input_tokens in its usage block, which Aurora tracks in usage logs.
OpenAI-compatible providers
OpenAI and compatible providers (Azure, DeepSeek, xAI, Groq, OpenRouter, etc.) handle caching primarily at the infrastructure level. Aurora forwards cache_control through ExtraFields on messages and the request body.
These providers typically report cached_tokens or prompt_cached_tokens in their usage response, which Aurora maps to cost calculations automatically.
Google Gemini
Gemini's OpenAI-compatible endpoint follows the same ExtraFields pattern as other OpenAI-compatible providers. For the native Gemini API, a separate cachedContent API with TTL-based expiration is required. Aurora logs a recommendation to create a cached content reference explicitly.
Usage tracking
When provider prompt caching is active, Aurora reads cached-token fields from the provider's usage response and includes them in usage logs and cost calculations.
Cached tokens are typically billed at a lower per-token rate than uncached tokens. Aurora's pricing model includes the following cached-token mappings where supported:
cache_read_input_tokens/cache_creation_input_tokens(Anthropic)cached_tokens/prompt_cached_tokens(OpenAI-compatible)cache_read_input_tokens(Mistral)
Cached-token pricing is defined in the provider's YAML block with cached_input_per_mtok:
providers:
anthropic:
claude-sonnet-4-20250514:
pricing:
input_per_mtok: 3.0
cached_input_per_mtok: 0.30
output_per_mtok: 15.0Combining with gateway response cache
Provider prompt caching and gateway-level response caching work together:
- Gateway response cache (exact + semantic) checks for a cached response
before reaching the provider. On a hit, no upstream call is made and no provider prompt caching occurs.
- Provider prompt caching applies only when the request misses the gateway
cache and is forwarded to the upstream provider. It reduces the token cost of that upstream call.
Both can be enabled simultaneously for maximum cost savings.
Dashboard
In the admin dashboard, navigate to Settings → Caching to see a Provider Prompt Caching section with toggles for mode, system prompt caching, first message caching, tools caching, and the minimum token threshold.