OSS / Policy Controls

Prompt Cache

Browse docs

--- title: "Provider Prompt Caching" description: "Reduce token costs by sending cache_control directives to upstream AI providers. Anthropic, OpenAI-compatible, and Google Gemini caching strategies explained." icon: "cog" ---

Overview

Provider prompt caching reduces token spend by telling the upstream LLM provider which parts of your prompt are reusable across requests. When an upstream provider caches prompt segments (system instructions, tool definitions, conversation history), subsequent requests skip re-encoding those tokens and are billed at a lower cached-input rate.

Aurora translates a unified prompt_cache configuration into each provider's native caching mechanism before forwarding the request. This is separate from gateway-level response caching â€” response caching avoids upstream calls entirely, while prompt caching reduces the cost of the upstream call.

Supported providers

Provider	Mechanism	Cached token fields
Anthropic	Content-block `cache_control` (`ephemeral`)	`cache_read_input_tokens`, `cache_creation_input_tokens`
OpenAI / Azure	Automatic â€” providers handle caching; ExtraFields round-trip for explicit `cache_control`	`cached_tokens`, `prompt_cached_tokens`
Google Gemini	OpenAI-compatible ExtraFields path; native `cachedContent` API (TTL-based)	`cached_tokens`
DeepSeek	Automatic â€” same as OpenAI-compatible	`cached_tokens`
xAI / Grok	Automatic â€” same as OpenAI-compatible	`cached_tokens`
OpenRouter	Automatic â€” same as OpenAI-compatible	`cached_tokens`

Configuration

Enable provider prompt caching via YAML:

yaml

prompt_cache:
  mode: "auto"           # "auto", "manual", or "off"
  system_prompt_cache: true   # Mark system prompts for caching (auto mode)
  first_message_cache: true   # Mark first user message for caching (auto mode)
  tools_cache: false          # Mark tool definitions for caching (auto mode)
  min_tokens_before_cache: 1024  # Min cumulative tokens before breakpoint

Or via environment variables:

env

PROMPT_CACHE_MODE=auto
PROMPT_CACHE_SYSTEM=true
PROMPT_CACHE_FIRST_MESSAGE=true
PROMPT_CACHE_TOOLS=false
PROMPT_CACHE_MIN_TOKENS=1024

Modes

off: No cache directives are sent to providers. Provider-level caching is

entirely disabled.

auto (default): The gateway automatically inserts cache_control

breakpoints at strategic positions based on the flags above:

System prompt content gets marked for caching.
The first user message gets marked for caching.
Tool definitions get marked for caching (if tools_cache: true).

manual: The gateway forwards only cache_control directives that were

explicitly set on the request body (via UnknownJSONFields). No automatic breakpoints are injected.

How it works per provider

Anthropic

Anthropic's content-block-level cache_control is the most granular. Aurora sets "cache_control": {"type": "ephemeral"} on:

Each system-prompt content block (when system_prompt_cache: true)
The first user message content block (when first_message_cache: true)
A request-level hint for tool definitions (when tools_cache: true)

The Anthropic response includes cache_read_input_tokens and cache_creation_input_tokens in its usage block, which Aurora tracks in usage logs.

OpenAI-compatible providers

OpenAI and compatible providers (Azure, DeepSeek, xAI, Groq, OpenRouter, etc.) handle caching primarily at the infrastructure level. Aurora forwards cache_control through ExtraFields on messages and the request body.

These providers typically report cached_tokens or prompt_cached_tokens in their usage response, which Aurora maps to cost calculations automatically.

Google Gemini

Gemini's OpenAI-compatible endpoint follows the same ExtraFields pattern as other OpenAI-compatible providers. For the native Gemini API, a separate cachedContent API with TTL-based expiration is required. Aurora logs a recommendation to create a cached content reference explicitly.

Usage tracking

When provider prompt caching is active, Aurora reads cached-token fields from the provider's usage response and includes them in usage logs and cost calculations.

Cached tokens are typically billed at a lower per-token rate than uncached tokens. Aurora's pricing model includes the following cached-token mappings where supported:

cache_read_input_tokens / cache_creation_input_tokens (Anthropic)
cached_tokens / prompt_cached_tokens (OpenAI-compatible)
cache_read_input_tokens (Mistral)

Cached-token pricing is defined in the provider's YAML block with cached_input_per_mtok:

yaml

providers:
  anthropic:
    claude-sonnet-4-20250514:
      pricing:
        input_per_mtok: 3.0
        cached_input_per_mtok: 0.30
        output_per_mtok: 15.0

Combining with gateway response cache

Provider prompt caching and gateway-level response caching work together:

Gateway response cache (exact + semantic) checks for a cached response

before reaching the provider. On a hit, no upstream call is made and no provider prompt caching occurs.

Provider prompt caching applies only when the request misses the gateway

cache and is forwarded to the upstream provider. It reduces the token cost of that upstream call.

Both can be enabled simultaneously for maximum cost savings.

Dashboard

In the admin dashboard, navigate to Settings â†’ Caching to see a Provider Prompt Caching section with toggles for mode, system prompt caching, first message caching, tools caching, and the minimum token threshold.

← All docs

OSS / Policy Controls

Prompt Cache

Browse docs

Overview

Supported providers

Provider	Mechanism	Cached token fields
Anthropic	Content-block `cache_control` (`ephemeral`)	`cache_read_input_tokens`, `cache_creation_input_tokens`
OpenAI / Azure	Automatic â€” providers handle caching; ExtraFields round-trip for explicit `cache_control`	`cached_tokens`, `prompt_cached_tokens`
Google Gemini	OpenAI-compatible ExtraFields path; native `cachedContent` API (TTL-based)	`cached_tokens`
DeepSeek	Automatic â€” same as OpenAI-compatible	`cached_tokens`
xAI / Grok	Automatic â€” same as OpenAI-compatible	`cached_tokens`
OpenRouter	Automatic â€” same as OpenAI-compatible	`cached_tokens`

Configuration

Enable provider prompt caching via YAML:

yaml

prompt_cache:
  mode: "auto"           # "auto", "manual", or "off"
  system_prompt_cache: true   # Mark system prompts for caching (auto mode)
  first_message_cache: true   # Mark first user message for caching (auto mode)
  tools_cache: false          # Mark tool definitions for caching (auto mode)
  min_tokens_before_cache: 1024  # Min cumulative tokens before breakpoint

Or via environment variables:

env

PROMPT_CACHE_MODE=auto
PROMPT_CACHE_SYSTEM=true
PROMPT_CACHE_FIRST_MESSAGE=true
PROMPT_CACHE_TOOLS=false
PROMPT_CACHE_MIN_TOKENS=1024

Modes

off: No cache directives are sent to providers. Provider-level caching is

entirely disabled.

auto (default): The gateway automatically inserts cache_control

breakpoints at strategic positions based on the flags above:

System prompt content gets marked for caching.
The first user message gets marked for caching.
Tool definitions get marked for caching (if tools_cache: true).

manual: The gateway forwards only cache_control directives that were

explicitly set on the request body (via UnknownJSONFields). No automatic breakpoints are injected.

How it works per provider

Anthropic

Anthropic's content-block-level cache_control is the most granular. Aurora sets "cache_control": {"type": "ephemeral"} on:

Each system-prompt content block (when system_prompt_cache: true)
The first user message content block (when first_message_cache: true)
A request-level hint for tool definitions (when tools_cache: true)

The Anthropic response includes cache_read_input_tokens and cache_creation_input_tokens in its usage block, which Aurora tracks in usage logs.

OpenAI-compatible providers

These providers typically report cached_tokens or prompt_cached_tokens in their usage response, which Aurora maps to cost calculations automatically.

Google Gemini

Usage tracking

When provider prompt caching is active, Aurora reads cached-token fields from the provider's usage response and includes them in usage logs and cost calculations.

Cached tokens are typically billed at a lower per-token rate than uncached tokens. Aurora's pricing model includes the following cached-token mappings where supported:

cache_read_input_tokens / cache_creation_input_tokens (Anthropic)
cached_tokens / prompt_cached_tokens (OpenAI-compatible)
cache_read_input_tokens (Mistral)

Cached-token pricing is defined in the provider's YAML block with cached_input_per_mtok:

yaml

providers:
  anthropic:
    claude-sonnet-4-20250514:
      pricing:
        input_per_mtok: 3.0
        cached_input_per_mtok: 0.30
        output_per_mtok: 15.0

Combining with gateway response cache

Provider prompt caching and gateway-level response caching work together:

Gateway response cache (exact + semantic) checks for a cached response

before reaching the provider. On a hit, no upstream call is made and no provider prompt caching occurs.

Provider prompt caching applies only when the request misses the gateway

cache and is forwarded to the upstream provider. It reduces the token cost of that upstream call.

Both can be enabled simultaneously for maximum cost savings.