Skip to main content
Version: Next 🚧

Anthropic Provider

AnthropicProvider wraps the official anthropic SDK against the Messages API. It supports streaming, extended thinking, prompt caching, and tool use.

Construction​

from cubepi.providers.anthropic import AnthropicProvider

provider = AnthropicProvider(
api_key="sk-ant-â€Ļ", # or read from os.environ["ANTHROPIC_API_KEY"]
base_url=None, # set to point at a proxy / Bedrock-compatible endpoint
cache_retention="short", # "short" (5 min, default) | "long" (1 h) | "none"
)

api_key=None lets the underlying SDK read ANTHROPIC_API_KEY from the environment.

The Model​

from cubepi import Model

model = Model(
id="claude-sonnet-4-5-20250929",
provider="anthropic",
reasoning=True, # enables thinking levels (see below)
max_tokens=8192, # response cap
context_window=200_000, # hard model limit
temperature=0.7,
)

The id is the model name exactly as you'd pass it to the SDK. The provider string is a free-form label used by CubePi internals — keep it stable across your codebase but it doesn't have to match "anthropic" literally.

Extended thinking​

CubePi maps a ThinkingLevel enum onto Anthropic's budget_tokens:

LevelDefault budget
"off"thinking disabled
"minimal"1024
"low"2048
"medium"8192
"high"16384
"xhigh"clamped to "high"

Set it per-agent:

agent = Agent(provider=provider, model=model, thinking="medium")

To change the per-level budgets, supply a CapabilityDescriptor with a reasoning_level of kind="int_budget" at construction — its level_budgets map is the single source of truth for budget values.

from cubepi import CapabilityDescriptor, ReasoningLevelSpec
from cubepi.providers.anthropic import AnthropicProvider

provider = AnthropicProvider(
api_key="sk-ant-â€Ļ",
capability=CapabilityDescriptor(
reasoning_off_payload={"thinking": {"type": "disabled"}},
reasoning_on_payload={"thinking": {"type": "enabled"}},
reasoning_level=ReasoningLevelSpec(
path="thinking.budget_tokens",
kind="int_budget",
level_budgets={"off": 0, "minimal": 1024, "low": 4096,
"medium": 12288, "high": 16384, "xhigh": 16384},
),
),
)
Changed in 0.5

AnthropicProvider no longer honors StreamOptions(thinking_budgets=â€Ļ) — the capability descriptor's level_budgets is now the only way to override budgets.

When thinking is on, CubePi omits temperature because the Anthropic API rejects non-default temperatures alongside extended thinking (feature compatibility). Set Model.temperature to the value you want when thinking is off; CubePi handles the rest.

Thinking content streams as thinking_start / thinking_delta / thinking_end events and ends up in AssistantMessage.content as ThinkingContent blocks, preserved on subsequent turns so the model keeps continuity.

Prompt caching​

By default the provider marks three cache breakpoints on each request:

  • The system prompt (most stable).
  • The last tool definition (changes rarely).
  • The last message (the cache moves forward each turn so prior history stays warm).

Cache retention is "short" (5 minutes, free). Bump to "long" if your turns are slower than that:

AnthropicProvider(api_key=â€Ļ, cache_retention="long") # 1-hour TTL
AnthropicProvider(api_key=â€Ļ, cache_retention="none") # disable entirely

The Usage object on each AssistantMessage reports cache_read_tokens and cache_write_tokens so you can see your hit rate.

For custom cache strategies (a different breakpoint policy), implement the CacheMarkerPolicy Protocol and pass cache_policy=â€Ļ. The default policy lives at cubepi.providers.anthropic.DefaultCacheMarkerPolicy.

Custom payloads with on_payload​

on_payload lets you inspect or replace the request dict right before it's sent:

async def my_payload(payload, model):
payload.setdefault("metadata", {})["user_id"] = "u-42"
return payload # return None or no return to keep the original

agent = Agent(provider=provider, model=model, on_payload=my_payload)

Use this for: adding metadata.user_id (for billing), forcing beta-header flags, or tracking the exact payloads you sent for a debug pane.

Custom response handling with on_response​

on_response fires after the HTTP response is received (status, headers), before streaming begins:

async def my_response(resp, model):
if resp.status >= 400:
logger.warning("bad status %s", resp.status)
rate = resp.headers.get("anthropic-ratelimit-requests-remaining")
if rate is not None:
metrics.gauge("rate_remaining", int(rate))

agent = Agent(provider=provider, model=model, on_response=my_response)

Both callbacks may be sync or async.

Pointing at Bedrock / Vertex / proxies​

The Anthropic SDK accepts a base_url; CubePi forwards it:

provider = AnthropicProvider(
api_key="â€Ļ",
base_url="https://my-litellm.internal/v1",
)

For Bedrock specifically, use the anthropic-bedrock adapter directly and inject it via a custom provider.

Common pitfalls​

  • temperature ignored — Expected. CubePi drops it when thinking is on; that's an API constraint, not a bug.
  • xhigh looks the same as high — Anthropic doesn't expose a budget tier above high, so CubePi clamps xhigh down. The token budget is the same.
  • Cache misses you didn't expect — Caches are keyed by content + ttl. Changing the system prompt invalidates everything; changing the tool list invalidates from the tools onward. Make those two stable across turns to maximise hits.
  • anthropic.RateLimitError — Propagates as a stream error event with the SDK's str(exc). Catch in agent_end and decide whether to retry.

See also​