Model guide

Pick the right model for the job.

AI model API costs vary based on usage. This page is a snapshot of the leading models across the five categories an AI agent typically calls: text, image, video, text-to-speech (TTS) and speech-to-text (STT). Use it to estimate what your agent will cost to run, and pick the model that fits your work.

Last updated 2026-05-11 · Editorial 0–100 ratings, see methodology below.

How we picked these numbers

Prices and capabilities for text models come from the OpenRouter API. OpenRouter is our source of truth because it's the catalog ZTC routes through via OpenClaw. Prices for image, video, TTS and STT come from each provider's pricing page (Artificial Analysis, fal.ai, Replicate, ElevenLabs, Deepgram, AssemblyAI, OpenAI, Google, AWS).

Intelligence Index and output speed (tokens per second) come from the Artificial Analysis leaderboard's steady-state measurements. Aider polyglot coding scores come from the Aider leaderboard. Image and video quality readouts come from Artificial Analysis's category arenas (human-preference voting). Word Error Rate (WER) for STT comes from the Artificial Analysis STT leaderboard.

Cost score (the bar in the Cost score column) is computed automatically per modality. We log-scale the prices in each category and map them onto a 10–95 range so the cheapest model lands near 95 and the most expensive near 10. Cost scores are not comparable across modalities (a 90 cost score in text and a 90 in video aren't measuring the same thing).
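The log-scaling step can be sketched roughly as follows. This is a minimal illustration, not the production code: it assumes a plain min-max normalisation of log10 prices within one modality, and the function name `cost_scores`, the model names and the prices are all hypothetical.

```python
import math


def cost_scores(prices, lo=10.0, hi=95.0):
    """Map prices from ONE modality onto [lo, hi] on a log scale.

    Assumption: linear min-max scaling of log10(price), so the cheapest
    model gets `hi` and the most expensive gets `lo`. Prices must be > 0.
    """
    logs = {name: math.log10(price) for name, price in prices.items()}
    min_log, max_log = min(logs.values()), max(logs.values())
    span = max_log - min_log

    scores = {}
    for name, value in logs.items():
        # position: 0.0 = cheapest, 1.0 = most expensive (log scale)
        position = 0.0 if span == 0 else (value - min_log) / span
        scores[name] = round(hi - position * (hi - lo), 1)
    return scores


# Hypothetical prices in USD per 1M output tokens (illustrative only).
example_prices = {"model-a": 0.10, "model-b": 1.00, "model-c": 10.00}
print(cost_scores(example_prices))
```

Because the minimum and maximum are taken within a single category, the same score in two modalities can correspond to very different dollar amounts. Free models (price 0) would need separate handling, since the logarithm of zero is undefined.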

Overall rating (the bar in the Overall rating column) is editorial on a 0–100 scale. It's anchored on the benchmark numbers above with a small editorial layer (typically ±5–10 points) for things benchmarks don't capture: tool-use reliability on long agent runs, ecosystem maturity, real-world price-quality balance, and writing tone. Per-modality anchors:

Text: weighted blend of Intelligence Index (60%) and Aider polyglot (40%).
Image / video: Artificial Analysis arena (human-preference) quality, blended with our editorial read.
TTS: quality Elo where Artificial Analysis publishes it, otherwise editorial.
STT: inverse WER (lower error → higher rating).
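As a rough illustration of the text anchor, the sketch below applies the 60/40 blend and then an editorial adjustment. It assumes both inputs are already on a 0–100 scale and that the final rating is clamped to 0–100; the function name and the sample numbers are hypothetical, not real model scores.

```python
def text_overall_rating(intelligence_index, aider_polyglot, editorial_adjustment=0.0):
    """Blend the two benchmark anchors, then apply the editorial layer.

    Assumptions: both benchmarks are on a 0-100 scale, and the result
    is clamped so it stays within 0-100.
    """
    anchor = 0.6 * intelligence_index + 0.4 * aider_polyglot
    return max(0.0, min(100.0, anchor + editorial_adjustment))


# Hypothetical benchmark values, for illustration only.
benchmark_only = text_overall_rating(70, 80)
with_editorial = text_overall_rating(70, 80, editorial_adjustment=5)
print(round(benchmark_only, 1), round(with_editorial, 1))
```

The editorial adjustment is kept as a separate argument so the benchmark-derived anchor and the human judgment stay distinguishable.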

Per-category ratings (the chips inside each expanded model panel) are also 0–100. Where a category has an obvious benchmark anchor we use it (Coding → Aider, Reasoning → Intelligence Index, Accuracy → inverse WER); the rest are editorial. Calibration band: 90+ best-in-class · 80–89 top-tier · 70–79 solid · 60–69 mid-tier · 50–59 below frontier · under 50 niche.
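The calibration band reads as a simple lookup. A minimal sketch, assuming band boundaries fall on the whole-number thresholds listed above; the function name `calibration_band` is hypothetical.

```python
def calibration_band(score):
    """Return the calibration label for a 0-100 rating."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")

    # (inclusive lower bound, label), checked from the top band down
    bands = [
        (90, "best-in-class"),
        (80, "top-tier"),
        (70, "solid"),
        (60, "mid-tier"),
        (50, "below frontier"),
    ]
    for floor, label in bands:
        if score >= floor:
            return label
    return "niche"


for rating in (93, 80, 67, 42):
    print(rating, calibration_band(rating))
```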

ZTC ratings are editorial. They are not derived from a single benchmark and they are not vendor-neutral. Two reasonable people could disagree on individual scores. The Confluence page INTERNAL/9338881 is the editorial source of truth.

Known gaps and caveats

Aider polyglot lags new releases by several weeks. The latest snapshot rated the GPT-5 and Claude Opus 4 families but not every 2026 variant. Estimated values are marked with a tilde.
LiveBench and LMArena Elo were not fetched cleanly in this pass. A follow-up should add LiveBench reasoning/coding/math sub-scores and LMArena human-preference Elo.
Cached input pricing is offered by several providers (Anthropic, OpenAI) but isn't surfaced uniformly. Treat the listed input price as the uncached rate.
Reasoning-mode billing differs across providers. Some bill reasoning tokens as output, some have separate metering. Long thinking sessions can blow past the headline price.
Image pricing assumes 1024×1024 standard quality. High-resolution or "HD" tiers can be 2–4× the listed price.
Video price-per-minute is a relative metric. Artificial Analysis blends clip length and resolution, while per-clip billing varies by provider and resolution. Use the per-minute price for comparison, not for an exact bill estimate.
TTS quality benchmarks are subjective. Arena Elo comes from human-preference voting and shifts week to week. Audition the top picks for your actual voice and content before committing.
STT WER depends on the test set. Real-world WER varies with audio quality, accents and domain vocabulary. The relative ranking holds; absolute numbers are conservative.
The open weights flag reflects the publishing organisation's intent. A few "open" models have non-commercial licences. Verify the licence terms before relying on a model in production.
Cost score is per-modality, not cross-modality. A 92 cost score in text and a 92 in video are not equivalent in dollars; they're both just "near the cheapest within this category".
The editorial "Cost" rating dimension (in image/video/TTS/STT per-category chips) overlaps with the auto-computed Cost score. We keep both: the editorial Cost is our human read on price-for-quality, the Cost score is the raw price ranking within the category.
Scope is OpenRouter plus selected integrations. Most models are routable via OpenRouter, the catalog ZTC exposes through OpenClaw. A small number of best-in-class direct-provider models (currently ElevenLabs for TTS) are surfaced when ZTC has a first-class integration with the provider. They sit in the same comparison tables but are flagged with an amber "via [Provider]" chip and a tinted row so you can tell at a glance which path the call takes. Other direct-provider models (Deepgram, AssemblyAI, Ideogram, Midjourney, Runway, etc.) remain excluded until we ship integrations for them.
Snapshot, not live. This page is updated periodically from landing/src/data/model_metrics.json (see the "Last updated" date in the header). Frontier model releases between updates won't show up until the next refresh.
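To see how the reasoning-mode caveat above affects an estimate, here is a minimal sketch of a per-call cost calculation. It assumes the provider bills reasoning tokens at the output rate (one of the billing schemes mentioned above) and uses the uncached input price. The function name, token counts and prices are hypothetical, not taken from any real model.

```python
def estimate_call_cost(input_tokens, output_tokens, reasoning_tokens,
                       input_price_per_m, output_price_per_m):
    """Estimate the USD cost of one text call.

    Assumptions: prices are USD per 1M tokens, input is uncached, and
    reasoning tokens are billed as output tokens.
    """
    billed_output = output_tokens + reasoning_tokens
    input_cost = input_tokens * input_price_per_m / 1_000_000
    output_cost = billed_output * output_price_per_m / 1_000_000
    return input_cost + output_cost


# Hypothetical call: 20k input tokens, 1k visible output, 8k reasoning tokens,
# at a hypothetical $3 / $15 per 1M input / output tokens.
headline = estimate_call_cost(20_000, 1_000, 0, 3.00, 15.00)
with_reasoning = estimate_call_cost(20_000, 1_000, 8_000, 3.00, 15.00)
print(f"headline: ${headline:.3f}, with reasoning: ${with_reasoning:.3f}")
```

In this made-up case the reasoning tokens more than double the per-call cost, which is why long thinking sessions are worth budgeting for separately. Providers with separate reasoning metering would need a third price argument.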

Ready to run yours?

We set up your AI agent with the model that fits your workload, secured and connected to your tools. Live in 48 hours.

See pricing →