🖥️ Best local setup
Qwen3.5 35B A3B (Reasoning)
2.9M
quality-adjusted tokens per $
💳 Best subscription ⚠️
GPT-5.4 (xhigh)
8.2M
quality-adjusted tokens per $ · estimated
🔌 Best API deal
GLM-4.7 Flash
2.8M
quality-adjusted tokens per $
Supported by Desktop Commander — model-agnostic AI that works with local models, API keys, and subscriptions.
📐 How we calculate this

One number. Three categories.

Not all tokens are equal — a token from a smarter model is worth more than a token from a weaker one. To capture that, we multiply raw token volume by a quality score and divide by cost.

Quality is a z-score-normalized blend of three public benchmarks: Arena text ELO (human preference on general tasks), Arena code ELO (human preference on coding), and Artificial Analysis Intelligence Index (composite of 10 academic evals — MMLU-Pro, GPQA, LiveCodeBench, and others). The result is expressed as a 0–100 percentage.

The result: quality-adjusted tokens per dollar — one metric that puts a $3,500 GPU, a $20/month plan, and a $0.07/M API on the same axis.

🖥️ Local Hardware

(tok/s × 3,600 s/h × hours/day × 365 × years × quality %) ÷ hardware cost ($) = quality tokens / $

One-time hardware cost amortized over 3 years. tok/s figures come from real-world measurements via Desktop Commander telemetry (actual user sessions across many hardware configurations). ⚠️ Does not yet include electricity costs (typically $5–$60/month depending on hardware and usage).
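Spelled out as code — a minimal sketch where the variable names and example figures are illustrative, and the 3,600 factor converts tok/s into tokens per hour:

```python
def local_value(tok_per_s, hours_per_day, years, quality_pct, hardware_cost):
    """Quality-adjusted tokens per dollar for a locally hosted model.

    The one-time hardware cost is amortized over `years`;
    electricity is deliberately excluded, matching the caveat above.
    """
    lifetime_tokens = tok_per_s * 3_600 * hours_per_day * 365 * years
    return lifetime_tokens * (quality_pct / 100) / hardware_cost

# Illustrative numbers: 110 tok/s, 8 h/day, 3 years, 55% quality, $1,500 GPU
print(local_value(110, 8, 3, 55, 1_500))  # ≈ 1.27M quality tokens per $
```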

💳 Subscription

(tokens/week limit × 4 weeks × quality %) ÷ monthly price ($) = quality tokens / $

⚠️ Token limits are empirically measured — providers don't publish exact numbers.

🔌 API (pay per token)

(1,000,000 tokens × quality %) ÷ price/M (input% × in + output% × out) = quality tokens / $

Adjustable input/output ratio — defaults to 75/25 for general use. Coding workloads are ~90/10 (long context, short output), chat is ~50/50, RAG is ~95/5.
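As a sketch of the blended-price calculation (the prices here are made-up illustrative rates, not any specific model's):

```python
def api_value(input_price, output_price, quality_pct, input_frac=0.75):
    """Quality-adjusted tokens per dollar at pay-per-token API rates.

    Prices are $ per million tokens; `input_frac` is the input share
    of traffic (0.75 = the 75/25 general-use default).
    """
    blended = input_price * input_frac + output_price * (1 - input_frac)
    return 1_000_000 * (quality_pct / 100) / blended

# A coding workload skews input-heavy, so the same model looks cheaper:
general = api_value(0.50, 2.00, 60)                   # 75/25 split
coding = api_value(0.50, 2.00, 60, input_frac=0.90)   # 90/10 split
```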

Quality scores: Arena ELO + AA Intelligence Index — the two benchmarks that remain comparable across model generations. See the Benchmarks section for why other benchmarks can't fairly compare across eras.

🎯 Which AI is best for…

Pick the right AI for your use case

Different tasks reward different models. Coding benchmarks don't match writing benchmarks, and a model that crushes academic reasoning might not be the best local option for a 24GB GPU. Here's a quick guide based on the data below — click any link to jump to the filtered view.

💻 Best LLM for coding

Coding tasks are typically input-heavy — you send a lot of context, the model writes a short diff. Use the 90/10 coding I/O ratio to see which models give you the most coding tokens per dollar. Claude Sonnet and Opus dominate Arena's code ELO leaderboard, but open models like Qwen3-Coder and GPT-OSS offer better local value if you have the VRAM. For subscriptions, ChatGPT Business at $30/seat/month crushes the value ranking — measured at ~60M tokens/week.

See coding-ratio ranking →

✍️ Best AI for writing and copywriting

Writing tasks lean output-heavy — short prompts, long generations. For subscriptions, this tips toward plans with generous output allowances; for APIs, it favors models with low output-token pricing. GPT-5 Chat and Claude Opus rank highest on Arena's text leaderboard for general writing quality. If cost is the priority, Gemini Flash and GLM-4.7 Flash offer strong quality at API rates below $1 per million tokens.

Compare writing-heavy ratios →

🖥️ Best local LLM for your GPU

If you're running models locally, the question isn't just which model is best — it's which combination of model, hardware, and quantization. Our data includes real tokens/second measurements from Desktop Commander users across 36 hardware configurations. An RTX 3090 at Q4_K_M runs Llama 3.1 70B around 25 tok/s; an M3 Max with 128GB unified memory handles Qwen3.5 35B A3B at 14+ tok/s. The best local value right now is Qwen3.5 35B A3B (Reasoning) — high quality, modest VRAM, strong throughput.

Browse hardware benchmarks →

🤔 ChatGPT vs Claude — which is better value in 2026?

They're close on quality — Claude Opus 4.7 and GPT-5.5 trade the top spot depending on benchmark. Value differs more by plan than by model. ChatGPT Plus ($20/mo) measures ~190M tokens/week on GPT-5.4 via Codex (multi-flip method, supersedes our earlier 13M figure which used single-flip extrapolation). Claude Pro ($20/mo) measures ~15.6M tokens/week on Opus 4.7 — our first direct Pro measurement, roughly 3× our earlier estimate. Claude Max 20× ($200/mo) measures ~388M tokens/week on Sonnet 4.6 or ~248M on Opus 4.7 — the highest quotas we've seen on a personal plan, but you need to use it heavily to get the value. For light users, Plus wins; for heavy coders, Max 20× wins; for teams, ChatGPT Business at $30/seat is the sleeper pick.

See side-by-side →

💳 ChatGPT Plus vs Pro vs Business — tokens per dollar

ChatGPT Plus ($20/mo) measures ~190M tokens/week on GPT-5.4 (multi-flip method, Apr 24). ChatGPT Business ($30/seat/month) measures ~60M tokens/week on GPT-5.4 — but that number comes from our older single-flip method (Apr 15) and hasn't been re-measured with the newer multi-flip approach yet, so direct Plus-vs-Business comparisons should be treated with caution. ChatGPT Pro ($200/mo) is listed at 66.5M/week on OpenAI's Codex pricing page, not directly measured. The most defensible single finding: if you're a heavy individual user, Plus at $20 delivers genuinely large capacity; if you need guaranteed per-seat quotas for a team, Business is sized for that use case. Pro's raw tokens/$ ratio is worse than either.

See measurement methodology →

⚡ Claude Max 5× vs 20× — is the upgrade worth it?

We measured Claude Max 20× at ~388M tokens/week on Sonnet 4.6 and ~248M tokens/week on Opus 4.7 via Claude Code — Opus has a tighter per-model sub-quota, so it delivers about 64% of what Sonnet gives on the same plan. Both numbers are from multi-flip runs on Apr 24, 2026 (replacing our earlier Sonnet 4.5 single-flip figure of 203M). The 5× plan is estimated at ~97M/week (Sonnet) or ~62M/week (Opus) by ratio (5/20 × measured 20×), not directly measured. At $100 vs $200/month, Max 5× has the better raw tokens/$ ratio if you won't hit the cap. Max 20× wins only if you're doing sustained heavy work — Claude Code all day, multi-agent workflows, or running Opus on large contexts. For most users, Max 5× is the sweet spot; for power users, 20× removes the rate-limit friction.

See quota measurements →

🤖 For AI agents

Use this with your AI agent

This site is a website, a dataset, and an agent skill in one repo. You can read it yourself, or point your AI agent at it — the same data and logic, just delegated. Your agent picks up an installable skill that knows how to fetch live numbers and reason about plans.

🧭 Get recommendations from your agent

Install the ai-value-advisor skill once. Then ask your agent things like "which AI should I pay for this month?" or "is Claude Max 20× worth it for me?" — it fetches live data from this site, considers your usage and budget, and recommends with caveats. Works with Claude Code, Cursor, Codex, Copilot, Windsurf, and 30+ other agents via the skills CLI.

npx skills add desktop-commander/best-value-ai

📏 Contribute a measurement

Most subscription plans don't publish their real quotas. The submit-usage-measurement skill in this repo can run a standardized benchmark on your Claude Code or Codex CLI, capture the actual tokens you get, and open a pull request. If you have a plan we haven't measured yet (Claude Pro, ChatGPT Team, an edu/student discount…), this is the easiest way to add it to the dataset.

See both skills on GitHub →

📊 Full data

Explore the numbers

🏆 Ranking — quality-adjusted tokens per dollar

📐 How these numbers are calculated
Local: hardware cost is amortized over the years you set below, assuming you use it the chosen hours/day every day. Idle hardware = wasted capacity. Electricity not yet included ($5–$60/mo typical).
Subscription: assumes you use 100% of your weekly quota. If you only use half, your real value is half what's shown.
API: pay-per-token, no assumptions. What you spend is what you get.
Token limits are empirically measured — providers don't publish exact numbers.

All options ranked by quality-adjusted tokens per dollar

#1 💳 Sub ChatGPT Pro $200 → GPT-5.4 (xhigh)
49.1M
ChatGPT Pro $200 · $200/mo · 📐 est. (64.5%)
#2 💳 Sub ChatGPT Pro $200 → GPT-5.2
43.9M
ChatGPT Pro $200 · $200/mo · 📐 est. (57.7%)
#3 💳 Sub ChatGPT Pro $200 → GPT-5.2 Codex (xhigh)
39.4M
ChatGPT Pro $200 · $200/mo · 📐 est. (51.8%)
#4 💳 Sub ChatGPT Pro $200 → GPT-4o
29.7M
ChatGPT Pro $200 · $200/mo · 📐 est. (39.1%)
#5 💳 Sub ChatGPT Pro $200 → GPT-5.5 (xhigh)
28.6M
ChatGPT Pro $200 · $200/mo · 📐 est. (75.0%)
#6 💳 Sub ChatGPT Plus → GPT-5.4 (xhigh)
24.5M
ChatGPT Plus · $20/mo (64.5%)
#7 💳 Sub ChatGPT Pro $100 → GPT-5.4 (xhigh)
24.5M
ChatGPT Pro $100 · $100/mo · 📐 est. (64.5%)
#8 💳 Sub ChatGPT Plus → GPT-5.2
22.0M
ChatGPT Plus · $20/mo (57.7%)
#9 💳 Sub ChatGPT Pro $100 → GPT-5.2
22.0M
ChatGPT Pro $100 · $100/mo · 📐 est. (57.7%)
#10 💳 Sub ChatGPT Plus → GPT-5.2 Codex (xhigh)
19.7M
ChatGPT Plus · $20/mo (51.8%)
#11 💳 Sub ChatGPT Pro $100 → GPT-5.2 Codex (xhigh)
19.7M
ChatGPT Pro $100 · $100/mo · 📐 est. (51.8%)
#12 💳 Sub ChatGPT Plus → GPT-4o
14.9M
ChatGPT Plus · $20/mo (39.1%)
#13 💳 Sub ChatGPT Pro $100 → GPT-4o
14.9M
ChatGPT Pro $100 · $100/mo · 📐 est. (39.1%)
#14 💳 Sub ChatGPT Plus → GPT-5.5 (xhigh)
14.3M
ChatGPT Plus · $20/mo (75.0%)
#15 💳 Sub ChatGPT Pro $100 → GPT-5.5 (xhigh)
14.3M
ChatGPT Pro $100 · $100/mo · 📐 est. (75.0%)
#16 💳 Sub Claude Max 20× → Claude Sonnet 4.6 (Non-reasoning, High Effort)
4.9M
Claude Max 20× · $200/mo (63.4%)
#17 💳 Sub Claude Max 20× → Claude 4.5 Sonnet (Non-reasoning)
4.0M
Claude Max 20× · $200/mo (51.8%)
#18 💳 Sub ChatGPT Business → GPT-5.4 (xhigh)
4.0M
ChatGPT Business · $30/mo (64.5%)
#19 💳 Sub ChatGPT Business → GPT-5.2
3.6M
ChatGPT Business · $30/mo (57.7%)
#20 💳 Sub Claude Max 20× → Claude Sonnet 4
3.5M
Claude Max 20× · $200/mo (45.6%)

⚔️ Compare — any two options side-by-side

VS

📈 Timeline — how value has changed over time

Value over time by provider

Subscription series only includes plans we've directly measured (ChatGPT Plus, ChatGPT Business, Claude Pro, Claude Max 20×). Plans we haven't measured — ChatGPT Pro, Claude Max 5×, Gemini Advanced — appear in the main rank list with a 📐 est. badge but aren't plotted here.

Subscription tokens over time

How many tokens each plan gives you per day, based on our empirical measurements.

📏 Why benchmarks are hard

Not all benchmarks age equally.

Comparing GPT-3.5 to GPT-5.4 using benchmark scores sounds simple. It isn't. The tests used to measure models in 2023 are mostly useless today — either saturated or discontinued. This matters for any long-term value comparison.

📈

Saturation

MMLU (2020) and HumanEval (2021) were rigorous tests when introduced. Today GPT-4 scores 87% on MMLU, GPT-5 scores ~90%. A 3-point gap under a 100% ceiling tells you almost nothing. The benchmark is broken as a signal, not the models.

MMLU — saturated · HumanEval — saturated
🎯

Different tests, different eras

SWE-bench Verified launched in 2024. Aider Polyglot in 2024. GPQA Diamond in 2023. Models from 2022 were never measured on these. You can't compare GPT-3.5's MMLU score to GPT-5's SWE-bench score — they're measuring different things with different scales.

SWE-bench — 2024+ · Aider — 2024+ · GPQA — 2023+
🧪

Training contamination

Benchmarks become worthless once labs train on them. Questions leak into pretraining data, benchmark scores stop reflecting real capability. This is why new benchmarks are invented constantly — and why scores from 2022 are especially suspect.

GSM8K — contaminated · Most 2022 benchmarks
✅ What actually works across time

Two signals have been collected continuously since 2023 using the same methodology. They measure different things, but together they give a consistent cross-generation quality score.

⚔️
Arena ELO since early 2023

Humans pick which model response is better in blind head-to-head comparisons. The ELO rating system means GPT-3.5 and GPT-5.4 are measured on the exact same scale — not by what questions they answered, but by how humans prefer their outputs relative to each other.

✓ Same methodology since launch
✓ Covers general + coding separately
✓ Can't be trained on directly
~ Reflects human preference, not task accuracy
🔬
AA Intelligence Index since 2023

Artificial Analysis runs their own evaluations on every major model using consistent infrastructure and aggregates them into a single 0–100 composite. Unlike leaderboard scores that depend on who submitted, AA re-runs everything themselves on the same hardware.

✓ Independently run, not self-reported
✓ Composite — not reliant on a single test
✓ Covers models back to GPT-3.5 era
~ Methodology updates occasionally
How we combine them
(z-scored Arena ELO (0–100) + z-scored AA Intelligence Index (0–100)) ÷ 2 = Stable Quality Score

Both scores are z-score normalized: (score − mean) / std × 15 + 50, centering each at 50 on its own distribution. This ensures Arena ELO and AA Intelligence contribute equally to the average — without normalization, Arena's larger numbers would dominate. The two normalized scores are then averaged. If only one is available for a model, that single score is used. Task-specific benchmarks (SWE-bench, Aider, etc.) are shown in raw data but not used in the main value calculation — they can't fairly compare across model generations.
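A sketch of that blend, assuming sample standard deviation (the constants 15 and 50 are the ones stated above; function names are illustrative):

```python
from statistics import mean, stdev

def z_normalize(scores: dict[str, float]) -> dict[str, float]:
    """(score - mean) / std * 15 + 50, applied per benchmark column."""
    mu, sigma = mean(scores.values()), stdev(scores.values())
    return {model: (s - mu) / sigma * 15 + 50 for model, s in scores.items()}

def stable_quality(arena_norm, aa_norm):
    """Average the two normalized signals; fall back to whichever exists."""
    available = [s for s in (arena_norm, aa_norm) if s is not None]
    return sum(available) / len(available)
```

After normalization, a model sitting exactly at the mean of its column scores 50, so Arena's four-digit ELOs and AA's 0–100 index carry equal weight in the average.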

🧭 Model × Benchmark matrix

Which models are good at what?

The single-number rankings above blend quality into one score. But models win on different axes — math, coding, long-context reasoning, agentic work. This matrix shows each model's score across benchmarks, color-coded per column by z-score: green = top, red = bottom, hatched = no data.

Loading matrix…
💳 How we measure subscription quotas

Providers don't publish token limits. We measure them.

Neither OpenAI nor Anthropic publish exact daily token quotas for subscriptions. Both use rolling 5-hour windows and weekly limits expressed as percentages — not absolute numbers. So we measure empirically.

1. Run a standardized task

We run the same coding task (doubly-linked list + 10 tests) through Codex CLI or Claude Code with --json output, which gives exact token counts per API turn — input, cached, output, and reasoning tokens.

2. Read quota before & after

The CLI's /status command shows your 5-hour and weekly limits as percentages. We record these before and after the task. The delta tells us what fraction of the quota our known token count consumed.

3. Calculate total quota

total_quota = tokens_consumed ÷ (pct_consumed / 100)
Example: 2M tokens consumed 6% of the weekly limit → weekly quota ≈ 33M tokens. Formula: weekly × 4 × quality ÷ monthly price.

Token counts include system prompt (~70K), cached input, reasoning overhead, and tool calls — not just user-visible output. Reasoning effort matters: xhigh uses ~1.7× as many tokens as medium for the same task. Full methodology →
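Steps 1–3 condense to a few lines; the example numbers are the ones from step 3 above:

```python
def weekly_quota(tokens_consumed, pct_of_weekly_limit):
    """Extrapolate the full weekly quota from one measured run."""
    return tokens_consumed / (pct_of_weekly_limit / 100)

def subscription_value(weekly_tokens, quality_pct, monthly_price):
    """weekly × 4 × quality ÷ monthly price = quality tokens per $."""
    return weekly_tokens * 4 * (quality_pct / 100) / monthly_price

quota = weekly_quota(2_000_000, 6)  # 2M tokens used 6% -> ~33.3M/week
```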

📊 Our measurements

| Plan | Model | Tool | Date | Files | Task runs | Quota used | 5h window | Weekly est |
| Business | gpt-5.4 | Codex | 2026-04-24 | 6 | 4 | 5h:22% wk:3% | 5.0M | 37M |
| Claude Max 20× | Opus 4.7 | Claude | 2026-04-24 | 7 | 200 | 5h:0% wk:3% | — | 184M |
| Claude Pro | Opus 4.7 | Claude | 2026-04-24 | 3 | 20 | 5h:0% wk:8% | — | 16M |
| Plus | gpt-5.4 | Codex | 2026-04-24 | 9 | 9 | 5h:21% wk:3% | 28.2M | 198M |

Files = separate measurement sessions we've run on this plan (variance expected). Task runs = how many times the benchmark task ran within the best session shown. Quota used = how much of the 5-hour and weekly limits our test consumed (higher = more reliable extrapolation). Raw data →

Help improve this data. Run bash scripts/measure-codex-quota.sh on your plan and submit your results.

💰 API-equivalent value

Your subscription, priced at API rates

If you bought the same tokens directly through the provider's API, how much would your subscription's weekly quota cost? We can work this out because the measurement captures exact input / cached / output tokens alongside the quota consumed.

| Plan | Model measured | You pay | API-equivalent | Multiplier | Cache% |
| ChatGPT Plus | gpt-5.5 | $20/mo | $561/mo | 28.1× | 83% |
| ChatGPT Plus | gpt-5.4 | $20/mo | $547/mo | 27.3× | 86% |
| Claude Max 20× | Opus 4.7 | $200/mo | $2.10K/mo | 10.5× | 93% |
| Claude Pro | Opus 4.7 | $20/mo | $198/mo | 9.9× | 81% |
| Claude Max 20× | Sonnet 4.6 | $200/mo | $1.64K/mo | 8.2× | 92% |
| Claude Pro | Sonnet 4.6 | $20/mo | $162/mo | 8.1× | 77% |
| ChatGPT Business | gpt-5.5 | $30/mo | $197/mo | 6.6× | 67% |
| Claude Max 20× | Sonnet 4.5 | $200/mo | $936/mo | 4.7× | 100% |
| ChatGPT Business | gpt-5.4 | $30/mo | $117/mo | 3.9× | 88% |
How we calculated this

For each plan + model measurement, we extract from the CLI's --json output:

  • Non-cached input tokens — billed at full API input price
  • Cached input tokens — billed at 10% of input price for cache reads (the OpenAI & Anthropic standard; see caveat 4 for Anthropic's cache-write tier)
  • Output tokens — billed at full API output price
  • Percent of weekly quota consumed during the run

Extrapolate each token category to 100% weekly, multiply by 4.33 weeks/month, then apply the model's public API pricing:

monthly_api_cost = (
    non_cached_input_per_month × $input_rate
  + cached_input_per_month     × $input_rate × 0.10
  + output_per_month           × $output_rate
)
multiplier = monthly_api_cost ÷ subscription_price

Worked example — ChatGPT Plus on GPT-5.4:

One measurement session consumed 3% of the weekly quota with 5.90M input tokens (5.07M of them cached) and 30.5K output tokens. Scaling to 100% weekly and ×4.33 for monthly gives 119M non-cached input, 732M cached input, 4.4M output. At GPT-5.4's $2.50 input / $15 output rate:

119M × $2.50/M  = $298    (non-cached input)
732M × $0.25/M  = $183    (cached input)
4.4M × $15/M    = $66     (output)
─────────────────────────
                  $547/mo  at API pricing
÷ $20/mo subscription     = 27× multiplier
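The same arithmetic as runnable code (token counts and rates copied from the measurement above; it lands within a dollar or two of the $547 figure, which was computed from rounded intermediates):

```python
pct_of_week = 3          # the session consumed 3% of the weekly quota
weeks_per_month = 4.33

input_total  = 5.90e6    # all input tokens in the session
input_cached = 5.07e6    # of which cached
output       = 30.5e3

scale = (100 / pct_of_week) * weeks_per_month  # session -> full month

non_cached = (input_total - input_cached) * scale  # ≈ 119.8M tokens
cached     = input_cached * scale                  # ≈ 731.8M tokens
out        = output * scale                        # ≈ 4.4M tokens

# GPT-5.4 rates: $2.50/M input, 10% of that for cached, $15/M output
monthly_api_cost = (non_cached * 2.50 + cached * 0.25 + out * 15) / 1e6
multiplier = monthly_api_cost / 20  # vs the $20/mo Plus subscription
```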
Important caveats

1. High cache hit rates reflect real CLI usage, not a test artifact. Our task hits 67–93% cache reads. We tested whether this was inflated by repeating the same prompt: ran a session with unique nonces per call (CACHE_BUST=1). On Codex the cache rate barely moved (88% → 88%). On Claude Code it dropped from ~92% to ~77% — meaningful but still high. Most of what gets cached is the CLI's own system context (system prompt, tool definitions, prior turns), not the specific user task. Heavy CLI users will see cache rates roughly in this range too. Light users with very different prompts each session would see lower rates and worse subscription value than this multiplier suggests.

2. The multiplier compares CLI-via-subscription to API-with-caching. A developer building the same agent loop directly against the API can enable prompt caching and pay roughly what we calculate. So the comparison is "running this CLI through your subscription vs paying for the same workload via API with caching enabled". A naïve API user who doesn't enable caching would pay ~10× more on the cached portion, making the subscription look even better — but that's not a fair comparison.

3. The multiplier assumes you fully utilize the quota. If you only use 10% of your Plus quota per month, you're getting 10% of the multiplier — possibly worse value than API metered billing. The subscription wins if you're a heavy user.

4. Cache pricing convention. Anthropic exposes three input tiers — fresh input at full price, cache writes at 1.25× input price, cache reads at 0.10× input. OpenAI exposes two — fresh input and cached input at 0.10×. We bill all three Anthropic tiers correctly using the values reported by Claude Code's usage object. Cache writes at 1.25× input had been mislabelled as cache hits in earlier versions of our calculation; this was corrected on Apr 26, 2026, with measurable upward revisions to Claude multipliers.

5. Subscription unit economics aren't part of this. These numbers are "what API would cost", not "what it costs the provider to run". Providers likely price quotas to match expected real usage; the multiplier is a value comparison from the user's side, not a margin claim about the provider.

📋 Raw data

All the data, all the tables

📋 Raw data — all models, hardware, and subscriptions

🤖 Models Edit on GitHub ↗

Model · Provider · Release · Card

📊 Benchmarks Edit on GitHub ↗

Model · AA Intelligence · Arena Text ELO · Arena Code ELO · SWE-bench · Aider

🔌 API Pricing Edit on GitHub ↗

Model · Input $/M · Output $/M · tok/s · Source

🖥️ Local Performance Edit on GitHub ↗

Model · Hardware · Tok/s · Quant · VRAM · Source

💳 Subscriptions (⚠️ Estimated) Edit on GitHub ↗

Model · Plan · $/mo · Tok/week · Notes · Source

🔧 Hardware Edit on GitHub ↗

Hardware · Price · VRAM · Year · Source

📦 Use this data in your project

JSON data files

All model, hardware, and benchmark data is stored as open JSON files. Fetch them directly — no API key needed.

models.json · hardware.json · benchmarks.json

Attribution

Free to use under Apache 2.0. If you use the data or host a fork, please credit:

"Data from Best Value AI, supported by Desktop Commander"

Contribute

Missing a model? Have local benchmark data? Let your AI agent submit a PR with your hardware's performance data, or contribute manually.

Let your AI agent contribute data ↗

Best Value AI Models — 2026-04-26

Comparing 37 AI models across API pricing, subscriptions, and local hardware.

Best API value: GLM-4.7 Flash at $0.15/M blended — 2.8M quality-adjusted tokens per dollar.

Best subscription value: ChatGPT Pro $200 — 49.1M quality-adjusted tokens per dollar.

Best local value: Qwen3.5 35B A3B (Reasoning) (110 tok/s · RTX 3090) — 2.9M quality-adjusted tokens per dollar.

Top 10 AI Models by Value (Quality-Adjusted Tokens per Dollar)

  1. ChatGPT Pro $200 → GPT-5.4 (xhigh) (Subscription) — 49.1M tok/$ — ChatGPT Pro $200 · $200/mo · 📐 est.
  2. ChatGPT Pro $200 → GPT-5.2 (Subscription) — 43.9M tok/$ — ChatGPT Pro $200 · $200/mo · 📐 est.
  3. ChatGPT Pro $200 → GPT-5.2 Codex (xhigh) (Subscription) — 39.4M tok/$ — ChatGPT Pro $200 · $200/mo · 📐 est.
  4. ChatGPT Pro $200 → GPT-4o (Subscription) — 29.7M tok/$ — ChatGPT Pro $200 · $200/mo · 📐 est.
  5. ChatGPT Pro $200 → GPT-5.5 (xhigh) (Subscription) — 28.6M tok/$ — ChatGPT Pro $200 · $200/mo · 📐 est.
  6. ChatGPT Plus → GPT-5.4 (xhigh) (Subscription) — 24.5M tok/$ — ChatGPT Plus · $20/mo
  7. ChatGPT Pro $100 → GPT-5.4 (xhigh) (Subscription) — 24.5M tok/$ — ChatGPT Pro $100 · $100/mo · 📐 est.
  8. ChatGPT Plus → GPT-5.2 (Subscription) — 22.0M tok/$ — ChatGPT Plus · $20/mo
  9. ChatGPT Pro $100 → GPT-5.2 (Subscription) — 22.0M tok/$ — ChatGPT Pro $100 · $100/mo · 📐 est.
  10. ChatGPT Plus → GPT-5.2 Codex (xhigh) (Subscription) — 19.7M tok/$ — ChatGPT Plus · $20/mo

All Models Compared

Claude Opus 4, Claude Sonnet 4, DeepSeek V3.1, Gemini 2.5 Pro, GLM-4.7 Flash, GPT-OSS 120B, GPT-4o, Llama 3.1 70B, Llama 3.1 8B, o3, Qwen QwQ-32B, Claude Opus 4.5, Gemini 3 Pro, Gemini 3 Flash, GPT-5.2, Claude Opus 4.6 (Non-reasoning, High Effort), Claude Sonnet 4.6 (Non-reasoning, High Effort), Claude 4.5 Haiku (Non-reasoning), Claude 4.5 Sonnet (Non-reasoning), GPT-5.4 (xhigh), GPT-5.3 Codex (xhigh), GPT-5.2 Codex (xhigh), Gemini 3.1 Pro Preview, GLM-5 (Reasoning), GLM-4.7 (Reasoning), MiniMax-M2.5, MiniMax-M2.7, Kimi K2.5 (Reasoning), DeepSeek V3.2 (Non-reasoning), Qwen3.5 35B A3B (Reasoning), Qwen3.5 27B (Reasoning), Qwen3.5 122B A10B (Reasoning), Gemma 4 31B, Gemma 4 26B A4B, Claude Opus 4.7 (Non-reasoning, High Effort), Claude Opus 4.7 (Reasoning, Max Effort), GPT-5.5 (xhigh)