What's the best value AI model right now?

Use our ranked comparison to see all models sorted by quality-adjusted tokens per dollar — covering local hardware, subscriptions like ChatGPT Plus and Claude Pro, and API pricing from OpenAI, Anthropic, Google, and others.

Best AI subscription vs best local setup — which is better value?

Our head-to-head comparison tool automatically picks the best subscription and best local hardware setup and compares them on quality-adjusted tokens per dollar. You can also pick any two models and compare across API, subscription, and local.

Is AI getting cheaper or more expensive over time?

Our timeline chart tracks value per dollar by provider since 2023. Quality-adjusted token value has generally increased as competition drives prices down while model quality converges at the frontier.

LLM Value Calculator — Best AI Model for Your Money (Local vs API vs Subscription)

📐 How we calculate this

One number. Three categories.

Not all tokens are equal — a token from a smarter model is worth more. We multiply raw token volume by quality score and divide by cost. The result: quality-adjusted tokens per dollar — one metric that puts a $3,500 GPU, a $20/month plan, and a $0.07/M API on the same axis.

🖥️ Local Hardware

tok/s × hours/day × 365 × years × quality %

hardware cost ($) = quality tokens / $

One-time hardware cost amortized over 3 years. tok/s measured per model + hardware combo.

💳 Subscription

tokens/day limit × 365 × years × quality %

$/mo × 12 × years = quality tokens / $

⚠️ Token limits are estimated — providers don't publish exact numbers. Treat as directional.

🔌 API (pay per token)

1,000,000 tokens × quality %

price/M (input% in + output% out) = quality tokens / $

Adjustable input/output ratio — defaults to 75/25 for general use. Coding workloads are ~90/10 (long context, short output), chat is ~50/50, RAG is ~95/5.

Quality scores: Arena ELO + AA Intelligence Index — the two benchmarks that remain comparable across model generations. See the Benchmarks section for why other benchmarks can't fairly compare across eras.

📏 Why benchmarks are hard

Not all benchmarks age equally.

Comparing GPT-3.5 to GPT-5.4 using benchmark scores sounds simple. It isn't. The tests used to measure models in 2023 are mostly useless today — either saturated or discontinued. This matters for any long-term value comparison.

📈

Saturation

MMLU (2020) and HumanEval (2021) were rigorous tests when introduced. Today GPT-4 scores 87% on MMLU, GPT-5 scores ~90%. A 3% gap in a benchmark where the ceiling is 100% tells you almost nothing. The benchmark is broken as a signal, not the models.

MMLU — saturated HumanEval — saturated

🎯

Different tests, different eras

SWE-bench Verified launched in 2024. Aider Polyglot in 2024. GPQA Diamond in 2023. Models from 2022 were never measured on these. You can't compare GPT-3.5's MMLU score to GPT-5's SWE-bench score — they're measuring different things with different scales.

SWE-bench — 2024+ Aider — 2024+ GPQA — 2023+

🧪

Training contamination

Benchmarks become worthless once labs train on them. Questions leak into pretraining data, benchmark scores stop reflecting real capability. This is why new benchmarks are invented constantly — and why scores from 2022 are especially suspect.

GSM8K — contaminated Most 2022 benchmarks

✅ What actually works across time

Two signals have been collected continuously since 2023 using the same methodology. They measure different things, but together they give a consistent cross-generation quality score.

⚔️

Arena ELO since early 2023

lmarena.ai ↗

Humans pick which model response is better in blind head-to-head comparisons. The ELO rating system means GPT-3.5 and GPT-5.4 are measured on the exact same scale — not by what questions they answered, but by how humans prefer their outputs relative to each other.

✓ Same methodology since launch

✓ Covers general + coding separately

✓ Can't be trained on directly

~ Reflects human preference, not task accuracy

🔬

AA Intelligence Index since 2023

artificialanalysis.ai ↗

Artificial Analysis runs their own evaluations on every major model using consistent infrastructure and aggregates them into a single 0–100 composite. Unlike leaderboard scores that depend on who submitted, AA re-runs everything themselves on the same hardware.

✓ Independently run, not self-reported

✓ Composite — not reliant on a single test

✓ Covers models back to GPT-3.5 era

~ Methodology updates occasionally

How we combine them

Arena ELO → z-score (0–100) + AA Intelligence Index (0–100) ÷ 2 = Stable Quality Score

Both scores are z-score normalized: (score − mean) / std × 15 + 50, centering each at 50 on its own distribution. This ensures Arena ELO and AA Intelligence contribute equally to the average — without normalization, Arena's larger numbers would dominate. The two normalized scores are then averaged. If only one is available for a model, that single score is used. Task-specific benchmarks (SWE-bench, Aider, etc.) are shown in raw data but not used in the main value calculation — they can't fairly compare across model generations.

💳 How we measure subscription quotas

Providers don't publish token limits. We measure them.

Neither OpenAI nor Anthropic publish exact daily token quotas for subscriptions. Both use rolling 5-hour windows and weekly limits expressed as percentages — not absolute numbers. So we measure empirically.

1. Run a standardized task

We run the same coding task (doubly-linked list + 10 tests) through Codex CLI or Claude Code with --json output, which gives exact token counts per API turn — input, cached, output, and reasoning tokens.

2. Read quota before & after

The CLI's /status command shows your 5-hour and weekly limits as percentages. We record these before and after the task. The delta tells us what fraction of the quota our known token count consumed.

3. Calculate total quota

total_quota = tokens_consumed ÷ (pct_consumed / 100)
Example: 797K tokens consumed 12% of the 5h window → 5h quota ≈ 6.6M tokens. Weekly is the binding constraint → ~1.9M tokens/day for ChatGPT Plus.

Token counts include system prompt (~70K), cached input, reasoning overhead, and tool calls — not just user-visible output. Reasoning effort matters: xhigh uses 1.7× more tokens than medium for the same task. Full methodology →

Help improve this data. Run bash scripts/measure-codex-quota.sh on your plan and submit your results.

📊 Full data

Compare everything

Calculator

⚔️ Compare

Timeline

Raw data

⚠️ Subscription token limits are estimated — providers don't publish exact numbers. Local values depend on hardware selection.

Hours / day (local)

Amortization

Benchmark

In/Out ratio

Show

All options ranked by quality-adjusted tokens per dollar

#1 🖥️ Local Qwen3.5 35B A3B (Reasoning)

2.9M

110 tok/s · RTX 3090 (41.4%)

#2 🔌 API GLM-4.7 Flash

2.8M

$0.15/M blended (43.1%)

#3 🔌 API Gemma 4 31B

2.3M

$0.21/M blended (47.9%)

#4 🖥️ Local Gemma 4 31B

2.1M

21.9 tok/s · MacBook Pro 2019 i9 64GB (full system) (47.9%)

#5 🖥️ Local Qwen3.5 35B A3B (Reasoning)

2.0M

122 tok/s · RTX 4090 (41.4%)

#6 💳 Sub ChatGPT Plus → GPT-5.4 (xhigh)

1.9M

ChatGPT Plus · $20/mo (65.7%)

#7 💳 Sub ChatGPT Plus → GPT-5.2

1.7M

ChatGPT Plus · $20/mo (58.7%)

#8 🖥️ Local Qwen3.5 27B (Reasoning)

1.5M

48 tok/s · RTX 3090 (49.5%)

#9 💳 Sub ChatGPT Plus → GPT-5.2 Codex (xhigh)

1.5M

ChatGPT Plus · $20/mo (52.7%)

#10 🖥️ Local Qwen3.5 35B A3B (Reasoning)

1.5M

65 tok/s · Mac M3 Max 64GB (41.4%)

#11 🔌 API DeepSeek V3.2 (Non-reasoning)

1.4M

$0.32/M blended (45.6%)

#12 💳 Sub Claude Max 20× → Claude Opus 4.6 (Non-reasoning, High Effort)

1.3M

Claude Max 20× · $200/mo (68.2%)

#13 🖥️ Local Gemma 4 26B A4B

1.3M

48 tok/s · RTX 3090 (40.6%)

#14 💳 Sub Claude Max 20× → Claude Opus 4.5

1.2M

Claude Max 20× · $200/mo (60.6%)

#15 💳 Sub ChatGPT Plus → GPT-4o

1.1M

ChatGPT Plus · $20/mo (39.7%)

#16 🖥️ Local Qwen3.5 27B (Reasoning)

1.1M

58 tok/s · RTX 4090 (49.5%)

#17 🔌 API MiniMax-M2.7

1.1M

$0.52/M blended (57.3%)

#18 🖥️ Local Gemma 4 26B A4B

1.1M

85 tok/s · RTX 5090 (40.6%)

#19 🖥️ Local Llama 3.1 8B

1.0M

216 tok/s · RTX 4090 (12.1%)

#20 💳 Sub Claude Max 20× → Claude Opus 4

1.0M

Claude Max 20× · $200/mo (52.3%)

Model A

Model B

Value over time by provider

🤖 Models Edit on GitHub ↗

Model	Provider	Release	Card

📊 Benchmarks Edit on GitHub ↗

Model	AA Intelligence	Arena Text ELO	Arena Code ELO	SWE-bench	Aider

🔌 API Pricing Edit on GitHub ↗

Model	Input $/M	Output $/M	tok/s	Source

🖥️ Local Performance Edit on GitHub ↗

Model	Hardware	Tok/s	Quant	VRAM	Source

💳 Subscriptions (⚠️ Estimated) Edit on GitHub ↗

Model	Plan	$/mo	Tok/day	Confidence	Notes	Source

🔧 Hardware Edit on GitHub ↗

Hardware	Price	VRAM	Year	Source

Best Value AI Models — 2026-04-14

Comparing 34 AI models across API pricing, subscriptions, and local hardware.

Best API value: GLM-4.7 Flash at $0.15/M blended — 2.8M quality-adjusted tokens per dollar.

Best subscription value: ChatGPT Plus — 1.9M quality-adjusted tokens per dollar.

Best local value: Qwen3.5 35B A3B (Reasoning) (110 tok/s · RTX 3090) — 2.9M quality-adjusted tokens per dollar.

Top 10 AI Models by Value (Quality-Adjusted Tokens per Dollar)

Qwen3.5 35B A3B (Reasoning) (Local) — 2.9M tok/$ — 110 tok/s · RTX 3090
GLM-4.7 Flash (API) — 2.8M tok/$ — $0.15/M blended
Gemma 4 31B (API) — 2.3M tok/$ — $0.21/M blended
Gemma 4 31B (Local) — 2.1M tok/$ — 21.9 tok/s · MacBook Pro 2019 i9 64GB (full system)
Qwen3.5 35B A3B (Reasoning) (Local) — 2.0M tok/$ — 122 tok/s · RTX 4090
ChatGPT Plus → GPT-5.4 (xhigh) (Subscription) — 1.9M tok/$ — ChatGPT Plus · $20/mo
ChatGPT Plus → GPT-5.2 (Subscription) — 1.7M tok/$ — ChatGPT Plus · $20/mo
Qwen3.5 27B (Reasoning) (Local) — 1.5M tok/$ — 48 tok/s · RTX 3090
ChatGPT Plus → GPT-5.2 Codex (xhigh) (Subscription) — 1.5M tok/$ — ChatGPT Plus · $20/mo
Qwen3.5 35B A3B (Reasoning) (Local) — 1.5M tok/$ — 65 tok/s · Mac M3 Max 64GB

All Models Compared

Claude Opus 4, Claude Sonnet 4, DeepSeek V3.1, Gemini 2.5 Pro, GLM-4.7 Flash, GPT-OSS 120B, GPT-4o, Llama 3.1 70B, Llama 3.1 8B, o3, Qwen QwQ-32B, Claude Opus 4.5, Gemini 3 Pro, Gemini 3 Flash, GPT-5.2, Claude Opus 4.6 (Non-reasoning, High Effort), Claude Sonnet 4.6 (Non-reasoning, High Effort), Claude 4.5 Haiku (Non-reasoning), Claude 4.5 Sonnet (Non-reasoning), GPT-5.4 (xhigh), GPT-5.3 Codex (xhigh), GPT-5.2 Codex (xhigh), Gemini 3.1 Pro Preview, GLM-5 (Reasoning), GLM-4.7 (Reasoning), MiniMax-M2.5, MiniMax-M2.7, Kimi K2.5 (Reasoning), DeepSeek V3.2 (Non-reasoning), Qwen3.5 35B A3B (Reasoning), Qwen3.5 27B (Reasoning), Qwen3.5 122B A10B (Reasoning), Gemma 4 31B, Gemma 4 26B A4B

LLM Value
Calculator

🖥️ Local Hardware

💳 Subscription

🔌 API (pay per token)

Saturation

Different tests, different eras

Training contamination

1. Run a standardized task

2. Read quota before & after

3. Calculate total quota

All options ranked by quality-adjusted tokens per dollar

Value over time by provider

🤖 Models Edit on GitHub ↗

📊 Benchmarks Edit on GitHub ↗

🔌 API Pricing Edit on GitHub ↗

🖥️ Local Performance Edit on GitHub ↗

💳 Subscriptions (⚠️ Estimated) Edit on GitHub ↗

🔧 Hardware Edit on GitHub ↗

📦 Use this data in your project

JSON data files

Attribution

Contribute

Best Value AI Models — 2026-04-14

Top 10 AI Models by Value (Quality-Adjusted Tokens per Dollar)

All Models Compared

LLM ValueCalculator

🖥️ Local Hardware

💳 Subscription

🔌 API (pay per token)

Saturation

Different tests, different eras

Training contamination

1. Run a standardized task

2. Read quota before & after

3. Calculate total quota

All options ranked by quality-adjusted tokens per dollar

Value over time by provider

🤖 Models Edit on GitHub ↗

📊 Benchmarks Edit on GitHub ↗

🔌 API Pricing Edit on GitHub ↗

🖥️ Local Performance Edit on GitHub ↗

💳 Subscriptions (⚠️ Estimated) Edit on GitHub ↗

🔧 Hardware Edit on GitHub ↗

📦 Use this data in your project

JSON data files

Attribution

Contribute

Best Value AI Models — 2026-04-14

Top 10 AI Models by Value (Quality-Adjusted Tokens per Dollar)

All Models Compared

LLM Value
Calculator