
LumintBench Whitepaper — April 2026

The Cost of Reasoning: Which LLMs Actually Deliver Under Production Constraints?

158 models. 15 labs. One deterministic logic benchmark — measuring accuracy, cost, latency, and reliability.

158
Models Tested
15
Labs
5
Perfect Scorers
198×
Cost Gap (Same Score)
00 — Why We Built This

Grounded in Investment Operations

Lumint is an outsourced derivatives overlay and technology provider, supporting trades and transactions across many major institutions each day. AI and machine learning are not aspirational for us — they are active infrastructure, deployed to solve the problems that make financial data uniquely hard: automated data quality checks, complex security hierarchies, asset classification, counterparty validation, and the dependency chains that define investment operations at scale.

LumintBench was not built as an academic exercise. It was built because we needed a rigorous way to evaluate AI and LLM reasoning for the specific domain we operate in — and because the publicly available benchmarks weren't measuring the right things for us. The tasks that matter in investment operations require multi-step logic, comparative analysis, hierarchical evaluation, and constraint satisfaction — evaluated not just for accuracy, but across the operational dimensions that define viability in production: cost, latency, and consistency. Those are the reasoning patterns LumintBench is designed to test.

We share these results publicly because the fiduciary duty to investors doesn't stop at any one firm's door. The better our industry collectively evaluates and deploys AI in trading and investment operations, the better off the investors we all serve will be.

01 — Introduction

Accuracy Is Not Enough

Most LLM benchmarks test a single dimension: can the model get the right answer? That matters, of course — but it's an incomplete picture for any team deploying language models in production. A model that scores perfectly but costs ten times more per query, takes five times longer to respond, or intermittently returns unparseable output isn't the right choice for legal analysis, financial modeling, or multi-step decision support.

LumintBench was designed to address this gap. It evaluates LLM reasoning capability using custom logic puzzles that require multi-step deduction across interrelated constraints. Every field in a model's response is independently verified, yielding a fine-grained view of both reasoning quality and output reliability. But critically, it also tracks cost, latency, and error rate — the dimensions that determine whether a model is viable in production or merely impressive in a demo.

This whitepaper presents the results of our April 2026 evaluation: 158 models from 15 labs, tested across three difficulty tiers and four reasoning-effort levels. The findings reveal a market where raw accuracy has converged at the top — but cost efficiency, speed, and reliability still vary by orders of magnitude.

02 — Methodology

How LumintBench Works

This section underpins everything that follows. The scoring structure — difficulty tiers, per-field penalties, and failure normalization — determines what the numbers in every chart actually mean. It is worth reading carefully before diving into the results.

LumintBench poses deterministic logic puzzles at three difficulty levels:

  • Easy — identify a single element from the solution.
  • Medium — identify one complete set satisfying all constraints.
  • Hard — identify all valid sets; demands sustained multi-step reasoning without error accumulation.

Each field in a model's answer receives a score of +1 (correct), 0 (valid but wrong), or −1 (hallucinated — not a valid puzzle value). This penalty means a model that confidently hallucinates scores worse than one that simply guesses wrong. Scores are normalized against the maximum possible across all attempts, including failures: if a model returns garbage on one of three attempts, that zero counts against the denominator.
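The scoring rule above can be sketched in a few lines. This is an illustrative reimplementation, not the benchmark's actual harness; the field values and attempt counts are made up for the example.

```python
def score_field(answer, solution, valid_values):
    """Score one answer field: +1 correct, 0 valid-but-wrong, -1 hallucinated."""
    if answer == solution:
        return 1
    if answer in valid_values:
        return 0
    return -1  # not a valid puzzle value at all: hallucination penalty

def normalize(attempt_totals, fields_per_attempt, n_attempts):
    """Normalize against the maximum possible over ALL attempts, so a
    failed attempt (total 0) still counts against the denominator."""
    return sum(attempt_totals) / (fields_per_attempt * n_attempts)

# Three attempts of four fields; the third attempt fails outright (0 points).
# Attempt totals: 4 (all correct), 1 (a hallucination offsets a correct field), 0.
print(normalize([4, 1, 0], fields_per_attempt=4, n_attempts=3))  # 5/12 ≈ 0.417
```

Note how the hallucination penalty and the failure-inclusive denominator interact: a model cannot improve its normalized score by erroring out instead of answering, and it actively loses ground by inventing values.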

Reasoning effort levels

Models supporting extended reasoning are tested at four effort levels: none, low, medium, and high. Models without reasoning support are tested at none only. This yields a maximum possible score of 444 for reasoning-enabled models and 111 for non-reasoning models, enabling direct comparison of how additional "thinking time" affects accuracy.

Infrastructure

All models are accessed via the OpenRouter API, providing a unified interface with consistent cost and latency tracking per request. Only non-free models with structured output support are included.
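As a sketch of what one benchmark request might look like through OpenRouter's OpenAI-compatible chat completions endpoint — the answer schema, model slug, and prompt here are illustrative assumptions, not LumintBench's actual harness:

```python
import json

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model, puzzle_prompt, effort):
    """Build one request body. Only models supporting structured output
    are benchmarked; the JSON schema below is a placeholder example."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": puzzle_prompt}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "puzzle_answer",
                "schema": {
                    "type": "object",
                    "properties": {"sets": {"type": "array",
                                            "items": {"type": "string"}}},
                    "required": ["sets"],
                },
            },
        },
    }
    if effort is not None:
        # OpenRouter's unified reasoning control; omitted for "none"
        # and for models without extended-reasoning support.
        body["reasoning"] = {"effort": effort}
    return body

req = build_request("example-lab/example-model", "Solve the puzzle...", "high")
print(json.dumps(req)[:40])
```

Routing every model through one gateway is what makes the per-request cost and latency figures in this report directly comparable across labs.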

03 — The Leaderboard

Overall Rankings: Who Leads on Raw Accuracy

Five models achieved a perfect 444/444 score across all difficulty levels and reasoning efforts. But the cost to get there differs dramatically — from $0.14 to $28.59 for the same benchmark run.

Figure A
Top 20 Models by Total Score
Bars represent total score out of 444 (max). Labels show cost for the full benchmark run. Green = open-weight, blue = closed.
(Table columns: rank, model, lab, type, score, cost, latency, errors.)
Key finding: The top of the leaderboard has converged. Eight models from four different labs scored 441 or above. The meaningful differentiation now lies in cost, speed, and reliability — not raw accuracy.
New Model Spotlight — April 2026

How the Latest Frontier Releases Stack Up

Models released in the past 60 days, evaluated on the same benchmark as all other entries. Context for how the newest releases fit into the broader field.

04 — Score vs. Cost

The Efficiency Frontier: How Much Does a Point Cost?

Perhaps the most actionable dimension in model selection is cost efficiency. Among the five perfect scorers, Qwen's open-weight qwen3.5-flash-02-23 achieved 444/444 for just $0.14 — while OpenAI's gpt-5.2-pro reached the same score for $28.59. That's a 198× cost difference for identical reasoning performance.

Figure B
Score vs. Total Cost (All Models with Score > 50)
Point color distinguishes open-weight from closed/API-only models. The x-axis (cost) uses a log scale; hover over a point for model details.
3,083 pts/$
Best Score-per-Dollar (Perfect Scorer)
Qwen qwen3.5-flash-02-23 — 444 points for $0.14. Open-weight.
$88.56
Most Expensive Benchmark Run
OpenAI gpt-5.4-pro scored 432/444 — not even a perfect score at 632× the cost of the cheapest perfect scorer.
198×
Cost Gap Among Perfect Scorers
Same score (444), wildly different price tags: $0.14 vs. $28.59.
Open Wins
Efficiency Leaderboard Dominated by Open-Weight
Open-weight models occupy the majority of top cost-efficiency slots.
Figure C
Top 12 Models by Score-per-Dollar (Minimum 200 Points)
Score-per-dollar = total score ÷ total cost in USD. Higher is better. A 200-point minimum score floor is applied — without it, a model that costs almost nothing but scores nearly zero would appear at the top of the ranking despite being useless in practice.
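The ranking rule in Figure C is simple to state in code. This is a minimal sketch with made-up model names and figures, purely to show how the score floor changes the outcome:

```python
def efficiency_ranking(models, min_score=200):
    """Rank by score-per-dollar, applying the minimum-score floor so a
    near-free but near-useless model cannot top the list."""
    eligible = [m for m in models if m["score"] >= min_score]
    return sorted(eligible, key=lambda m: m["score"] / m["cost"], reverse=True)

models = [
    {"name": "open-flash",   "score": 440, "cost": 0.20},   # 2,200 pts/$
    {"name": "frontier-pro", "score": 444, "cost": 28.00},  # ~15.9 pts/$
    {"name": "tiny-budget",  "score": 30,  "cost": 0.01},   # 3,000 pts/$, filtered out
]
print([m["name"] for m in efficiency_ranking(models)])
# ['open-flash', 'frontier-pro']
```

Without the floor, "tiny-budget" would rank first despite solving almost nothing — the same failure mode the caption above warns about.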
05 — Latency

Speed Matters: Who Answers Fastest?

For interactive applications, latency is a hard constraint. A model that takes four minutes to respond isn't viable for real-time decision support, regardless of accuracy. Among models scoring above 300, response times range from 13 seconds to over 7 minutes.

Figure D
Average Latency vs. Score (Models Scoring 300+)
Average response time in seconds across all difficulty levels and reasoning efforts. Lower is better.
Speed champion: Anthropic's claude-opus-4.6-fast averaged just 12.9 seconds per response while scoring 367/444. Google's Gemini 3.1 Pro Preview hit 443/444 in 33.3 seconds. The fastest perfect scorer was Grok 4.20 Multi-Agent Beta at 43.5 seconds.
06 — Reliability & Error Rates

Can You Actually Depend on It?

Reliability might be the most underappreciated dimension in model selection. A model that returns unparseable output, triggers API errors, or silently hallucinates creates real operational risk. LumintBench's scoring penalizes unreliable models directly: failed attempts count as zero against the total denominator, and hallucinated values receive negative scores.

Among the top 20 scorers, 14 achieved a 0% error rate. But several high-scoring models — including Google's Gemini 2.5 Pro variants — showed error rates of 11–19%, meaning as many as one in five attempts failed to produce a usable response. For a production system processing thousands of queries, that reliability gap compounds into significant operational cost.
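How that gap compounds is straightforward to estimate. Assuming independent failures and a simple retry-until-success policy (both simplifying assumptions, and the query volume and per-query cost below are illustrative):

```python
def expected_attempts(error_rate):
    """Expected API calls per usable response, assuming independent retries."""
    return 1.0 / (1.0 - error_rate)

def monthly_retry_overhead(queries, error_rate, cost_per_query):
    """Extra spend caused purely by retried attempts."""
    extra_calls = queries * (expected_attempts(error_rate) - 1.0)
    return extra_calls * cost_per_query

# A 19.4% error rate means ~1.24 calls per usable answer. At 100k
# queries/month and $0.01/query (illustrative), retries alone add ~$241/month,
# before counting the latency and engineering cost of the retry logic itself.
print(round(expected_attempts(0.194), 2))
print(round(monthly_retry_overhead(100_000, 0.194, 0.01)))
```

The dollar figure is usually the smaller cost; the fallback logic, timeout handling, and tail latency that retries impose on the surrounding system tend to matter more.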

19.4%
Highest Error Rate in Top 30
Google gemini-2.5-pro-preview-05-06 scored 385 but failed on nearly 1 in 5 attempts.
14 / 20
Zero-Error Models in Top 20
The majority of top performers were perfectly reliable — errors are not a prerequisite for performance.
07 — Open-Weight Models

The Open-Weight Contenders

Open-weight models have closed the gap with proprietary frontier models — and in several cases, surpassed them on the metrics that matter most for deployment. Qwen's open-weight qwen3.5-flash-02-23 tied for the highest score of any model tested (444/444) at a fraction of the cost of any closed competitor. In fact, four of the five perfect scorers are open-weight Qwen models whose weights are publicly available on Hugging Face — the "plus" and "flash" branding refers to Alibaba Cloud's managed API hosting of those same open models.

Figure F
Top 12 Open-Weight Models
Open-weight models ranked by total score. Cost shown as bar annotation.

The Qwen family dominates open-weight results, placing seven models in the top ten. DeepSeek's v3.2-speciale model scored 333 but at a steep latency cost (1,200+ seconds average). Xiaomi's mimo-v2-flash offers a strong budget option at 286 points for just $0.13. Meta's Llama 4 Maverick, despite its prominence, scored only 47 — suggesting the model's structured output capabilities lag behind its general chat performance.

08 — Lab Comparison

Best-in-Class by Lab

Each major AI lab has a different performance ceiling on LumintBench. Here's how their flagship models compare head-to-head.

Figure G
Best Model per Lab — Score, Cost, and Latency
Showing each lab's highest-scoring model. Bar height = score; labels show cost.
(Table columns: lab, best model, score, % of max, cost, average latency, error rate.)

Provider-level view: reliability varies dramatically by lab

Zooming out from individual models to lab-level aggregates reveals a different picture. Some labs have low overall error rates because most of their models return clean structured output. Others show high aggregate error rates — often because several models in their portfolio failed every attempt (API unavailability, structured output incompatibility, etc.). A provider-level average can mask both strong and weak individual model configurations, so this chart should be read alongside the per-model data above.

Figure H
API + Parse Error Rate by Lab (All Tested Attempts)
Error rate = (API errors + parse errors) / total attempts across all models tested per lab. Labs with models that failed every attempt (e.g., endpoint not found) inflate the aggregate. Do not reject a strong individual model based on its lab average.
09 — The Medium Paradox

Why Models Score Higher on Hard Than Medium

One of the most counterintuitive findings in this evaluation: nearly half of all models (45%) scored a higher percentage on hard questions than on medium questions. This isn't a marginal effect — some models show a gap of 50 or more percentage points in favor of the harder task.

Figure I
Medium Accuracy (%) vs. Hard Accuracy (%) — All Models
Each dot is one model. Points above the diagonal line score better on hard than medium. Hover for details.

The pattern holds across labs and model families. It isn't a quirk of one architecture — it appears to reflect something more general about how LLMs approach scoped vs. exhaustive tasks.

A working hypothesis: full-solve self-correction

We don't have visibility into the models' internal reasoning chains, so we can't prove exactly why this happens. But the pattern is consistent with a plausible hypothesis about how LLMs approach scoped vs. exhaustive tasks.

Hard questions require the model to identify all sets of elements in the solution. If the model is effectively solving the entire constraint puzzle to do this, each constraint it satisfies would narrow the remaining solution space, and errors in one set would create contradictions with others. The completeness requirement may force cross-validation — with the model essentially correcting its own work as a byproduct of solving exhaustively.

Medium questions, by contrast, require only one set. If models are scoping their reasoning to the question as asked — attempting to identify the requested set directly rather than solving the full puzzle and extracting the relevant subset — they would lose the cross-validation benefit of full constraint propagation. In this reading, they're being efficiently lazy, and it's costing them accuracy.

Figure J
Largest Hard-vs-Medium Gaps (Models Scoring 100+)
Gap = Hard% − Medium%. Positive values mean the model performed better on the harder task. Only models scoring 100+ total points shown. Negative medium accuracy (e.g. deepseek-v3.2 at −18%) reflects net hallucination penalty — more wrong/invalid values than correct ones.
The human comparison (hypothesis): A human solving a constraint puzzle would almost certainly complete the entire solution matrix regardless of what's being asked — because that's how constraint puzzles work. You solve the whole grid, then read off the answer. If LLMs are instead optimizing for the question scope rather than the solving scope, that gap would explain the pattern we observe. We can't confirm this is what's happening inside the models, but the data is consistent with it.

Possible implication for production: ask for everything, then extract

"Sometimes more is more."
Asking for the full solution — then extracting the part you need — can outperform a narrowly scoped prompt by 30–70 percentage points on the same problem.

If our hypothesis is correct, it carries a direct, actionable lesson for teams using LLMs in production reasoning tasks. When a problem has an interconnected solution structure — legal analysis with multiple dependent clauses, financial models with cross-referencing constraints, planning problems with interrelated variables — it may be better to ask the model for the full solution rather than only the subset you need.

The idea: ask the model to solve the complete problem. Give it the full scope. Let it work through all the constraints, potentially benefiting from the self-correcting dynamics of exhaustive solving. Then, in a downstream step, extract the specific subset of the solution relevant to your use case.

This is the prompt engineering equivalent of a SQL developer who computes a full join and then filters, rather than trying to write a perfectly scoped query that misses edge cases. The extra tokens are cheap compared to the potential accuracy gain. The data from this benchmark suggests the difference can be 30–70 percentage points of accuracy on the same underlying problem — simply by changing how much of the solution you ask the model to produce. This is worth testing in your own domain.
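The pattern can be made concrete as a prompting sketch. The prompt wording, the answer structure, and the filter are all illustrative assumptions — the point is the shape: one exhaustive ask to the model, then a cheap deterministic extraction in code.

```python
def narrow_prompt(puzzle):
    # Scoped ask: the model may skip full constraint propagation.
    return f"{puzzle}\n\nReturn only the one set that satisfies all constraints."

def exhaustive_prompt(puzzle):
    # Full-solve ask: pushes the model to cross-validate the whole solution.
    return f"{puzzle}\n\nSolve the complete puzzle and return every valid set."

def extract_subset(full_solution, predicate):
    """Downstream extraction: deterministic, token-free, done locally."""
    return [s for s in full_solution if predicate(s)]

# Usage: request the exhaustive solution from the model, then filter locally.
full = [{"id": 1, "color": "red"}, {"id": 2, "color": "blue"}]
print(extract_subset(full, lambda s: s["color"] == "blue"))
```

The extraction step costs nothing and cannot hallucinate; only the generation step carries model risk, which is exactly where the full-solve framing appears to help.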

45%
Models Scoring Better on Hard
Nearly half of all tested models achieved higher accuracy on the exhaustive task than the partial one.
+72 pts
Largest Single Gap
OpenAI's gpt-5.1-codex-mini at high effort: 28% on medium, 100% on hard.
Give it all
Production Advice
Ask for the full solution, then extract the subset you need. The extra tokens are cheaper than the accuracy loss.
Scope ≠ solve
Core Insight
LLMs appear to optimize for the question scope, not the optimal solving scope. The narrower the ask, the less self-correction occurs.
10 — The Overthinking Effect

When Reasoning Effort Hurts Performance

The Medium Paradox reveals what happens when models under-scope their solving. The Overthinking Effect reveals the opposite: what happens when models over-apply reasoning to tasks that don't require it. For 24% of models tested across multiple reasoning effort levels, "none" reasoning outperformed "high" reasoning on easy and medium questions.

Figure K
Easy+Medium Accuracy: "None" vs. "High" Reasoning Effort
Each bar pair shows a model's combined easy+medium accuracy at "none" vs. "high" reasoning effort. Only models where "none" outperformed "high" are shown. Sorted by gap size. A missing "high" bar (e.g. gpt-5-nano, gpt-5.1-codex) means the model scored exactly 0% at high effort — the bar exists but has no height.

The most extreme case is OpenAI's gpt-5.1, which scored 42.9% at no reasoning but dropped to −33.3% (net negative from hallucinations) at high effort on easy+medium questions. GPT-5-nano went from 61.9% to 0%. These aren't rounding errors — they represent significant degradation when extended reasoning is applied to simpler tasks.

What makes this pattern especially notable is that the same models often benefit from high reasoning effort on hard questions. OpenAI's o1 dropped 24 percentage points on easy+medium when switching from none to high — but gained 26 points on hard. Qwen's qwen3.5-9b lost 24 points on easy+medium but gained 33 points on hard. One plausible interpretation: extended reasoning adds complexity that helps with genuinely complex problems but can introduce overthinking, second-guessing, or hallucination on tasks that are more straightforward. We flag this as an observed pattern worth investigating, not a confirmed mechanism.

Possible production implication: match reasoning effort to task complexity

If this pattern holds in your domain, a single reasoning effort setting across all query types may be suboptimal. The data suggests exploring a tiered approach: route simpler queries to lower (or no) reasoning effort, and reserve high-effort reasoning for complex, multi-step tasks. This should be validated on your own workloads, as the effect size and direction may vary by model, domain, and prompt structure.
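A tiered approach can be as simple as a lookup from estimated task complexity to an effort setting. The tiers and mappings below are assumptions to be tuned against your own workload, not values derived from this benchmark:

```python
def choose_effort(complexity):
    """Map estimated task complexity to a reasoning-effort setting.
    Returning None means 'no extended reasoning', which sidesteps the
    overthinking effect observed on simpler tasks."""
    table = {
        "simple": None,
        "moderate": "low",
        "complex": "high",  # reserve expensive thinking for multi-step tasks
    }
    return table[complexity]

print(choose_effort("simple"), choose_effort("complex"))
```

The hard part in practice is the classifier that produces the complexity label — heuristics, query length, or a cheap model can all serve, and misclassification costs differ per tier, so the thresholds deserve their own evaluation.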

The dual pattern: The Medium Paradox and the Overthinking Effect suggest two sides of the same coin. The data is consistent with a picture where under-scoping hurts (models do better when forced to solve exhaustively) and over-thinking hurts (high reasoning effort degrades performance on simpler tasks). If this holds, the optimal strategy may be to ask the model to solve exhaustively — but at a reasoning effort level calibrated to the actual difficulty of the task. We recommend testing this in your specific deployment context.
11 — Token Usage

More Tokens ≠ Better Answers

One common assumption is that models producing more tokens — especially more reasoning tokens — will perform better. The data tells a more nuanced story.

Figure L
Average Completion Tokens vs. Score (Models Scoring 200+)
Each point is one model. Size indicates total cost.

Several top-performing models are remarkably token-efficient. Google's Gemini 3.1 Pro Preview scored 443 with an average of ~3,800 completion tokens, while xAI's Grok 4.20 Multi-Agent Beta used ~31,500 tokens for the same score. Both reached near-perfection, but one used 8× more tokens to get there. For cost-sensitive deployments where you pay per token, this difference is decisive.

12 — Takeaways

Conclusions for Production Teams

1. Raw accuracy has converged at the frontier

Eight models from four labs scored above 441/444. If your only criterion is "can it solve hard reasoning problems," you have many options. The differentiators are now cost, speed, and reliability.

2. Cost efficiency varies by orders of magnitude

The same perfect score costs anywhere from $0.14 to $28.59 — a 198× range. For high-volume deployments, this is the difference between feasibility and budget exhaustion. Qwen's open-weight flash model is the clear efficiency champion.

3. Open-weight models are production-ready — and leading

Four of five perfect scorers are open-weight Qwen models. Open models dominate the cost-efficiency frontier. For teams that can self-host or use cost-effective inference providers, the economic case for open-weight is overwhelming.

4. Reliability is a hidden cost

Several models that score well on accuracy have error rates of 10–20%, meaning substantial retries and fallback logic in production. Choosing a model with 0% errors at a slightly lower score may be the better engineering decision.

5. Latency and accuracy don't have to trade off

Google's Gemini 3.1 Pro Preview scored 443/444 with an average latency of 33 seconds. Anthropic's Claude Opus 4.6 scored 372 in just 36 seconds. Speed and accuracy coexist — you just need to know where to look.

6. Expensive ≠ better

OpenAI's gpt-5.4-pro cost $88.56 and scored 432. Qwen's qwen3.6-plus scored 444 for $0.59. The most expensive model didn't even achieve a perfect score. Price is not a proxy for quality.

7. Ask for the complete solution, not the subset

The Medium Paradox suggests that models reason more accurately when required to solve exhaustively. In production, framing prompts to request the full solution — then extracting the relevant subset downstream — can yield dramatically better accuracy at minimal additional token cost.

8. Match reasoning effort to task complexity

The Overthinking Effect shows that high reasoning effort can actively degrade performance on simpler tasks — with drops of 60+ percentage points in extreme cases. Production systems handling mixed-complexity workloads should consider routing easier queries to lower reasoning effort and reserving extended thinking for genuinely complex problems.

Model selection framework for production reasoning

The right model depends on the workload. Rather than picking a single "best" model, consider mapping your use cases to different selection criteria.

Use Case | Primary Criteria | Candidate Families | Implementation Note
Interactive expert assistant | High score, low latency, zero parse errors | Gemini 3.1 Pro, Claude Opus fast, Grok 4.1 Fast | Watch premium pricing and provider routing variance
High-throughput structured reasoning | Low cost, stable score, parseable JSON | Qwen3.5 Flash, Grok 4.1 Fast, Qwen3 variants | Monitor output length and cap reasoning effort
Regulated / self-hosted workflow | Open-weight, auditable deployment | Qwen3-30B-A3B, Qwen3.5-35B-A3B, DeepSeek V3 | Verify licenses; reproduce latency on target hardware
Batch deep analysis | Score first, latency second | GPT-5.2, Claude Opus 4.7, Gemini Pro, Kimi K2.6 | Use queues, retries, and cost budgets
Escalation architecture | Cheap first pass, premium fallback | Low-cost high-score model + frontier reviewer | Define escalation triggers, not just model rankings

The escalation pattern

One of the clearest implications of this data is that a tiered architecture often beats a single-model deployment. Use an inexpensive, high-accuracy model (Qwen3.5 Flash, Grok 4.1 Fast) for first-pass reasoning. When confidence is low, the task is high-value, or governance requirements demand it, escalate to a premium frontier model. This approach captures most of the cost savings of cheap models while retaining the quality ceiling of expensive ones. Define escalation triggers based on task complexity, confidence scores, or business value — not just model rankings.
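A minimal sketch of the escalation pattern, assuming each model tier is a callable returning an answer plus a confidence estimate — the names, the confidence signal, and the 0.8 threshold are all illustrative placeholders:

```python
def answer_with_escalation(query, cheap_model, premium_model,
                           confidence_threshold=0.8):
    """Cheap first pass; escalate to the premium model on low confidence.
    Real triggers might also include task value, governance flags, or
    parse failures rather than confidence alone."""
    answer, confidence = cheap_model(query)
    if confidence >= confidence_threshold:
        return answer, "first_pass"
    answer, _ = premium_model(query)
    return answer, "escalated"

# Stand-in model tiers for illustration.
cheap = lambda q: ("maybe", 0.55)
premium = lambda q: ("definitely", 0.97)
print(answer_with_escalation("q", cheap, premium))
```

The economics follow from the escalation rate: if only a small fraction of queries cross the threshold, the blended cost stays close to the cheap model's while the quality ceiling stays close to the premium model's.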

Bottom line: The LLM market in April 2026 offers genuine choice at the frontier of reasoning capability. The winning model depends on your constraints. If cost is king, Qwen's open-weight flash models are unmatched. If latency matters most, Google's Gemini 3.1 Pro or Anthropic's Claude Opus 4.6 are the fastest high-scorers. If you need zero errors and top accuracy, several models from Qwen, OpenAI, xAI, and Google deliver flawlessly. And regardless of which model you choose — frame your prompts for exhaustive solving, not narrow extraction, and calibrate reasoning effort to the actual complexity of the task.
13 — About & Limitations

About LumintBench

LumintBench is a reasoning benchmark that evaluates LLMs using custom, deterministic logic puzzles requiring multi-step deduction. It is designed for machine scoring with no ambiguity in correct answers. The benchmark captures four dimensions — accuracy, cost efficiency, latency, and reliability — to support genuinely informed model selection for production workloads.

Limitations and interpretation

LumintBench measures a specific form of multi-step deductive reasoning. It is not a universal proxy for all enterprise tasks. The dataset contains results from multiple survey dates in April 2026. Provider routing, pricing, and model behavior may change after the observed runs. OpenRouter latency and cost reflect the accessed provider route in this benchmark, not necessarily direct-provider or self-hosted performance.

The benchmark uses multiple attempts per question type, but small sample counts (3 attempts per configuration) can still be sensitive to run-to-run variance. The open vs. closed classification in this document is practical and model-family based; it is not legal advice or a license determination. Teams considering self-hosting or redistribution should verify the specific license and provider terms for the exact model artifact they plan to use.

The analytical hypotheses presented in sections 09 (Medium Paradox) and 10 (Overthinking Effect) are interpretations of observed patterns, not proven causal mechanisms. We encourage readers to test these patterns on their own workloads before adopting them as production strategies.

For full methodology, scoring details, and live results, visit the LumintBench website.