LumintBench Whitepaper — April 2026
The Cost of Reasoning: Which LLMs Actually Deliver Under Production Constraints?
158 models. 15 labs. One deterministic logic benchmark — measuring accuracy, cost, latency, and reliability.
Grounded in Investment Operations
Lumint is an outsourced derivatives overlay and technology provider, supporting trades and transactions across many major institutions each day. AI and machine learning are not aspirational for us — they are active infrastructure, deployed to solve the problems that make financial data uniquely hard: automated data quality checks, complex security hierarchies, asset classification, counterparty validation, and the dependency chains that define investment operations at scale.
LumintBench was not built as an academic exercise. It was built because we needed a rigorous way to evaluate AI and LLM reasoning for the specific domain we operate in — and because the publicly available benchmarks weren't measuring the right things for us. The tasks that matter in investment operations require multi-step logic, comparative analysis, hierarchical evaluation, and constraint satisfaction — evaluated not just for accuracy, but across the operational dimensions that define viability in production: cost, latency, and consistency. Those are the reasoning patterns LumintBench is designed to test.
We share these results publicly because the fiduciary duty to investors doesn't stop at any one firm's door. The better our industry collectively evaluates and deploys AI in trading and investment operations, the better off the investors we all serve will be.
Accuracy Is Not Enough
Most LLM benchmarks test a single dimension: can the model get the right answer? That matters, of course — but it's an incomplete picture for any team deploying language models in production. A model that scores perfectly but costs ten times more per query, takes five times longer to respond, or intermittently returns unparseable output isn't the right choice for legal analysis, financial modeling, or multi-step decision support.
LumintBench was designed to address this gap. It evaluates LLM reasoning capability using custom logic puzzles that require multi-step deduction across interrelated constraints. Every field in a model's response is independently verified, yielding a fine-grained view of both reasoning quality and output reliability. But critically, it also tracks cost, latency, and error rate — the dimensions that determine whether a model is viable in production or merely impressive in a demo.
This whitepaper presents the results of our April 2026 evaluation: 158 models from 15 labs, tested across three difficulty tiers and four reasoning-effort levels. The findings reveal a market where raw accuracy has converged at the top — but cost efficiency, speed, and reliability still vary by orders of magnitude.
How LumintBench Works
LumintBench poses deterministic logic puzzles at three difficulty levels:
- Easy — identify a single element from the solution.
- Medium — identify one complete set satisfying all constraints.
- Hard — identify all valid sets; demands sustained multi-step reasoning without error accumulation.
Each field in a model's answer receives a score of +1 (correct), 0 (valid but wrong), or −1 (hallucinated — not a valid puzzle value). This penalty means a model that confidently hallucinates scores worse than one that simply guesses wrong. Scores are normalized against the maximum possible across all attempts, including failures: if a model returns garbage on one of three attempts, that zero counts against the denominator.
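To make the rule concrete, here is a minimal sketch of the scoring arithmetic in Python. The data shapes and function names are illustrative, not LumintBench's actual harness.

```python
# Minimal sketch of the scoring rule described above. Field names and data
# shapes are illustrative; the real LumintBench harness differs in detail.

def score_field(predicted: str, correct: str, valid_values: set) -> int:
    """+1 correct, 0 valid-but-wrong, -1 hallucinated (not a valid puzzle value)."""
    if predicted == correct:
        return 1
    if predicted in valid_values:
        return 0
    return -1  # hallucination penalty: worse than a plain wrong guess

def normalized_score(attempts: list, answer_key: dict, valid_values: set) -> float:
    """Normalize against the maximum over ALL attempts, including failures.

    A failed attempt (None: unparseable output or API error) adds nothing to
    the numerator, but its fields still count in the denominator.
    """
    denominator = len(answer_key) * len(attempts)
    numerator = 0
    for attempt in attempts:
        if attempt is None:  # failed attempt scores zero
            continue
        for field, correct in answer_key.items():
            predicted = attempt.get(field)
            if predicted is not None:
                numerator += score_field(predicted, correct, valid_values)
    return numerator / denominator

# Two clean-ish attempts plus one failure: the failure drags the score down.
key = {"slot1": "red", "slot2": "fox"}
vals = {"red", "blue", "fox", "owl"}
print(normalized_score(
    [{"slot1": "red", "slot2": "fox"},      # (1+1) = 2
     {"slot1": "red", "slot2": "unicorn"},  # (1-1) = 0, hallucinated value
     None],                                 # failed attempt: 0
    key, vals))                             # 2 / 6 ≈ 0.33
```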
Reasoning effort levels
Models supporting extended reasoning are tested at four effort levels: none, low, medium, and high. Models without reasoning support are tested at none only. Each effort level carries a maximum of 111 points, so reasoning-enabled models have a maximum possible score of 444 (4 × 111) versus 111 for non-reasoning models, enabling direct comparison of how additional "thinking time" affects accuracy.
Infrastructure
All models are accessed via the OpenRouter API, providing a unified interface with consistent cost and latency tracking per request. Only non-free models with structured output support are included.
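For reference, a single benchmark attempt might look like the following sketch against OpenRouter's OpenAI-compatible chat completions endpoint. The prompt, schema, and timeout are placeholders, and structured-output and reasoning-effort parameter support varies by model; this is not the benchmark's actual harness.

```python
# Sketch of one benchmark attempt via OpenRouter. The schema and prompt are
# placeholders; check OpenRouter's docs for current per-model parameter support.
import os
import time
import requests

ANSWER_SCHEMA = {  # placeholder: the real schema encodes the puzzle's fields
    "type": "object",
    "properties": {"answer": {"type": "array", "items": {"type": "string"}}},
    "required": ["answer"],
    "additionalProperties": False,
}

def run_attempt(model: str, prompt: str, effort=None) -> dict:
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Only models with structured output support are benchmarked.
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "puzzle_answer", "strict": True,
                            "schema": ANSWER_SCHEMA},
        },
    }
    if effort is not None:  # omitted entirely at the "none" effort level
        body["reasoning"] = {"effort": effort}
    start = time.monotonic()
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json=body,
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    return {
        "content": data["choices"][0]["message"]["content"],
        "usage": data.get("usage", {}),           # token counts for cost tracking
        "latency_s": time.monotonic() - start,
    }
```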
Overall Rankings: Who Leads on Raw Accuracy
Five models achieved a perfect 444/444 score across all difficulty levels and reasoning efforts. But the cost to get there differs dramatically — from $0.14 to $28.59 for the same benchmark run.
| # | Model | Lab | Type | Score | Cost | Latency | Errors |
|---|---|---|---|---|---|---|---|
How the Latest Frontier Releases Stack Up
Models released in the past 60 days are evaluated on the same benchmark as all other entries, providing context for how the newest releases fit into the broader field.
The Efficiency Frontier: How Much Does a Point Cost?
Perhaps the most actionable dimension in model selection is cost efficiency. Among the five perfect scorers, Qwen's open-weight qwen3.5-flash-02-23 achieved 444/444 for just $0.14 — while OpenAI's gpt-5.2-pro reached the same score for $28.59. That's a 198× cost difference for identical reasoning performance.
Speed Matters: Who Answers Fastest?
For interactive applications, latency is a hard constraint. A model that takes four minutes to respond isn't viable for real-time decision support, regardless of accuracy. Among models scoring above 300, response times range from 13 seconds to over 7 minutes.
Can You Actually Depend on It?
Reliability might be the most underappreciated dimension in model selection. A model that returns unparseable output, triggers API errors, or silently hallucinates creates real operational risk. LumintBench's scoring penalizes unreliable models directly: failed attempts count as zero against the total denominator, and hallucinated values receive negative scores.
Among the top 20 scorers, 14 achieved a 0% error rate. But several high-scoring models — including Google's Gemini 2.5 Pro variants — showed error rates of 11–19%, meaning roughly one in nine to one in five attempts failed to produce a usable response. For a production system processing thousands of queries, that reliability gap compounds into significant operational cost.
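A back-of-envelope model makes the compounding concrete. Assuming failures are independent and every failure is retried until success, the expected number of API calls per usable answer is 1/(1 − p):

```python
# Expected API calls per usable answer if each attempt fails independently
# with probability p and failures are retried until success.
def expected_calls(p: float) -> float:
    return 1.0 / (1.0 - p)

for p in (0.00, 0.11, 0.19):
    print(f"error rate {p:.0%}: {expected_calls(p):.2f} calls per usable answer")
# 0% -> 1.00, 11% -> 1.12, 19% -> 1.23: roughly 12-23% extra cost and latency,
# before counting timeout delays or fallback-model complexity.
```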
The Open-Weight Contenders
Open-weight models have closed the gap with proprietary frontier models — and in several cases, surpassed them on the metrics that matter most for deployment. Qwen's open-weight qwen3.5-flash-02-23 tied for the highest score of any model tested (444/444) at a fraction of the cost of any closed competitor. In fact, four of the five perfect scorers are open-weight Qwen models whose weights are publicly available on Hugging Face — the "plus" and "flash" branding refers to Alibaba Cloud's managed API hosting of those same open models.
The Qwen family dominates open-weight results, placing seven models in the top ten. DeepSeek's v3.2-speciale model scored 333 but at a steep latency cost (1,200+ seconds average). Xiaomi's mimo-v2-flash offers a strong budget option at 286 points for just $0.13. Meta's Llama 4 Maverick, despite its prominence, scored only 47 — suggesting the model's structured output capabilities lag behind its general chat performance.
Best-in-Class by Lab
Each major AI lab has a different performance ceiling on LumintBench. Here's how their flagship models compare head-to-head.
| Lab | Best Model | Score | % of Max | Cost | Avg Latency | Error Rate |
|---|---|---|---|---|---|---|
Provider-level view: reliability varies dramatically by lab
Zooming out from individual models to lab-level aggregates reveals a different picture. Some labs have low overall error rates because most of their models return clean structured output. Others show high aggregate error rates — often because several models in their portfolio failed every attempt (API unavailability, structured output incompatibility, etc.). A provider-level average can mask both strong and weak individual model configurations, so this chart should be read alongside the per-model data above.
Why Models Score Higher on Hard Than Medium
One of the most counterintuitive findings in this evaluation is what we call the Medium Paradox: nearly half of all models (45%) scored a higher percentage on hard questions than on medium questions. This isn't a marginal effect; some models show a gap of 50 or more percentage points in favor of the harder task.
The pattern holds across labs and model families, so it isn't a quirk of any single architecture.
A working hypothesis: full-solve self-correction
We don't have visibility into the models' internal reasoning chains, so we can't prove exactly why this happens. But the pattern is consistent with a plausible hypothesis about how LLMs approach scoped vs. exhaustive tasks.
Hard questions require the model to identify all sets of elements in the solution. If the model is effectively solving the entire constraint puzzle to do this, each constraint it satisfies would narrow the remaining solution space, and errors in one set would create contradictions with others. The completeness requirement may force cross-validation — with the model essentially correcting its own work as a byproduct of solving exhaustively.
Medium questions, by contrast, require only one set. If models are scoping their reasoning to the question as asked — attempting to identify the requested set directly rather than solving the full puzzle and extracting the relevant subset — they would lose the cross-validation benefit of full constraint propagation. In this reading, they're being efficiently lazy, and it's costing them accuracy.
Possible implication for production: ask for everything, then extract
If our hypothesis is correct, it carries a direct, actionable lesson for teams using LLMs in production reasoning tasks. When a problem has an interconnected solution structure — legal analysis with multiple dependent clauses, financial models with cross-referencing constraints, planning problems with interrelated variables — it may be better to ask the model for the full solution rather than only the subset you need.
The idea: ask the model to solve the complete problem. Give it the full scope. Let it work through all the constraints, potentially benefiting from the self-correcting dynamics of exhaustive solving. Then, in a downstream step, extract the specific subset of the solution relevant to your use case.
This is the prompt engineering equivalent of a SQL developer who computes a full join and then filters, rather than trying to write a perfectly scoped query that misses edge cases. The extra tokens are cheap compared to the potential accuracy gain. The data from this benchmark suggests the difference can be 30–70 percentage points of accuracy on the same underlying problem — simply by changing how much of the solution you ask the model to produce. This is worth testing in your own domain.
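As a sketch of what this pattern could look like in code, under the assumption that the model returns the full solution as structured JSON (`llm_solve`, `FULL_SOLVE_PROMPT`, and the `"sets"` field are all hypothetical):

```python
# "Solve fully, extract downstream": the model is asked for the complete
# solution; the narrowing step is ordinary code and cannot hallucinate.
# llm_solve is a placeholder for any structured-output LLM call.

FULL_SOLVE_PROMPT = (
    "Solve the entire puzzle. Return EVERY set that satisfies ALL "
    'constraints as JSON: {"sets": [["..."], ["..."]]}'
)

def answer_scoped_question(llm_solve, puzzle: str, must_contain: str):
    """Answer a 'find the set containing X' question via an exhaustive solve."""
    full = llm_solve(FULL_SOLVE_PROMPT + "\n\n" + puzzle)  # exhaustive step
    for candidate in full["sets"]:                          # deterministic extract
        if must_contain in candidate:
            return candidate
    return None  # the element wasn't in any valid set
```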
When Reasoning Effort Hurts Performance
The Medium Paradox reveals what happens when models under-scope their solving. The Overthinking Effect reveals the opposite: what happens when models over-apply reasoning to tasks that don't require it. Among models tested across multiple reasoning effort levels, 24% performed better with "none" reasoning than with "high" reasoning on easy and medium questions.
The most extreme case is OpenAI's gpt-5.1, which scored 42.9% at no reasoning but dropped to −33.3% (net negative from hallucinations) at high effort on easy+medium questions. GPT-5-nano went from 61.9% to 0%. These aren't rounding errors — they represent significant degradation when extended reasoning is applied to simpler tasks.
What makes this pattern especially notable is that the same models often benefit from high reasoning effort on hard questions. OpenAI's o1 dropped 24 percentage points on easy+medium when switching from none to high — but gained 26 points on hard. Qwen's qwen3.5-9b lost 24 points on easy+medium but gained 33 points on hard. One plausible interpretation: extended reasoning adds complexity that helps with genuinely complex problems but can introduce overthinking, second-guessing, or hallucination on tasks that are more straightforward. We flag this as an observed pattern worth investigating, not a confirmed mechanism.
Possible production implication: match reasoning effort to task complexity
If this pattern holds in your domain, a single reasoning effort setting across all query types may be suboptimal. The data suggests exploring a tiered approach: route simpler queries to lower (or no) reasoning effort, and reserve high-effort reasoning for complex, multi-step tasks. This should be validated on your own workloads, as the effect size and direction may vary by model, domain, and prompt structure.
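A minimal routing sketch, assuming some complexity signal is available (the `classify_complexity` heuristic below is a stand-in for a real classifier or explicit task metadata):

```python
# Tiered reasoning effort: spend extended thinking only where it pays off.

def classify_complexity(query: str) -> str:
    # Placeholder heuristic; replace with a cheap classifier model or
    # explicit task metadata in a real system.
    if "all valid" in query.lower() or len(query) > 2000:
        return "hard"
    return "medium" if len(query) > 300 else "easy"

def pick_effort(query: str):
    tier = classify_complexity(query)
    if tier == "hard":
        return "high"   # extended reasoning helps on genuinely complex tasks
    if tier == "medium":
        return "low"
    return None         # no extended reasoning: avoids the overthinking penalty
```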
More Tokens ≠ Better Answers
One common assumption is that models producing more tokens — especially more reasoning tokens — will perform better. The data tells a more nuanced story.
Several top-performing models are remarkably token-efficient. Google's Gemini 3.1 Pro Preview scored 443 with an average of ~3,800 completion tokens, while xAI's Grok 4.20 Multi-Agent Beta used ~31,500 tokens for the same score. Both reached near-perfection, but one used 8× more tokens to get there. For cost-sensitive deployments where you pay per token, this difference is decisive.
Conclusions for Production Teams
1. Raw accuracy has converged at the frontier
Eight models from four labs scored above 441/444. If your only criterion is "can it solve hard reasoning problems," you have many options. The differentiators are now cost, speed, and reliability.
2. Cost efficiency varies by orders of magnitude
The same perfect score costs anywhere from $0.14 to $28.59 — a 198× range. For high-volume deployments, this is the difference between feasibility and budget exhaustion. Qwen's open-weight flash model is the clear efficiency champion.
3. Open-weight models are production-ready — and leading
Four of five perfect scorers are open-weight Qwen models. Open models dominate the cost-efficiency frontier. For teams that can self-host or use cost-effective inference providers, the economic case for open-weight is overwhelming.
4. Reliability is a hidden cost
Several models that score well on accuracy have error rates of 10–20%, meaning substantial retries and fallback logic in production. Choosing a model with 0% errors at a slightly lower score may be the better engineering decision.
5. Latency and accuracy don't have to trade off
Google's Gemini 3.1 Pro Preview scored 443/444 with an average latency of 33 seconds. Anthropic's Claude Opus 4.6 scored 372 in just 36 seconds. Speed and accuracy coexist — you just need to know where to look.
6. Expensive ≠ better
OpenAI's gpt-5.4-pro cost $88.56 and scored 432. Qwen's qwen3.6-plus scored 444 for $0.59. The most expensive model didn't even achieve a perfect score. Price is not a proxy for quality.
7. Ask for the complete solution, not the subset
The Medium Paradox suggests that models reason more accurately when asked to solve exhaustively. In production, framing prompts to request the full solution, then extracting the relevant subset downstream, can yield dramatically better accuracy at minimal additional token cost.
8. Match reasoning effort to task complexity
The Overthinking Effect shows that high reasoning effort can actively degrade performance on simpler tasks — with drops of 60+ percentage points in extreme cases. Production systems handling mixed-complexity workloads should consider routing easier queries to lower reasoning effort and reserving extended thinking for genuinely complex problems.
Model selection framework for production reasoning
The right model depends on the workload. Rather than picking a single "best" model, consider mapping your use cases to different selection criteria.
| Use Case | Primary Criteria | Candidate Families | Implementation Note |
|---|---|---|---|
| Interactive expert assistant | High score, low latency, zero parse errors | Gemini 3.1 Pro, Claude Opus fast, Grok 4.1 Fast | Watch premium pricing and provider routing variance |
| High-throughput structured reasoning | Low cost, stable score, parseable JSON | Qwen3.5 Flash, Grok 4.1 Fast, Qwen3 variants | Monitor output length and cap reasoning effort |
| Regulated / self-hosted workflow | Open-weight, auditable deployment | Qwen3-30B-A3B, Qwen3.5-35B-A3B, DeepSeek V3 | Verify licenses; reproduce latency on target hardware |
| Batch deep analysis | Score first, latency second | GPT-5.2, Claude Opus 4.7, Gemini Pro, Kimi K2.6 | Use queues, retries, and cost budgets |
| Escalation architecture | Cheap first pass, premium fallback | Low-cost high-score model + frontier reviewer | Define escalation triggers, not just model rankings |
The escalation pattern
One of the clearest implications of this data is that a tiered architecture often beats a single-model deployment. Use an inexpensive, high-accuracy model (Qwen3.5 Flash, Grok 4.1 Fast) for first-pass reasoning. When confidence is low, the task is high-value, or governance requirements demand it, escalate to a premium frontier model. This approach captures most of the cost savings of cheap models while retaining the quality ceiling of expensive ones. Define escalation triggers based on task complexity, confidence scores, or business value — not just model rankings.
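A minimal sketch of that routing logic, assuming `ask_model` wraps a structured-output call that returns `None` on failure and an optional self-reported confidence field (the model IDs and the 0.8 threshold are illustrative, not recommendations):

```python
# Escalation pattern: cheap, high-accuracy first pass with a premium fallback.
# ask_model, the confidence field, and the model IDs are illustrative.

CHEAP_MODEL = "qwen/qwen3.5-flash"      # example low-cost, high-score model
PREMIUM_MODEL = "openai/gpt-5.2-pro"    # example frontier reviewer

def answer_with_escalation(ask_model, task: str, high_value: bool = False):
    draft = ask_model(CHEAP_MODEL, task)
    escalate = (
        draft is None                             # parse or API failure
        or draft.get("confidence", 1.0) < 0.8     # low self-reported confidence
        or high_value                             # governance: always review
    )
    if escalate:
        return ask_model(PREMIUM_MODEL, task) or draft
    return draft
```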
About LumintBench
LumintBench is a reasoning benchmark that evaluates LLMs using custom, deterministic logic puzzles requiring multi-step deduction. It is designed for machine scoring with no ambiguity in correct answers. The benchmark captures four dimensions (accuracy, cost efficiency, latency, and reliability) to support genuinely informed model selection for production workloads.
Limitations and interpretation
LumintBench measures a specific form of multi-step deductive reasoning. It is not a universal proxy for all enterprise tasks. The dataset contains results from multiple survey dates in April 2026. Provider routing, pricing, and model behavior may change after the observed runs. OpenRouter latency and cost reflect the accessed provider route in this benchmark, not necessarily direct-provider or self-hosted performance.
The benchmark uses multiple attempts per question type, but small sample counts (3 attempts per configuration) can still be sensitive to run-to-run variance. The open vs. closed classification in this document is practical and model-family based; it is not legal advice or a license determination. Teams considering self-hosting or redistribution should verify the specific license and provider terms for the exact model artifact they plan to use.
The analytical hypotheses presented above (the Medium Paradox and the Overthinking Effect) are interpretations of observed patterns, not proven causal mechanisms. We encourage readers to test these patterns on their own workloads before adopting them as production strategies.
For full methodology, scoring details, and live results, visit the LumintBench website.