Shortly after Claude Opus 4.6 launched, I published the first CentralGauge benchmark results comparing 8 LLMs on AL code generation for Microsoft Dynamics 365 Business Central (BC). Those initial numbers told an interesting story, but they weren’t the full picture.
Since then, I’ve made significant fixes to the benchmark infrastructure, task definitions, and test harness. The scores have shifted. Some models improved substantially. Some tasks that appeared impossible turned out to be broken on my end. And results that seemed inconsistent are now stable and reproducible.
This post covers what changed and why the updated results are more trustworthy.
Code Extraction Was Silently Corrupting Model Output
The most impactful bugs were in the code extraction pipeline. Models were generating valid AL code, but the harness was mangling it before compilation.
Missing sanitisation step. Markdown code fences in LLM responses were sometimes not stripped as expected, which allowed backticks to leak into the AL source and cause compilation failures.
Greedy regex on self-correction. When models self-corrected mid-response, a greedy regex captured everything between the first and last code markers — including explanation text between blocks. Switching to a non-greedy match fixed it.
Missing fences on fix prompts. The second-attempt prompt lacked `BEGIN-CODE`/`END-CODE` delimiters, so some models (especially Gemini) prepended prose that leaked into the extracted code.
The net effect: more models now produce compilable code on both attempts because the extractor no longer injects invalid characters or captures stale blocks.
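For illustration, here is a minimal Python sketch of the two extraction fixes: a non-greedy match between the `BEGIN-CODE`/`END-CODE` delimiters so prose between self-corrected blocks is no longer captured, and stripping of markdown fence lines before the source reaches the AL compiler. The function name, the fence pattern, and the choice to keep the last delimited block are my assumptions for this sketch; CentralGauge’s actual harness may be structured differently.

```python
import re

# Non-greedy (.*?) so each delimited block is captured separately, instead of
# one greedy match spanning from the first BEGIN-CODE to the last END-CODE
# (which swallowed explanation text when a model self-corrected mid-response).
BLOCK_RE = re.compile(r"BEGIN-CODE\s*(.*?)\s*END-CODE", re.DOTALL)

# Lines consisting only of a markdown fence, e.g. a bare fence or one tagged "al".
FENCE_RE = re.compile(r"^`{3}\w*\s*$", re.MULTILINE)

def extract_al_code(response: str) -> str:
    """Return compilable AL source extracted from a raw model response."""
    blocks = BLOCK_RE.findall(response)
    if not blocks:
        # Fall back to the whole response if the model ignored the delimiters.
        blocks = [response]
    # Assumption: after a self-correction, the last block is the intended one.
    code = blocks[-1]
    # Strip any surviving fence lines so backticks never reach the AL compiler.
    return FENCE_RE.sub("", code).strip()
```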
Tasks That Were Impossible to Solve
In the initial run, 11 tasks had a 0% pass rate across all 8 models and all 3 runs. That’s a strong signal that the problem lies with the task, not the models.
I audited each one and found issues like:
- Test harness bugs that triggered runtime errors regardless of the generated code
- Missing support files (report layouts) that the BC runtime requires
- Incorrect test policies that blocked valid operations at compile time
- Assertions that tested the wrong error codes, or didn’t account for BC’s transaction rollback behaviour
These weren’t subtle issues. They were infrastructure failures that made it structurally impossible for any model to pass, regardless of the quality of its generated code. These are on me; I made too many changes too late at night.
Vague Specifications Made Scores Noisy
Beyond the completely broken tasks, a larger set had ambiguous descriptions or tests that didn’t verify what the task required. That made scores noisy: a model might pass on one run and fail on the next, depending on arbitrary choices.
The most common issue was that the function signatures in the task specification did not match the function calls in the tests, mostly because the function definitions weren’t explicitly provided each time. Models had to infer parameter names and types, making the task effectively a lottery. I realigned 8 task descriptions so the spec matches the test exactly. Again, too many, too late.
Other fixes included:
- Removing ambiguous phrasing that left models unsure which AL pattern to use
- Correcting task specs that contained invalid AL syntax
- Hardening tests to accept multiple valid implementation approaches
In one case, a task’s pass rate jumped from 31% to 90% simply because the tests now accept valid model behaviours (UI popups, error patterns, HTTP calls) they previously weren’t designed to handle. The models were already doing the right thing.
Updated Rankings
After all fixes, here are the current results across 56 tasks (17 Easy, 16 Medium, 23 Hard), 3 runs each.
- pass@1: probability that a task passes in a single randomly sampled run
- pass@3: probability that a task passes at least once across the 3 runs
- Consistency: fraction of tasks where all 3 runs have the same outcome (all pass or all fail)
| Rank | Model | pass@1 | pass@3 | Consistency | Cost/run |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 76.2% | 76.8% | 98.2% | $3.72 |
| 2 | Claude Opus 4.5 (50K thinking) | 75.0% | 76.8% | 96.4% | $3.09 |
| 3 | GPT-5.2 (thinking=high) | 64.3% | 69.6% | 89.3% | $1.21 |
| 4 | Claude Sonnet 4.5 | 63.1% | 66.1% | 94.6% | $1.42 |
| 5 | Gemini 3 Pro | 55.4% | 64.3% | 83.9% | $1.01 |
| 6 | Grok Code Fast 1 | 52.4% | 57.1% | 89.3% | $3.59 |
| 7 | DeepSeek V3.2 | 47.6% | 55.4% | 80.4% | $1.22 |
| 8 | Qwen3 Coder Next | 35.1% | 39.3% | 91.1% | $1.12 |
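To make the three metrics concrete, here is a small Python sketch of how they can be computed from raw per-task, per-run outcomes. This is just an illustration of the definitions above, not CentralGauge’s actual scoring code.

```python
def score(results: list[list[bool]]) -> dict[str, float]:
    """results[t][r] is True if task t passed on run r (here, 3 runs per task)."""
    n_tasks = len(results)
    n_runs = len(results[0])
    # pass@1: mean pass rate over all task-run pairs, i.e. the probability that
    # a task passes in a single randomly sampled run.
    pass_at_1 = sum(sum(runs) for runs in results) / (n_tasks * n_runs)
    # pass@3: fraction of tasks that pass at least once across their runs.
    pass_at_3 = sum(any(runs) for runs in results) / n_tasks
    # Consistency: fraction of tasks where all runs agree (all pass or all fail).
    consistency = sum(all(runs) or not any(runs) for runs in results) / n_tasks
    return {"pass@1": pass_at_1, "pass@3": pass_at_3, "consistency": consistency}
```

For example, a task that passes on two of its three runs contributes 2/3 to pass@1, counts as passed for pass@3, and counts against consistency.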
The ranking order is broadly similar to the initial run, but the absolute scores have changed as previously unsolvable tasks became solvable and noisy tasks stabilised.
Key observations:
- Opus 4.6 at 98.2% consistency is the standout metric. It gives nearly identical results every run, with only a 0.6 percentage point gap between pass@1 and pass@3. Opus 4.5 is close behind at 96.4%.
- GPT-5.2 leapfrogs Sonnet 4.5, landing in third place with 64.3% pass@1 compared to Sonnet’s 63.1%. Both show solid consistency (89.3% and 94.6% respectively).
- Gemini 3 Pro’s self-correction gap remains the most striking anomaly: only 3.75% of first-attempt failures are recovered on the second attempt, compared to 24-31% for other models. The `BEGIN-CODE`/`END-CODE` fence fix helped, but Gemini’s tendency to rewrite entire responses (rather than making targeted fixes) remains a structural limitation.
- Cost efficiency favours GPT-5.2 and Gemini 3 Pro at roughly $0.033-0.034 per passed task, while Opus 4.6 costs $0.087 per passed task for its +12-21pp accuracy advantage.
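The cost-efficiency numbers fall straight out of the table: cost per run divided by the expected number of passed tasks (pass@1 × 56 tasks). A quick check using the table values:

```python
# Cost per passed task = cost per run / (pass@1 * number of tasks), using the
# per-run costs and pass@1 values from the results table above (56 tasks).
models = {
    "Claude Opus 4.6": (3.72, 0.762),
    "GPT-5.2 (thinking=high)": (1.21, 0.643),
    "Gemini 3 Pro": (1.01, 0.554),
}
for name, (cost_per_run, pass_at_1) in models.items():
    print(f"{name}: ${cost_per_run / (pass_at_1 * 56):.3f} per passed task")
# Claude Opus 4.6: $0.087 per passed task
# GPT-5.2 (thinking=high): $0.034 per passed task
# Gemini 3 Pro: $0.033 per passed task
```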
What’s Next
The benchmark continues to evolve. Current priorities:
- Continued task auditing. I’m still identifying tasks with edge-case issues, particularly around BC runtime behaviours like transaction rollbacks and UI handlers.
- More models. I plan to add models as they become available, particularly from providers I haven’t yet covered.
- Agent benchmarks. CentralGauge now supports running AI agents (like Claude Code) in isolated Docker containers, giving them access to the AL compiler and test runner as tools rather than static prompts. Early results suggest agents substantially outperform single-shot generation on harder tasks. Some models excel in these workloads, Opus 4.6 especially.
- Scale improvements. Multi-container support, task-level parallelism, and parallelised compilation now make full benchmark runs significantly faster.
The full benchmark results, task definitions, and source code are available on GitHub.
If you have feedback, spot issues, or want to contribute, please open an issue or submit a pull request. The goal is a transparent, reproducible benchmark that drives progress in AL code generation for BC.