CentralGauge Benchmark Update: Why the Numbers Changed

Iteration over tests

Shortly after Claude Opus 4.6 launched, I published the first CentralGauge benchmark results comparing 8 LLMs on AL code generation for Microsoft Dynamics 365 Business Central (BC). Those initial numbers told an interesting story, but they weren’t the full picture.

Since then, I’ve made significant fixes to the benchmark infrastructure, task definitions, and test harness. The scores have shifted. Some models improved substantially. Some tasks that appeared impossible turned out to be broken on my end. And results that seemed inconsistent are now stable and reproducible.

This post covers what changed and why the updated results are more trustworthy.


Code Extraction Was Silently Corrupting Model Output

The most impactful bugs were in the code extraction pipeline. Models were generating valid AL code, but the harness was mangling it before compilation.

Missing sanitisation step. Markdown code fences in LLM responses were sometimes not stripped as expected, which allowed backticks to leak into the AL source and cause compilation failures.
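A minimal sketch of the kind of sanitisation step now in place (the function name is illustrative, not the actual harness API): strip a leading and trailing markdown fence so stray backticks never reach the AL compiler.

```python
import re

def strip_markdown_fences(text: str) -> str:
    """Remove a leading ``` fence (with optional language tag, e.g. ```al)
    and a trailing ``` fence, so backticks never leak into the AL source."""
    text = text.strip()
    # Drop an opening fence such as ``` or ```al at the very start.
    text = re.sub(r"\A```[a-zA-Z]*\s*\n?", "", text)
    # Drop a closing fence at the very end.
    text = re.sub(r"\n?```\s*\Z", "", text)
    return text.strip()
```

Responses without fences pass through unchanged, so the step is safe to apply unconditionally.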

Greedy regex on self-correction. When models self-corrected mid-response, a greedy regex captured everything between the first and last code markers — including explanation text between blocks. Switching to a non-greedy match fixed it.
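The failure mode is easy to reproduce. In this sketch (the response text is invented; the `BEGIN-CODE`/`END-CODE` markers are the ones the benchmark uses), the greedy pattern swallows the explanation between blocks, while the non-greedy pattern yields each block separately so the extractor can take the model's final answer:

```python
import re

response = (
    "BEGIN-CODE\nprocedure Foo()\nEND-CODE\n"
    "Wait, that is wrong, let me fix it:\n"
    "BEGIN-CODE\nprocedure Foo(Bar: Integer)\nEND-CODE\n"
)

# Greedy: spans from the first BEGIN-CODE to the LAST END-CODE,
# capturing the prose between the two blocks as "code".
greedy = re.search(r"BEGIN-CODE\n(.*)\nEND-CODE", response, re.DOTALL)

# Non-greedy: each match stops at the nearest END-CODE; taking the
# last match gives the model's final, self-corrected block.
blocks = re.findall(r"BEGIN-CODE\n(.*?)\nEND-CODE", response, re.DOTALL)
final_code = blocks[-1]
```

The one-character change from `(.*)` to `(.*?)` is the whole fix.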

Missing fences on fix prompts. The second-attempt prompt lacked `BEGIN-CODE`/`END-CODE` delimiters, so some models (especially Gemini) prepended prose that leaked into the extracted code.

The net effect: more models now produce compilable code on both attempts because the extractor no longer injects invalid characters or captures stale blocks.


Tasks That Were Impossible to Solve

In the initial run, 11 tasks had a 0% pass rate across all 8 models and all 3 runs. That’s a strong signal that the problem lies with the task, not the models.

I audited each one and found issues like:

  • Test harness bugs that triggered runtime errors regardless of the generated code
  • Missing support files (report layouts) that the BC runtime requires
  • Incorrect test policies that blocked valid operations at compile time
  • Assertions that tested the wrong error codes, or didn’t account for BC’s transaction rollback behaviour

These weren’t subtle issues. They were infrastructure failures that made it structurally impossible for any model to pass, regardless of the quality of its generated code. These are on me: too many changes, made too late at night.


Vague Specifications Made Scores Noisy

Beyond the completely broken tasks, a larger set had ambiguous descriptions or tests that didn’t verify what the task required. That made scores noisy: a model might pass on one run and fail on the next, depending on arbitrary choices.

The most common issue was that the function signatures in the task specification did not match the function calls in the tests, mostly because the function definitions weren’t explicitly provided each time. Models had to infer parameter names and types, making the task effectively a lottery. I realigned 8 task descriptions so the spec matches the test exactly. Again, too many, too late.

Other fixes included:

  • Removing ambiguous phrasing that left models unsure which AL pattern to use
  • Correcting task specs that contained invalid AL syntax
  • Hardening tests to accept multiple valid implementation approaches

In one case, a task jumped from 31% to 90% simply by handling valid model behaviours (UI popups, error patterns, HTTP calls) that the tests weren’t designed to handle. The models were already doing the right thing.


Updated Rankings

After all fixes, here are the current results across 56 tasks (17 Easy, 16 Medium, 23 Hard), 3 runs each.

  • pass@1: probability that a task passes in a single randomly sampled run
  • pass@3: probability that a task passes at least once across the 3 runs
  • Consistency: fraction of tasks where all 3 runs have the same outcome (all pass or all fail)
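These three definitions are straightforward to compute from per-task run outcomes. A small sketch, assuming outcomes are stored as booleans per run (the task ids and results below are hypothetical):

```python
def summarise(results: dict[str, list[bool]]) -> tuple[float, float, float]:
    """Compute pass@1, pass@3 and consistency from per-task run outcomes.

    results maps task id -> list of outcomes, one per run (True = pass).
    """
    n = len(results)
    # pass@1: probability a single randomly sampled run passes.
    pass1 = sum(sum(runs) / len(runs) for runs in results.values()) / n
    # pass@3: probability at least one of the runs passes.
    pass3 = sum(any(runs) for runs in results.values()) / n
    # Consistency: all runs agree (all pass or all fail).
    consistency = sum(all(runs) or not any(runs) for runs in results.values()) / n
    return pass1, pass3, consistency

# Example: three tasks, three runs each (hypothetical outcomes).
demo = {
    "easy-01": [True, True, True],
    "med-07":  [True, False, True],
    "hard-12": [False, False, False],
}
p1, p3, cons = summarise(demo)
```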
| Rank | Model | pass@1 | pass@3 | Consistency | Cost/run |
|------|-------|--------|--------|-------------|----------|
| 1 | Claude Opus 4.6 | 76.2% | 76.8% | 98.2% | $3.72 |
| 2 | Claude Opus 4.5 (50K thinking) | 75.0% | 76.8% | 96.4% | $3.09 |
| 3 | GPT-5.2 (thinking=high) | 64.3% | 69.6% | 89.3% | $1.21 |
| 4 | Claude Sonnet 4.5 | 63.1% | 66.1% | 94.6% | $1.42 |
| 5 | Gemini 3 Pro | 55.4% | 64.3% | 83.9% | $1.01 |
| 6 | Grok Code Fast 1 | 52.4% | 57.1% | 89.3% | $3.59 |
| 7 | DeepSeek V3.2 | 47.6% | 55.4% | 80.4% | $1.22 |
| 8 | Qwen3 Coder Next | 35.1% | 39.3% | 91.1% | $1.12 |

The ranking order is broadly similar to the initial run, but the absolute scores have changed as previously unsolvable tasks became solvable and noisy tasks stabilised.

Key observations:

  • Opus 4.6 at 98.2% consistency is the standout metric. It gives nearly identical results every run, with only a 0.6 percentage point gap between pass@1 and pass@3. Opus 4.5 is close behind at 96.4%.
  • GPT-5.2 leapfrogs Sonnet 4.5, landing at third place with 64.3% pass@1 compared to Sonnet’s 63.1%. Both show solid consistency (89.3% and 94.6% respectively).
  • Gemini 3 Pro’s self-correction gap remains the most striking anomaly: only 3.75% of first-attempt failures are recovered on the second attempt, compared to 24-31% for other models. The BEGIN-CODE/END-CODE fence fix helped, but Gemini’s tendency to rewrite entire responses (rather than making targeted fixes) remains a structural limitation.
  • Cost efficiency favours GPT-5.2 and Gemini 3 Pro at roughly $0.034 and $0.033 per passed task respectively, while Opus 4.6 costs $0.087 per passed task for its +12-21pp accuracy advantage.
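The per-passed-task figures follow directly from cost per run divided by the expected number of passing tasks in that run. A quick sketch, assuming the 56-task suite and the pass@1 rates from the table (the function name is illustrative):

```python
def cost_per_passed_task(cost_per_run: float, pass_at_1: float,
                         n_tasks: int = 56) -> float:
    """Cost of one full benchmark run divided by the expected number
    of tasks that pass in that run (pass@1 * number of tasks)."""
    return cost_per_run / (pass_at_1 * n_tasks)

opus_46 = cost_per_passed_task(3.72, 0.762)
gpt_52 = cost_per_passed_task(1.21, 0.643)
gemini = cost_per_passed_task(1.01, 0.554)
```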

What’s Next

The benchmark continues to evolve. Current priorities:

  • Continued task auditing. I’m still identifying tasks with edge-case issues, particularly around BC runtime behaviours like transaction rollbacks and UI handlers.
  • More models. I plan to add models as they become available, particularly from providers I haven’t yet covered.
  • Agent benchmarks. CentralGauge now supports running AI agents (like Claude Code) in isolated Docker containers. That gives them access to the AL compiler and test runner as tools rather than static prompts. Early results suggest agents substantially outperform single-shot generation on harder tasks. Some models excel in these workloads, Opus 4.6 especially.
  • Scale improvements. Multi-container support, task-level parallelism, and parallelised compilation now make full benchmark runs significantly faster.

The full benchmark results, task definitions, and source code are available on GitHub.

If you have feedback, spot issues, or want to contribute, please open an issue or submit a pull request. The goal is a transparent, reproducible benchmark that drives progress in AL code generation for BC.

Native AL Language Server Support in Claude Code

If you’re using Claude Code for Business Central development, you’ve probably noticed that while it’s great at writing AL code, it doesn’t truly understand your project structure. It can’t jump to definitions, find references, or see how your objects relate to each other.

Until now.

I’ve built native AL Language Server Protocol (LSP) integration for Claude Code. This means Claude now has the same code intelligence that VS Code has: symbol awareness, navigation, and structural understanding of your AL codebase.

Wait, didn’t you already do this?

Yes! A few months ago I contributed AL language support to Serena MCP, which brought symbol-aware code editing to Business Central development. Serena works with any MCP-compatible agent: Claude Desktop, Cursor, Cline, and others.

This native Claude Code integration is different. Instead of going through MCP, it hooks directly into Claude Code’s built-in language server support. The result is a more polished, seamless experience specifically for Claude Code users.

Serena MCP: Universal, works everywhere, requires MCP setup
Native LSP: Claude Code only, tighter integration, zero-config once installed

If you’re using Claude Code as your primary tool, the native integration is the way to go. If you switch between different AI coding assistants, Serena gives you AL support across all of them.

What is this?

The AL Language Server is the engine behind VS Code’s AL extension. It’s what powers “Go to Definition”, “Find All References”, symbol search, and all the other navigation features you use daily.

By integrating this directly into Claude Code, the AI assistant now has access to:

  • Document symbols: all tables, codeunits, pages, fields, procedures in a file
  • Workspace symbols: search across your entire project
  • Go to Definition: jump to where something is defined
  • Go to Implementation: jump to implementations
  • Find References: see everywhere something is used
  • Hover information: type information and documentation
  • Call hierarchy: see what calls what, incoming and outgoing
  • Multi-project support: workspaces with multiple AL apps work fully

This isn’t regex pattern matching. This is the actual Microsoft AL compiler understanding your code.
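Under the hood, each of these capabilities is a standard JSON-RPC request to the AL Language Server, framed with the `Content-Length` header the LSP base protocol requires. A sketch of what a "Go to Definition" request looks like on the wire (the file URI and position are illustrative, not from a real project):

```python
import json

def frame_lsp_request(request_id: int, method: str, params: dict) -> bytes:
    """Frame a JSON-RPC message with the Content-Length header the LSP
    base protocol requires, ready to write to the server's stdin."""
    body = json.dumps({"jsonrpc": "2.0", "id": request_id,
                       "method": method, "params": params}).encode("utf-8")
    return b"Content-Length: %d\r\n\r\n" % len(body) + body

# "Go to Definition" for the symbol at line 77 of Customer.Table.al.
# Note: LSP positions are zero-based, hence line 76 here.
message = frame_lsp_request(1, "textDocument/definition", {
    "textDocument": {"uri": "file:///c:/src/app/Customer.Table.al"},
    "position": {"line": 76, "character": 29},
})
```

The server answers with the target location (file URI plus range), which is how results like "Defined in CustomerType.Enum.al:1:12" come back.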

Why does this matter?

Without LSP, Claude Code treats your AL files as plain text. It can read them, but it doesn’t understand the relationships between objects. Ask it to find every place where Customer."No." is used and it has to grep through files hoping to find matches.

With LSP, Claude can ask the language server directly. It knows that Customer is a table, that "No." is a field of type Code[20], and it can find every reference instantly.

The difference is like asking someone to find a book in a library by reading every page versus using the catalog system.

Real example

Here’s what Claude Code can do with LSP on a Customer table:

Go To Definition - On CustomerType enum reference at line 77:
→ Defined in CustomerType.Enum.al:1:12

Hover - Same position shows type info:
Enum CustomerType

Document Symbols - Full symbol tree for Customer.Table.al:
Table 50000 "TEST Customer" (Class) - Line 1
  fields (Class) - Line 6
    "No.": Code[20] (Field) - Line 8
      OnValidate() (Function) - Line 13
    Name: Text[100] (Field) - Line 22
    "Customer Type": Enum 50000 CustomerType (Field) - Line 77
    Balance: Decimal (Field) - Line 83
    ...
  keys (Class) - Line 131
    Key PK: "No." (Key) - Line 133
    ...
  OnInsert() (Function) - Line 158
  OnModify() (Function) - Line 168
  UpdateSearchName() (Function) - Line 190
  CheckCreditLimit() (Function) - Line 195
  GetDisplayName(): Text (Function) - Line 206

Every field with its type. Every key with its composition. Every procedure with its line number. Claude can now navigate your code like a developer would.

Requirements

  • Claude Code 2.1.0 or later. Earlier versions have a bug that prevents built-in LSPs from working.
  • VS Code with AL Language extension. The plugin uses Microsoft’s AL Language Server from your VS Code installation.
  • Python 3.10+ in your PATH
  • A Business Central project with standard AL project structure and app.json

Installation

Step 1: Enable LSP Tool

Set the environment variable before starting Claude Code. LSP support is opt-in: even though built-in LSPs are now supported, I don’t consider them production-ready in all setups yet, hence the explicit activation:

# PowerShell (current session)
$env:ENABLE_LSP_TOOL = "1"
claude

# PowerShell (permanent)
[Environment]::SetEnvironmentVariable("ENABLE_LSP_TOOL", "1", "User")
# Bash
export ENABLE_LSP_TOOL=1
claude

Step 2: Install the Plugin

  1. Run claude
  2. /plugin marketplace add SShadowS/claude-code-lsps
  3. Type /plugins
  4. Tab to Marketplaces
  5. Select claude-code-lsps
  6. Browse plugins
  7. Select al-language-server-python with spacebar
  8. Press “i” to install
  9. Restart Claude Code

That’s it. The plugin automatically finds the newest AL extension version in your VS Code extensions folder.

Repository: github.com/SShadowS/claude-code-lsps

What’s next?

The current wrapper is Python-based. A few things I’m looking at:

  • Go-compiled binaries for faster startup and no runtime dependencies
  • Better error handling for more graceful recovery when the language server hiccups
  • Testing on more setups with different VS Code versions and extension configurations

Try it out and feedback

If you’re doing BC development with Claude Code, give this a try. The difference in code navigation and understanding should be significant.

I’d love to hear your feedback. What works, what doesn’t.

If you open an issue on GitHub, please attach %TEMP%/al-lsp-wrapper.log, as it helps me a lot during debugging. The log will be disabled in a few weeks; it’s only needed during this early period.

Repository: github.com/SShadowS/claude-code-lsps


This is part of my ongoing work on AI tooling for Business Central development. See also: CentralGauge for benchmarking LLMs on AL code, and my MCP servers for BC integration.