CentralGauge Benchmark Update: Why the Numbers Changed

Iteration over tests

Shortly after Claude Opus 4.6 launched, I published the first CentralGauge benchmark results comparing 8 LLMs on AL code generation for Microsoft Dynamics 365 Business Central (BC). Those initial numbers told an interesting story, but they weren’t the full picture.

Since then, I’ve made significant fixes to the benchmark infrastructure, task definitions, and test harness. The scores have shifted. Some models improved substantially. Some tasks that appeared impossible turned out to be broken on my end. And results that seemed inconsistent are now stable and reproducible.

This post covers what changed and why the updated results are more trustworthy.


Code Extraction Was Silently Corrupting Model Output

The most impactful bugs were in the code extraction pipeline. Models were generating valid AL code, but the harness was mangling it before compilation.

Missing sanitisation step. Markdown code fences in LLM responses were sometimes not stripped as expected, which allowed backticks to leak into the AL source and cause compilation failures.

Greedy regex on self-correction. When models self-corrected mid-response, a greedy regex captured everything between the first and last code markers — including explanation text between blocks. Switching to a non-greedy match fixed it.

Missing fences on fix prompts. The second-attempt prompt lacked `BEGIN-CODE`/`END-CODE` delimiters, so some models (especially Gemini) prepended prose that leaked into the extracted code.
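To make the fixes concrete, here is a minimal sketch of what the extraction logic looks like after the changes. This is not CentralGauge’s actual extractor code (shown here in TypeScript purely for illustration); it just shows the non-greedy matching and fence stripping described above, assuming responses wrap code in `BEGIN-CODE`/`END-CODE` delimiters:

```typescript
// Minimal sketch of the fixed extraction behaviour (not CentralGauge's actual code).
function extractAlCode(response: string): string | null {
  // Non-greedy match: each capture stops at the first END-CODE after its
  // BEGIN-CODE, so prose between self-corrected blocks is never swallowed.
  const blocks = [...response.matchAll(/BEGIN-CODE\s*([\s\S]*?)\s*END-CODE/g)];
  if (blocks.length === 0) return null;

  // If the model self-corrected, keep only the last complete block.
  let code = blocks[blocks.length - 1][1];

  // Sanitisation step: strip any markdown fence lines that leaked inside the
  // block, so stray backticks never reach the AL compiler.
  code = code.replace(/^ {0,3}`{3,}[\w-]*\s*$/gm, "").trim();

  return code.length > 0 ? code : null;
}
```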

The net effect: more models now produce compilable code on both attempts because the extractor no longer injects invalid characters or captures stale blocks.


Tasks That Were Impossible to Solve

In the initial run, 11 tasks had a 0% pass rate across all 8 models and all 3 runs. That’s a strong signal that the problem lies with the task, not the models.

I audited each one and found issues like:

  • Test harness bugs that triggered runtime errors regardless of the generated code
  • Missing support files (report layouts) that the BC runtime requires
  • Incorrect test policies that blocked valid operations at compile time
  • Assertions that tested the wrong error codes, or didn’t account for BC’s transaction rollback behaviour

These weren’t subtle issues. They were infrastructure failures that made it structurally impossible for any model to pass, regardless of the quality of its generated code. These are on me; I wrote too many tasks too late at night.


Vague Specifications Made Scores Noisy

Beyond the completely broken tasks, a larger set had ambiguous descriptions or tests that didn’t verify what the task required. That made scores noisy: a model might pass on one run and fail on the next, depending on arbitrary choices.

The most common issue was that the function signatures in the task specification did not match the function calls in the tests, mostly because the function definitions weren’t explicitly provided each time. Models had to infer parameter names and types, making the task effectively a lottery. I realigned 8 task descriptions so the spec matches the test exactly. Again, too many, too late.

Other fixes included:

  • removing ambiguous phrasing that left models unsure which AL pattern to use
  • correcting task specs that contained invalid AL syntax
  • hardening tests to accept multiple valid implementation approaches

In one case, a task jumped from 31% to 90% simply by handling valid model behaviours (UI popups, error patterns, HTTP calls) that the tests weren’t designed to handle. The models were already doing the right thing.


Updated Rankings

After all fixes, here are the current results across 56 tasks (17 Easy, 16 Medium, 23 Hard), 3 runs each.

  • pass@1: probability that a task passes in a single randomly sampled run
  • pass@3: probability that a task passes at least once across the 3 runs
  • Consistency: fraction of tasks where all 3 runs have the same outcome (all pass or all fail)
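In code terms, the three metrics are computed roughly like this (a simplified sketch of the definitions above, not the benchmark’s actual reporting code); the table below uses these definitions.

```typescript
// Simplified sketch of the metric definitions (not the benchmark's reporting code).
// outcomes[i] holds the pass/fail result of each of the 3 runs for task i.
function metrics(outcomes: boolean[][]) {
  const tasks = outcomes.length;

  // pass@1: average per-task pass rate, i.e. the chance that a single
  // randomly sampled run of a task passes.
  const pass1 =
    outcomes.reduce((sum, runs) => sum + runs.filter(Boolean).length / runs.length, 0) / tasks;

  // pass@3: fraction of tasks that pass in at least one of their runs.
  const pass3 = outcomes.filter((runs) => runs.some(Boolean)).length / tasks;

  // Consistency: fraction of tasks where all runs agree (all pass or all fail).
  const consistency =
    outcomes.filter((runs) => runs.every(Boolean) || runs.every((r) => !r)).length / tasks;

  return { pass1, pass3, consistency };
}
```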
| Rank | Model | pass@1 | pass@3 | Consistency | Cost/run |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 76.2% | 76.8% | 98.2% | $3.72 |
| 2 | Claude Opus 4.5 (50K thinking) | 75.0% | 76.8% | 96.4% | $3.09 |
| 3 | GPT-5.2 (thinking=high) | 64.3% | 69.6% | 89.3% | $1.21 |
| 4 | Claude Sonnet 4.5 | 63.1% | 66.1% | 94.6% | $1.42 |
| 5 | Gemini 3 Pro | 55.4% | 64.3% | 83.9% | $1.01 |
| 6 | Grok Code Fast 1 | 52.4% | 57.1% | 89.3% | $3.59 |
| 7 | DeepSeek V3.2 | 47.6% | 55.4% | 80.4% | $1.22 |
| 8 | Qwen3 Coder Next | 35.1% | 39.3% | 91.1% | $1.12 |

The ranking order is broadly similar to the initial run, but the absolute scores have changed as previously unsolvable tasks became solvable and noisy tasks stabilised.

Key observations:

  • Opus 4.6 at 98.2% consistency is the standout metric. It gives nearly identical results every run, with only a 0.6 percentage point gap between pass@1 and pass@3. Opus 4.5 is close behind at 96.4%.
  • GPT-5.2 leapfrogs Sonnet 4.5, landing in third place with 64.3% pass@1 compared to Sonnet’s 63.1%. Both show solid consistency (89.3% and 94.6% respectively).
  • Gemini 3 Pro’s self-correction gap remains the most striking anomaly: only 3.75% of first-attempt failures are recovered on the second attempt, compared to 24-31% for other models. The BEGIN-CODE/END-CODE fence fix helped, but Gemini’s tendency to rewrite entire responses (rather than making targeted fixes) remains a structural limitation.
  • Cost efficiency favours GPT-5.2 and Gemini 3 Pro at roughly $0.033-0.034 per passed task, while Opus 4.6 costs $0.087 per passed task for its +12-21pp accuracy advantage.

What’s Next

The benchmark continues to evolve. Current priorities:

  • Continued task auditing. I’m still identifying tasks with edge-case issues, particularly around BC runtime behaviours like transaction rollbacks and UI handlers.
  • More models. I plan to add models as they become available, particularly from providers I haven’t yet covered.
  • Agent benchmarks. CentralGauge now supports running AI agents (like Claude Code) in isolated Docker containers. That gives them access to the AL compiler and test runner as tools rather than static prompts. Early results suggest agents substantially outperform single-shot generation on harder tasks. Some models excel in these workloads, Opus 4.6 especially.
  • Scale improvements. Multi-container support, task-level parallelism, and parallelised compilation now make full benchmark runs significantly faster.

The full benchmark results, task definitions, and source code are available on GitHub.

If you have feedback, spot issues, or want to contribute, please open an issue or submit a pull request. The goal is a transparent, reproducible benchmark that drives progress in AL code generation for BC.

Business Central on Linux? Here, hold my beer!

Sorry if this one is a bit long, but think of it more as a brain dump. I’ve been asked repeatedly how I managed to get Business Central running on Linux using Wine, so here’s the full-ish story.

Business Central doesn’t run on Linux. Everyone knows this. Microsoft built it for Windows, and that’s that.

So naturally, I had to try.

What started as curiosity turned into months of reverse engineering, debugging Wine internals, and learning more about Windows APIs than I ever wanted to know. Stefan Maron and I presented the results at Directions EMEA, and people kept asking for a proper write-up. Here it is. (Oh no, I’ve already over-promised.)

Why Even Try?

Because nobody had done it before. That’s it. That’s the reason.
Well, Microsoft probably have, but they have the source code, so they can just target Linux and comment out the functionality that won’t work. That kind of cheating isn’t available to the rest of us.

The cost savings and performance benefits we discovered later were a nice bonus. Windows runners on GitHub Actions cost twice as much as Linux runners. Builds run faster. Container startup is dramatically quicker. But none of that was the original motivation.

Sometimes you do things just to see if they’re possible.

The Native .NET Attempt

My first approach was simple. BC runs on .NET Core now, and .NET Core runs on Linux. Problem solved, right?

Not even close.

I copied the BC service tier files to a Linux machine and tried to start them. Immediately, it crashed.

The moment you try to start the BC service tier on Linux, it crashes while looking for Windows APIs. The code makes assumptions everywhere. It wants to know which Active Directory domain you’re in (I think to build the complete web service URLs). It assumes Windows authentication is available. These aren’t just preference checks that fail gracefully. The code is designed for a Windows environment.

I spent a few evenings trying different things, but it became clear this wasn’t going to work. BC has Windows baked into its DNA. So I had to try something else.

Enter Wine

If you can’t make the code Linux native, maybe you can make Linux look enough like Windows. That’s what Wine does. It’s a compatibility layer that translates Windows API calls to Linux equivalents.

Wine has been around forever. It runs thousands of applications. Mostly games and desktop software. Heck, Proton, which Steam uses to run Windows games on Linux, is based on Wine. The keyword there is “mostly.” When I checked Wine’s compatibility database, there were maybe 50 server applications listed, and 48 of them were game servers. And that was out of over 16,000 supported programs.

Server software is a different beast. It uses APIs that desktop applications never touch. HTTP.sys for web serving. Advanced authentication protocols. Service management. Wine’s developers understandably focused on what most people actually use.

But Wine is open source. If something is missing, you can add it. Well, if you can write C, which I last did in university more than 20 years ago. But I have something better than C skills: debugging skills, and a stubborn refusal to give up. Well, energy drinks, and AI. Lots of AI.

The Debug Loop

My approach was brute force. Start the BC service tier under Wine with full debug logging enabled. Watch it crash. Find out which API call failed. Implement or fix that API in Wine. Repeat.

The first crash came immediately. Some localisation API wasn’t returning what BC expected. Easy fix. Then the next crash. And the next.

I kept two resources open at all times: Microsoft’s official API documentation and a decompiler targeting BC’s assemblies. The docs told me what an API was supposed to do. The decompiled code told me exactly how BC was using it. Just a matter of connecting the dots.

Some APIs were straightforward translations. Others required understanding subtle Windows behaviours that aren’t documented anywhere. Why does this particular call return data in this specific format? Because some Windows component, somewhere, expects it that way, and BC inherited that expectation.

Plus, it didn’t help that the Microsoft documentation is often incomplete and just includes placeholder info for some parameters and return values.

I even had to program my own Event Log, because Wine doesn’t have one. So the entire task was just as much a tooling exercise as a programming one. I created loads of scripts to iterate over the debug output and filter out just the log lines I needed.

Getting It to Start

Before the service could even stay running, several hurdles arose that had nothing to do with Wine’s API coverage.

SQL Server encryption was an early roadblock; not because it didn’t work, but because it was a hassle to set up. BC insists on encrypted database connections, but the PowerShell cmdlets that normally configure certificates and connection strings don’t run on Linux. I had to reverse engineer what the cmdlets actually do and replicate each step manually.

The same problem hit user management. New-NavServerUser flat out refuses to work without Windows authentication. The cmdlet checks for valid Windows credentials before it does anything else. No Windows, no user creation.

My solution was pragmatic: bypass the cmdlets entirely. I wrote code that injects NavUserPassword users directly into the SQL database. BC stores passwords hashed exactly 100,001 times. Yes, that specific number. Finding that took longer than I’d like to admit.
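For illustration only, that iteration count translates to something like the sketch below. Whether BC actually uses PBKDF2, which digest it uses, and how the salt and hash end up in the User table are reverse-engineered internals I’m not reproducing here; treat every name and parameter in this snippet as an assumption.

```typescript
import { pbkdf2Sync, randomBytes } from "node:crypto";

// Hypothetical sketch: an iterated key derivation with BC's 100,001 iterations.
// The digest, key length, salt size and storage encoding are assumptions, not
// BC's documented scheme.
const ITERATIONS = 100_001;

function hashNavUserPassword(password: string): { salt: string; hash: string } {
  const salt = randomBytes(16);
  const hash = pbkdf2Sync(password, salt, ITERATIONS, 32, "sha256");
  return { salt: salt.toString("base64"), hash: hash.toString("base64") };
}
```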

Kerberos support in Wine was incomplete for the authentication modes BC wanted. Specifically, the SP800-108 CTR HMAC algorithm wasn’t implemented in Wine’s bcrypt. BC uses this for certain key derivation operations, so I had to add it. Again, it was just a matter of seeing in the logs what BC expected and making Wine do that.

When It “Worked”

After a week of this, something happened. The service started. It stayed running. I called an OData endpoint and got… HTTP 200. Success! Sort of.

The response body was empty. And after that first request, the service froze completely.

What was going on?

The HTTP.sys Rabbit Hole

BC uses Windows’ kernel-mode HTTP server (HTTP.sys) for its web endpoints. Wine had a partial implementation, but “partial” is generous. Looking at the httpapi.spec file, I counted 13 functions that were either bare stubs or missing entirely: HttpWaitForDisconnect, HttpWaitForDisconnectEx, HttpCancelHttpRequest, HttpShutdownRequestQueue, HttpControlService, HttpFlushResponseCache, HttpGetCounters, HttpQueryRequestQueueProperty, HttpQueryServerSessionProperty, HttpQueryUrlGroupProperty, HttpReadFragmentFromCache, HttpAddFragmentToCache, and HttpIsFeatureSupported.

Wine’s HTTP.sys could accept connections and start processing requests. It just couldn’t reply with a body payload, finish requests properly, or clean up afterwards. The server literally didn’t know how to release a connection once it was established. That’s why it froze after the first request.

I had to implement actual connection lifecycle management: the IOCTL handlers for waiting on disconnects, cancelling requests, and properly sending response bodies with the is_body and more_data flags. Server software needs to close connections cleanly; games either don’t care or use different APIs.

I also had to resort to extensive Wireshark tracing to see what BC expected at the network level. Comparing the raw HTTP traffic from a Windows BC instance with the Wine one made it much easier to spot what was missing or malformed. Then I went back to the Wine code and fixed it.

Actually Working

Once the HTTP.sys fixes were in, responses actually came back with content. The freezing stopped.

That first real API response with actual data felt like winning the lottery. Until I noticed the response was always the same, because I had just put a fixed response in the handler to test things. It took me an hour to realise I was looking at my own test code’s output, not BC’s.

Once I removed my test code and let BC handle the responses properly, it actually worked. The web client isn’t functional yet, but that wasn’t the main goal. The core is there: compile extensions, publish apps, run tests. That’s what I was after. Heck, a Hello World which showed the code ran on Linux was enough for me at that point.

Directions EMEA 2025 Presentation

Last year, before Directions EMEA, Stefan Maron reached out. He had heard about my Wine experiments and wanted to collaborate on a presentation. We teamed up and put together a talk showing the journey, the technical details, and a live demo. Well, we skipped the live demo part since doing live demos of experimental software is a recipe for disaster.

Once I had something functional, Stefan and I measured it properly. Same test app, same pipeline, Windows versus Linux.

The first working build: Linux finished in 13.4 minutes versus 18.4 minutes on Windows. That’s 27% faster out of the gate. Not bad, not bad at all.

After optimisation (smaller base image, certain folders in RAM, no disk cleanup overhead), Linux dropped to 6.3 minutes. Windows stayed around 18 minutes. That’s 65% faster. But all this was on GitHub’s hosted runners; what if we could optimise further?

With caching on a self-hosted runner: 2 minutes 39 seconds total. At that point, we’d shifted the bottleneck from infrastructure to BC itself. Pure service startup time, waiting for metadata to load, was the limiting factor.

The container setup phase showed the biggest difference. Wine plus our minimal Linux image pulled and started in about 5 minutes. The Windows container took nearly 16 minutes for the same operation.

What Didn’t Work

The web client doesn’t work yet. I haven’t put much effort into it, since it wasn’t the main goal. The last time I tried, I had the web server running, but the NST and the web service just wouldn’t talk to each other. I stopped there, as Directions was coming up and I wanted to focus on the service tier.

The management endpoints don’t function. We had to write custom AL code to run tests via OData instead.

Some extensions that use uncommon .NET APIs crash immediately. If your extension does something exotic with Windows interop, it won’t work here.

What’s Next

This was always a proof of concept. The goal was to answer “can it be done?” and the answer is yes, with caveats.

Big disclaimer: This is purely a “see if I could” project. It’s not ready for production use, and I wouldn’t even recommend it for automated testing pipelines in its current state. It’s an experiment.

The code is up on GitHub.
Mine:
  • BC4Ubuntu is my first try. Don’t use it; it is messy and unoptimised.
  • wine64-bc4ubuntu has the custom Wine build.

Stefan’s:
  • BCOnLinuxBase is the optimised base image.
  • BCDevOnLinux is the actual Dockerfile for BC. This is the one to use. But be careful: with great power comes great responsibility.

I’ve also got the NST running on ARM hardware. Getting SQL Server to work on ARM is an entirely different project for another time.

Would I run production on this? Absolutely not. But that was never the point.

Sometimes you learn the most by doing things the “wrong” way. But it was a fun ride.

And can you keep a secret? More than 98% of the code was written by AI. If I had done it today, the last 2% would have been included as well.


Stefan Maron contributed significantly to the pipeline work. This was very much a joint effort.

Lazy replication of tables with NodeJS, Custom APIs and Webhooks in Business Central (Part 1)

“What if I could replicate a table to a system with an entirely different database, in a different language, on an entirely different OS, even?”

I wondered what I could do with webhooks that wasn’t just a standard use case. This post isn’t a how-to for creating custom API pages in AL or an explanation of how webhooks work; other people have done that, and they are way better at explaining it than me.

The flow I wanted:

  1. Add a generic Business Central AL extension that exposes the installed APIs and the Field table. This is the main extension; it won’t expose any data tables itself, but it handles the communication.
  2. Add an extension with custom API pages for the tables needing replication. (In this example it also contains the table, but normally it wouldn’t.)
  3. The server (called CopyCat) calls the main extension (ALleyCat) and gets a list of all API pages within a specific APIGroup (BadSector), along with their table structures.
  4. CopyCat connects to a secondary database and creates the tables.
  5. CopyCat copies all records.
  6. CopyCat subscribes via webhooks for further changes.
  7. Each webhook notification is just a record ID and the change type, e.g. Updated or Deleted, so CopyCat either requests the new/updated record or deletes it locally. It keeps the pending record IDs in a queue and only requests a predefined number of records per second, so the main service tier won’t be overloaded (see the sketch after this list).
  8. Periodically request table changes from BC; if new fields are detected, they are added to the replicated table.
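To give an idea of what steps 6-8 look like on the CopyCat side, here is a rough sketch. All names, the throttling value, and the placeholder stubs are hypothetical; the actual implementation follows in the later parts of this series.

```typescript
type ChangeType = "Created" | "Updated" | "Deleted";
interface WebhookNotification { recordId: string; changeType: ChangeType; }

const pending: WebhookNotification[] = []; // changes reported by BC webhooks
const RECORDS_PER_SECOND = 5;              // throttle so the NST isn't overloaded

// The webhook endpoint only enqueues; it never calls BC directly.
function onWebhookNotification(notification: WebhookNotification): void {
  pending.push(notification);
}

// Drain the queue at a fixed rate and apply each change to the secondary database.
setInterval(async () => {
  const batch = pending.splice(0, RECORDS_PER_SECOND);
  for (const change of batch) {
    if (change.changeType === "Deleted") {
      await deleteLocalRecord(change.recordId);
    } else {
      const record = await fetchFromBc(change.recordId); // custom API page in the BadSector group
      await upsertLocalRecord(record);
    }
  }
}, 1000);

// Placeholder stubs standing in for the BC OData call and the secondary-database writes.
async function fetchFromBc(recordId: string): Promise<unknown> {
  return { id: recordId }; // real version: GET the changed record from the custom API page
}
async function upsertLocalRecord(record: unknown): Promise<void> { /* write to secondary DB */ }
async function deleteLocalRecord(recordId: string): Promise<void> { /* delete from secondary DB */ }
```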

Custom DotNet assemblies in AL

I was just making the last touch-ups for the next posts when I noticed that there aren’t that many examples of how to use custom assemblies in AL code.

With the release of the full version of Business Central on the 1st of October, info on this topic is becoming more and more relevant. So this is just a quick post based on my MQTT client example.

This is basically just the same code, though some things have been cut out to slim it down a bit. Everything related to sending has been removed.

I used txt2al and renamed a few of my variables, since some of them collided with new keywords in AL.

Why didn’t I make it from scratch in AL? Two reasons. First, txt2al is just a great tool and it works almost flawlessly. Secondly, VS Code is a terrible place to start writing AL code that contains DotNet; there is almost no help whatsoever when using DotNet. So until this is fixed, I would recommend using C/SIDE and then converting the code afterwards.

The following has been shown elsewhere, but I just want to reiterate it: to use custom assemblies, you need to specify which directory they are in and then define them afterwards.

Here, I have mine in a subdirectory of my project folder called .netpackages.

This is then referred to in the .vscode\settings.json file like this.

With this done, it is now possible to use the assemblies in the project, but first we need to refer to them, like this.

This is where C/SIDE comes in handy, as these lines are not discoverable in VS Code. You just have to know what to write; not impossible, just really annoying.
The same can be said about DotNet triggers: you just have to know them, as there is no autocomplete.
So you can see that DotNet development isn’t really a first-class citizen in AL yet.

E.g. you are expected to just know the trigger name and all its parameters, like the first line in the code below. No help.

Summary:
So if you want to do DotNet in AL, just add the variables and triggers in C/SIDE and then do a txt2al. You will thank yourself afterwards.

The entire project can be found here https://github.com/SShadowS/AL-MQTT-example