How I Benchmark LLMs on AL Code

When I started evaluating LLMs for Business Central development, I ran into a problem. The standard code generation benchmarks like HumanEval and MBPP measure Python performance. They tell you nothing about whether a model can write AL, the language used in Microsoft Dynamics 365 Business Central.

So I built CentralGauge, an open source benchmark specifically for AL code generation. This post explains the methodology, the challenges I encountered, and what I learned about how different models perform on AL code.

The Challenge of Domain-Specific Code Generation

AL is a niche language. Unlike Python or JavaScript, there are far fewer AL code samples in the training data of most LLMs. This creates an interesting test: can models generalise their programming knowledge to an unfamiliar domain?

AL has several characteristics that make it distinct:

Syntax differences: AL uses procedure instead of function, begin/end blocks instead of braces, and has triggers that fire on database operations. The object model is unique with tables, pages, codeunits, reports, and queries.

Business Central conventions: Every AL object needs a numeric ID. Tables require proper field captions, data classification settings, and key definitions. Pages must reference a source table and define layout in a specific structure.

Standard library patterns: Working with records, assertions, and test frameworks follows Business Central conventions. The Record type, Assert codeunit, and page testing patterns are specific to this ecosystem.
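To make these differences concrete, here is a minimal, illustrative codeunit (the object ID and names are arbitrary) showing the procedure keyword, begin/end blocks, and the Record type against the standard Customer table:

codeunit 50100 "Customer Name Mgt."
{
    // 'procedure' instead of 'function', begin/end instead of braces
    procedure GetCustomerName(CustomerNo: Code[20]): Text
    var
        Customer: Record Customer; // Record is the built-in type for table access
    begin
        if Customer.Get(CustomerNo) then
            exit(Customer.Name);
        exit('');
    end;
}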

These quirks mean a model might be excellent at Python but struggle with AL. Generic benchmarks cannot reveal this gap.

How CentralGauge Works

The benchmark follows a straightforward flow:

Task YAML → LLM call → code extraction → BC compile → tests, with a retry loop on failure

I defined 56 tasks organized into three difficulty tiers: easy, medium, and hard. Each task lives in a YAML file that specifies what the model should create, what tests must pass, and how to score the result.

The actual compilation and testing happen inside a Business Central container. This is the same environment developers use, ensuring benchmark tests are real-world viable rather than syntactic approximations.

I run multiple models in parallel against the same tasks. This enables direct comparison under identical conditions.

Task Design Philosophy

Creating good benchmark tasks requires discipline. The goal is to test whether a model knows AL, not whether it can follow instructions.

Here is a real task from the benchmark:

id: CG-AL-E001
description: >-
  Create a simple AL table called "Product Category" with ID 70000.
  The table should have the following fields:
  - Code (Code[20], primary key)
  - Description (Text[100])
  - Active (Boolean, default true)
  - Created Date (Date)

  Include proper captions and data classification.
expected:
  compile: true
  testApp: tests/al/easy/CG-AL-E001.Test.al

Notice what the task does NOT say: it does not explain that AL uses InitValue for defaults, or how to define a primary key. The model should know this. If it does not, that is a valid test failure.
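For reference, here is one possible shape of a passing answer. This is a sketch rather than the benchmark’s reference solution, but it shows the knowledge being tested: InitValue for the Boolean default, an explicit primary key, captions, and data classification.

table 70000 "Product Category"
{
    Caption = 'Product Category';
    DataClassification = CustomerContent;

    fields
    {
        field(1; "Code"; Code[20])
        {
            Caption = 'Code';
            DataClassification = CustomerContent;
        }
        field(2; Description; Text[100])
        {
            Caption = 'Description';
            DataClassification = CustomerContent;
        }
        field(3; Active; Boolean)
        {
            Caption = 'Active';
            InitValue = true; // AL's way of declaring a default value
            DataClassification = CustomerContent;
        }
        field(4; "Created Date"; Date)
        {
            Caption = 'Created Date';
            DataClassification = CustomerContent;
        }
    }

    keys
    {
        key(PK; "Code") // primary key on the Code field
        {
            Clustered = true;
        }
    }
}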

Verifiable through real tests: Every task includes a test codeunit that runs in the BC container. These tests check that the generated code actually works, not just that it compiles. For example, a table test verifies that default values persist correctly after insert operations.
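As an illustration, a test along these lines checks exactly that; the object ID is arbitrary and the exact name of the Assert codeunit varies between test framework versions, so treat it as a sketch.

codeunit 70100 "Product Category Tests"
{
    Subtype = Test;

    [Test]
    procedure DefaultActiveValuePersistsAfterInsert()
    var
        ProductCategory: Record "Product Category";
        Assert: Codeunit Assert;
    begin
        // Insert a record without touching the Active field
        ProductCategory.Init();
        ProductCategory.Code := 'TEST01';
        ProductCategory.Insert(true);

        // Read it back and check that the InitValue default was persisted
        ProductCategory.Get('TEST01');
        Assert.IsTrue(ProductCategory.Active, 'Active should default to true');
    end;
}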

Clear requirements without ambiguity: I specify exact field names, types, and behaviours. Vague specifications like “create a useful table” produce unmeasurable results.

The Two-Attempt Mechanism

A single-attempt benchmark misses something important: the ability to self-correct. Real development involves fixing mistakes. I wanted to measure that capability.

Each task allows two attempts:

Attempt one: The model receives the task description and generates code from scratch.

Attempt two: If attempt one fails (compilation error or test failure), the model receives the error output and must provide a fix. The fix comes as a unified diff, forcing the model to reason about what specifically went wrong.
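As a hypothetical illustration, an attempt-two response for the Product Category task above might look like this if the first attempt forgot the Boolean default (the file name and hunk positions are made up):

--- a/ProductCategory.Table.al
+++ b/ProductCategory.Table.al
@@ -18,4 +18,5 @@
         field(3; Active; Boolean)
         {
             Caption = 'Active';
+            InitValue = true;
         }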

This mechanism reveals different model behaviours. Some models consistently pass on the first attempt. Others frequently fail initially but recover on attempt two, showing strong debugging capability. A few models fail both attempts, unable to either generate correct code or diagnose their mistakes.

The scoring reflects this: passing on attempt two incurs a 10-point penalty compared to passing on attempt one. The penalty is enough to differentiate first-try success from eventual success, but not so severe that self-correction becomes worthless.

Scoring Methodology

Fair comparison requires transparent scoring. I use a point-based system:

Scoring breakdown: 50 points compilation, 30 points tests, 10 points required patterns, 10 points forbidden patterns

A task passes when the score reaches 70 or higher with no critical failures. The attempt penalty subtracts 10 points per additional attempt needed.
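For example, code that compiles (50), passes all tests (30), and satisfies both pattern checks (10 + 10) scores 100 on attempt one, or 90 after the penalty if it only succeeds on attempt two; both comfortably clear the 70-point threshold.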

This weighting reflects priorities. Compilation is foundational (50 points) because non-compiling code provides zero value. Test passage validates correctness (30 points). Pattern checks (20 points combined) catch specific issues like missing required attributes or the presence of deprecated constructs.

The pattern checks serve a specific purpose. Some tasks require the model to use specific AL features (such as setting Access = Public on a codeunit). Others forbid certain patterns (like using deprecated syntax). These checks ensure the model demonstrates knowledge beyond “code that happens to work.”

Parallel Execution at Scale

Running benchmarks across multiple models introduces practical challenges.

Parallel execution: one task runs simultaneously across Opus, Sonnet, GPT, and Gemini models

Rate limiting varies by provider: Anthropic, OpenAI, and Google each have different quotas for requests per minute and tokens per minute. The benchmark respects these limits through a token bucket rate limiter that tracks usage per provider.

The BC container is a shared resource: Unlike LLM calls, which can run in parallel, compilation must be serialised. The container becomes unstable if multiple compilations run simultaneously. A FIFO queue ensures only one compilation happens at a time while parallel LLM calls continue.

Cost tracking enables comparison: Token usage and estimated costs are recorded per task per model. This reveals which models are cost effective for AL code generation versus which consume excessive tokens for marginal quality improvements.

What I Learned

Running benchmarks across multiple models revealed clear performance differences.

Model pass rates on AL code generation benchmark

Opus 4.5 leads at 66%: Claude’s largest model achieved the highest pass rate, successfully completing two thirds of the benchmark tasks. Gemini 3 Pro followed at 61%, with Sonnet 4.5 and GPT 5.2 in the mid 50s.

The gap between top and bottom is significant: From Opus at 66% to the budget models around 37%, the spread is nearly 30 percentage points. This matters for production use where reliability is critical.

Self-correction quality differs from generation quality: Some models generate mediocre first attempts but excel at debugging when given error feedback. Others produce good initial code but struggle to interpret compilation errors. The two-attempt mechanism exposed these differences.

Cost efficiency varies dramatically: Gemini 3 Pro used four times as many tokens as Opus but cost roughly a fifth as much ($0.50 vs $2.77). Token pricing differs so much between providers that the cheapest model per token is not necessarily the cheapest per task.

I publish the latest benchmark results at ai.sshadows.dk, including detailed breakdowns by task and model. The leaderboard updates as I test new models.

The Code and How to Contribute

CentralGauge is open source at github.com/SShadowS/CentralGuage. The repository includes all 56 benchmark tasks, the execution framework, and documentation for adding new tasks.

If you work with Business Central and want to contribute tasks, the format is straightforward. Define the task in YAML, write a test codeunit that validates the expected behavior, and submit a pull request. The benchmark improves as the task set grows to cover more AL patterns and edge cases.

Conclusion

Generic code benchmarks cannot tell you how an LLM will perform on your specific domain. AL code generation requires understanding Business Central conventions, object structures, and syntax patterns that differ from mainstream languages.

By building a dedicated benchmark with curated tasks, real compilation, and actual test execution, I can measure what matters: whether a model can produce working Business Central code. The two-attempt mechanism adds nuance by measuring self-correction alongside generation.

The results have been informative. Model ranking on AL tasks does not match generic benchmark rankings. Cost per completion and self-correction ability both vary in ways that affect practical utility.

If you are evaluating LLMs for Business Central development, or any niche domain, consider building similar targeted benchmarks.

See you next Friday for another write-up.

Native AL Language Server Support in Claude Code

If you’re using Claude Code for Business Central development, you’ve probably noticed that while it’s great at writing AL code, it doesn’t truly understand your project structure. It can’t jump to definitions, find references, or see how your objects relate to each other.

Until now.

I’ve built native AL Language Server Protocol (LSP) integration for Claude Code. This means Claude now has the same code intelligence that VS Code has: symbol awareness, navigation, and structural understanding of your AL codebase.

Wait, didn’t you already do this?

Yes! A few months ago I contributed AL language support to Serena MCP, which brought symbol-aware code editing to Business Central development. Serena works with any MCP-compatible agent: Claude Desktop, Cursor, Cline, and others.

This native Claude Code integration is different. Instead of going through MCP, it hooks directly into Claude Code’s built-in language server support. The result is a more polished, seamless experience specifically for Claude Code users.

Serena MCP: Universal, works everywhere, requires MCP setup
Native LSP: Claude Code only, tighter integration, zero-config once installed

If you’re using Claude Code as your primary tool, the native integration is the way to go. If you switch between different AI coding assistants, Serena gives you AL support across all of them.

What is this?

The AL Language Server is the engine behind VS Code’s AL extension. It’s what powers “Go to Definition”, “Find All References”, symbol search, and all the other navigation features you use daily.

By integrating this directly into Claude Code, the AI assistant now has access to:

  • Document symbols: all tables, codeunits, pages, fields, procedures in a file
  • Workspace symbols: search across your entire project
  • Go to Definition: jump to where something is defined
  • Go to Implementation: jump to implementations
  • Find References: see everywhere something is used
  • Hover information: type information and documentation
  • Call hierarchy: see what calls what, incoming and outgoing
  • Multi-project support: workspaces with multiple AL apps work fully

This isn’t regex pattern matching. This is the actual Microsoft AL compiler understanding your code.

Why does this matter?

Without LSP, Claude Code treats your AL files as plain text. It can read them, but it doesn’t understand the relationships between objects. Ask it to find all the places where Customer."No." is used, and it has to grep through files hoping to find matches.

With LSP, Claude can ask the language server directly. It knows that Customer is a table, that "No." is a field of type Code[20], and it can find every reference instantly.

The difference is like asking someone to find a book in a library by reading every page versus using the catalog system.

Real example

Here’s what Claude Code can do with LSP on a Customer table:

Go To Definition - On CustomerType enum reference at line 77:
→ Defined in CustomerType.Enum.al:1:12

Hover - Same position shows type info:
Enum CustomerType

Document Symbols - Full symbol tree for Customer.Table.al:
Table 50000 "TEST Customer" (Class) - Line 1
  fields (Class) - Line 6
    "No.": Code[20] (Field) - Line 8
      OnValidate() (Function) - Line 13
    Name: Text[100] (Field) - Line 22
    "Customer Type": Enum 50000 CustomerType (Field) - Line 77
    Balance: Decimal (Field) - Line 83
    ...
  keys (Class) - Line 131
    Key PK: "No." (Key) - Line 133
    ...
  OnInsert() (Function) - Line 158
  OnModify() (Function) - Line 168
  UpdateSearchName() (Function) - Line 190
  CheckCreditLimit() (Function) - Line 195
  GetDisplayName(): Text (Function) - Line 206

Every field with its type. Every key with its composition. Every procedure with its line number. Claude can now navigate your code like a developer would.

Requirements

  • Claude Code 2.1.0 or later. Earlier versions have a bug that prevents built-in LSPs from working.
  • VS Code with AL Language extension. The plugin uses Microsoft’s AL Language Server from your VS Code installation.
  • Python 3.10+ in your PATH
  • A Business Central project with standard AL project structure and app.json

Installation

Step 1: Enable LSP Tool

Set the environment variable before starting Claude Code. Even though LSPs are now supported, I don’t think they are production-ready in every scenario yet, hence the explicit opt-in:

# PowerShell (current session)
$env:ENABLE_LSP_TOOL = "1"
claude

# PowerShell (permanent)
[Environment]::SetEnvironmentVariable("ENABLE_LSP_TOOL", "1", "User")

# Bash
export ENABLE_LSP_TOOL=1
claude

Step 2: Install the Plugin

  1. Run claude
  2. /plugin marketplace add SShadowS/claude-code-lsps
  3. Type /plugins
  4. Tab to Marketplaces
  5. Select claude-code-lsps
  6. Browse plugins
  7. Select al-language-server-python with spacebar
  8. Press “i” to install
  9. Restart Claude Code

That’s it. The plugin automatically finds the newest AL extension version in your VS Code extensions folder.

Repository: github.com/SShadowS/claude-code-lsps

What’s next?

The current wrapper is Python-based. A few things I’m looking at:

  • Go-compiled binaries for faster startup and no runtime dependencies
  • Better error handling for more graceful recovery when the language server hiccups
  • Testing on more setups with different VS Code versions and extension configurations

Try it out and feedback

If you’re doing BC development with Claude Code, give this a try. The difference in code navigation and understanding should be significant.

I’d love to hear your feedback: what works and what doesn’t.

If you open an issue on GitHub, please attach %TEMP%/al-lsp-wrapper.log, as it helps me a lot during debugging. This log will be disabled in a few weeks; I just need it during the initial rollout.

Repository: github.com/SShadowS/claude-code-lsps


This is part of my ongoing work on AI tooling for Business Central development. See also: CentralGauge for benchmarking LLMs on AL code, and my MCP servers for BC integration.

Lazy replication of tables with NodeJS, Custom APIs and Webhooks in Business Central (Part 1)

“What if I could replicate a table to a system in an entirely different database with a different language in an entirely different OS evenly?”

I wondered what I could do with webhooks that wasn’t just a standard use case. This post isn’t a how-to for creating custom API pages in AL or an explanation of how webhooks work; other people have covered that, and they are way better at explaining it than me.

The flow I wanted:

  1. Add a generic Business Central AL extension that exposes the installed APIs and the Field table. This is the main extension; it will not expose any data tables itself but will handle all the communication.
  2. Add an extension with custom API pages for the tables that need replication. (In this example it will also contain the table, but normally it would not.) A minimal sketch of such an API page follows after this list.
  3. A server (called CopyCat) will call the main extension (ALleyCat) and get a list of all API pages within a specific APIGroup (BadSector), together with their table structures.
  4. CopyCat will connect to a secondary database and create the tables.
  5. CopyCat will copy all records.
  6. CopyCat will subscribe via webhooks for further changes.
  7. Each webhook notification is just a record ID and the change type, e.g. Updated or Deleted, so CopyCat will either request the new/updated record or delete it as needed.
    It will keep a list of new or modified records in an array and only request a predefined number of records per second, so the main service tier won’t be overloaded.
  8. CopyCat will periodically request table changes from BC; if new fields are detected, they are added to the replicated table.
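To give a feel for step 2, here is a minimal, illustrative API page. The object ID, publisher, and the Item source table are placeholders; the real extension would contain one such page per replicated table.

page 50110 "Repl. Item API"
{
    PageType = API;
    APIPublisher = 'sshadows';
    APIGroup = 'BadSector';
    APIVersion = 'v1.0';
    EntityName = 'replItem';
    EntitySetName = 'replItems';
    SourceTable = Item;
    DelayedInsert = true;

    layout
    {
        area(Content)
        {
            repeater(Records)
            {
                field(number; Rec."No.") { ApplicationArea = All; }
                field(description; Rec.Description) { ApplicationArea = All; }
            }
        }
    }
}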
Continue reading “Lazy replication of tables with NodeJS, Custom APIs and Webhooks in Business Central (Part 1)”

Custom DotNet assemblies in AL

While making the last touch-ups for the next posts, I noticed that there are not that many examples of how to use custom assemblies in AL code.

With the full version of Business Central releasing on the 1st of October, information on this topic is becoming more and more relevant. So this is just a quick post based on my MQTT client example.

This is basically the same code, though some things have been cut out to slim it down a bit. Everything related to sending has been removed.

I used txt2al and renamed a few of my variables, as a few of them collided with new keywords in AL.

Why didn’t I write it from scratch in AL? Two reasons. First, txt2al is a great tool and works almost flawlessly. Second, VS Code is a terrible place to start writing AL code that contains DotNet; there is almost no help whatsoever when using DotNet. So until this is fixed, I recommend using C/SIDE and converting the code afterwards.

The following has been shown elsewhere, but I just want to reiterate it. To use custom assemblies you need to specify which directory they are in and then declare them afterwards.

Here, I have mine in a subdirectory of my project folder called .netpackages.

This is then referenced in the .vscode\settings.json file via the AL extension’s al.assemblyProbingPaths setting, which in my case points at the .netpackages folder.

With this in place it is now possible to use the assemblies in the project, but first we need to declare them, like this.
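A minimal sketch of such a declaration block; the assembly and type names here are illustrative (based on the M2Mqtt library the example uses), so check the repository for the exact names.

dotnet
{
    assembly(M2Mqtt)
    {
        // Full .NET type name on the left, AL alias on the right
        type(uPLibrary.Networking.M2Mqtt.MqttClient; MqttClient) { }
        type(uPLibrary.Networking.M2Mqtt.Messages.MqttMsgPublishEventArgs; MqttMsgPublishEventArgs) { }
    }
}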

This is where C/SIDE comes in handy, as these lines are impossible to discover from within VS Code. You just have to know what to write; not impossible, just really annoying.
The same can be said about DotNet triggers: you just have to know them, as there is no autocomplete.
So you can see that DotNet development isn’t really well supported in AL yet.

For example, you are expected to just know the trigger name and all its parameters, like the trigger declaration in the sketch below. No help.
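For illustration, a subscriber to the MQTT client’s publish-received event could look roughly like this; the type, event, and property names are again illustrative, so see the repository for the working code.

codeunit 50120 "MQTT Subscriber"
{
    var
        MqttClient: DotNet MqttClient;

    // DotNet event trigger: you have to know the full signature, there is no IntelliSense
    trigger MqttClient::MqttMsgPublishReceived(sender: Variant; e: DotNet MqttMsgPublishEventArgs)
    begin
        Message('Received a message on topic %1', e.Topic);
    end;
}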

Summary:
So if you want to do DotNet in AL, just add the variables and triggers in C/SIDE and then run txt2al. You will thank yourself afterwards.

The entire project can be found at https://github.com/SShadowS/AL-MQTT-example.