When I started evaluating LLMs for Business Central development, I ran into a problem. The standard code generation benchmarks like HumanEval and MBPP measure Python performance. They tell you nothing about whether a model can write AL, the language used in Microsoft Dynamics 365 Business Central.
So I built CentralGauge, an open source benchmark specifically for AL code generation. This post explains the methodology, the challenges I encountered, and what I learned about how different models perform on AL code.
The Challenge of Domain-Specific Code Generation
AL is a niche language. Unlike Python or JavaScript, there are far fewer AL code samples in the training data of most LLMs. This creates an interesting test: can models generalise their programming knowledge to an unfamiliar domain?
AL has several characteristics that make it distinct:
Syntax differences: AL uses procedure instead of function, begin/end blocks instead of braces, and has triggers that fire on database operations. The object model is unique, built around tables, pages, codeunits, reports, and queries; a short sketch after this list shows the syntax in practice.
Business Central conventions: Every AL object needs a numeric ID. Tables require proper field captions, data classification settings, and key definitions. Pages must reference a source table and define layout in a specific structure.
Standard library patterns: Working with records, assertions, and test frameworks follows Business Central conventions. The Record type, Assert codeunit, and page testing patterns are specific to this ecosystem.
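To make the syntax points concrete, here is a small sketch of a hypothetical table object. The object ID 70050 and the names are made up for this post rather than taken from the benchmark; the point is simply to show procedure declarations, begin/end blocks, and a database trigger in one place.

```al
// Hypothetical example; the object ID and names are illustrative only.
table 70050 "Demo Entry"
{
    DataClassification = CustomerContent;

    fields
    {
        field(1; "Entry No."; Integer) { Caption = 'Entry No.'; }
        field(2; Description; Text[100]) { Caption = 'Description'; }
    }

    keys
    {
        key(PK; "Entry No.") { Clustered = true; }
    }

    // Triggers fire automatically on database operations
    trigger OnInsert()
    begin
        SetDefaultDescription();
    end;

    // "procedure" plus begin/end blocks take the place of functions with braces
    local procedure SetDefaultDescription()
    begin
        if Description = '' then
            Description := 'New entry';
    end;
}
```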
These quirks mean a model might be excellent at Python but struggle with AL. Generic benchmarks cannot reveal this gap.
How CentralGauge Works
The benchmark follows a straightforward flow:
I defined 56 tasks organised into three difficulty tiers: easy, medium, and hard. Each task lives in a YAML file that specifies what the model should create, what tests must pass, and how to score the result.
The actual compilation and testing happen inside a Business Central container. This is the same environment developers use, which keeps the benchmark grounded in real-world behaviour rather than syntactic approximation.
I run multiple models in parallel against the same tasks. This enables direct comparison under identical conditions.
Task Design Philosophy
Creating good benchmark tasks requires discipline. The goal is to test whether a model knows AL, not whether it can follow instructions.
Here is a real task from the benchmark:
```yaml
id: CG-AL-E001
description: >-
  Create a simple AL table called "Product Category" with ID 70000.
  The table should have the following fields:
  - Code (Code[20], primary key)
  - Description (Text[100])
  - Active (Boolean, default true)
  - Created Date (Date)
  Include proper captions and data classification.
expected:
  compile: true
  testApp: tests/al/easy/CG-AL-E001.Test.al
```
Notice what the task does NOT say: it does not explain that AL uses InitValue for defaults, or how to define a primary key. The model should know this. If it does not, that is a valid test failure.
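For reference, here is one plausible shape of a passing answer. This is my own sketch rather than the benchmark's reference solution, and details such as field numbers and caption strings could reasonably differ.

```al
// One plausible solution sketch; not the benchmark's reference answer.
table 70000 "Product Category"
{
    Caption = 'Product Category';
    DataClassification = CustomerContent;

    fields
    {
        field(1; "Code"; Code[20])
        {
            Caption = 'Code';
        }
        field(2; Description; Text[100])
        {
            Caption = 'Description';
        }
        field(3; Active; Boolean)
        {
            Caption = 'Active';
            InitValue = true; // AL's way of declaring a default value
        }
        field(4; "Created Date"; Date)
        {
            Caption = 'Created Date';
        }
    }

    keys
    {
        key(PK; "Code")
        {
            Clustered = true; // primary key definition
        }
    }
}
```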
Verifiable through real tests: Every task includes a test codeunit that runs in the BC container. These tests check that the generated code actually works, not just that it compiles. For example, a table test verifies that default values persist correctly after insert operations (a sketch of such a test follows these points).
Clear requirements without ambiguity: I specify exact field names, types, and behaviours. Vague specifications like “create a useful table” produce unmeasurable results.
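To show what those tests look like in practice, here is a sketch in the spirit of the default-value check mentioned above. The codeunit ID, the test name, and the use of the standard Assert codeunit are my illustrative choices, not the benchmark's actual test file.

```al
// Sketch only: the ID and names are illustrative, not taken from the benchmark.
codeunit 70100 "Product Category Tests"
{
    Subtype = Test;

    [Test]
    procedure ActiveDefaultsToTrueAfterInsert()
    var
        ProductCategory: Record "Product Category";
        Assert: Codeunit Assert;
    begin
        // Insert a record without touching the Active field
        ProductCategory.Init();
        ProductCategory.Validate("Code", 'TEST');
        ProductCategory.Insert(true);

        // Re-read the record and verify that the InitValue default persisted
        ProductCategory.Get('TEST');
        Assert.IsTrue(ProductCategory.Active, 'Active should default to true after insert.');
    end;
}
```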
The Two-Attempt Mechanism
A single-attempt benchmark misses something important: the ability to self-correct. Real development involves fixing mistakes. I wanted to measure that capability.
Each task allows two attempts:
Attempt one: The model receives the task description and generates code from scratch.
Attempt two: If attempt one fails (compilation error or test failure), the model receives the error output and must provide a fix. The fix comes as a unified diff, forcing the model to reason about what specifically went wrong.
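As a hypothetical illustration (the file name, hunk position, and the incorrect DefaultValue property are all invented for this example), a second-attempt fix might look like this:

```diff
--- a/ProductCategory.Table.al
+++ b/ProductCategory.Table.al
@@ -12,5 +12,5 @@
         field(3; Active; Boolean)
         {
             Caption = 'Active';
-            DefaultValue = true;
+            InitValue = true;
         }
```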
This mechanism reveals different model behaviours. Some models consistently pass on the first attempt. Others frequently fail initially but recover on attempt two, showing strong debugging capability. A few models fail both attempts, unable to either generate correct code or diagnose their mistakes.
The scoring reflects this: passing on attempt two incurs a 10-point penalty compared to passing on attempt one. The penalty is enough to differentiate first-try success from eventual success, but not so severe that self-correction becomes worthless.
Scoring Methodology
Fair comparison requires transparent scoring. I use a point-based system: 50 points for successful compilation, 30 points for passing tests, and 20 points (combined) for pattern checks.
A task passes when the score reaches 70 or higher with no critical failures. The attempt penalty subtracts 10 points per additional attempt needed.
This weighting reflects priorities. Compilation is foundational because non-compiling code provides zero value. Test passage validates correctness. Pattern checks catch specific issues like missing required attributes or the presence of deprecated constructs.
The pattern checks serve a specific purpose. Some tasks require the model to use specific AL features (such as setting Access = Public on a codeunit). Others forbid certain patterns (like using deprecated syntax). These checks ensure the model demonstrates knowledge beyond “code that happens to work.”
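For instance, a pattern check for the public-access requirement would look for something along these lines in the generated object (the ID and names here are invented for illustration):

```al
// A pattern check might require an explicit Access property like this.
codeunit 70110 "Category Management"
{
    Access = Public;

    procedure IsEnabled(): Boolean
    begin
        exit(true);
    end;
}
```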
Parallel Execution at Scale
Running benchmarks across multiple models introduces practical challenges.
Rate limiting varies by provider: Anthropic, OpenAI, and Google each have different quotas for requests per minute and tokens per minute. The benchmark respects these limits through a token bucket rate limiter that tracks usage per provider.
The BC container is a shared resource: Unlike LLM calls, which can run in parallel, compilation must be serialised. The container becomes unstable if multiple compilations run simultaneously. A FIFO queue ensures only one compilation happens at a time while parallel LLM calls continue.
Cost tracking enables comparison: Token usage and estimated costs are recorded per task per model. This reveals which models are cost effective for AL code generation versus which consume excessive tokens for marginal quality improvements.
What I Learned
Running benchmarks across multiple models revealed clear performance differences.
Opus 4.5 leads at 66%: Claude’s largest model achieved the highest pass rate, successfully completing two thirds of the benchmark tasks. Gemini 3 Pro followed at 61%, with Sonnet 4.5 and GPT 5.2 in the mid 50s.
The gap between top and bottom is significant: From Opus at 66% to the budget models around 37%, the spread is nearly 30 percentage points. This matters for production use where reliability is critical.
Self-correction quality differs from generation quality: Some models generate mediocre first attempts but excel at debugging when given error feedback. Others produce good initial code but struggle to interpret compilation errors. The two-attempt mechanism exposed these differences.
Cost efficiency varies dramatically: Gemini 3 Pro used four times as many tokens as Opus but cost about a fifth as much ($0.50 vs $2.77). Token pricing differs so much between providers that the cheapest model per token is not necessarily the cheapest per task.
I publish the latest benchmark results at ai.sshadows.dk, including detailed breakdowns by task and model. The leaderboard updates as I test new models.
The Code and How to Contribute
CentralGauge is open source at github.com/SShadowS/CentralGuage. The repository includes all 56 benchmark tasks, the execution framework, and documentation for adding new tasks.
If you work with Business Central and want to contribute tasks, the format is straightforward. Define the task in YAML, write a test codeunit that validates the expected behaviour, and submit a pull request. The benchmark improves as the task set grows to cover more AL patterns and edge cases.
Conclusion
Generic code benchmarks cannot tell you how an LLM will perform on your specific domain. AL code generation requires understanding Business Central conventions, object structures, and syntax patterns that differ from mainstream languages.
By building a dedicated benchmark with curated tasks, real compilation, and actual test execution, I can measure what matters: whether a model can produce working Business Central code. The two-attempt mechanism adds nuance by measuring self-correction alongside generation.
The results have been informative. Model rankings on AL tasks do not match generic benchmark rankings, and cost per completion and self-correction ability both vary in ways that affect practical utility.
If you are evaluating LLMs for Business Central development, or any niche domain, consider building similar targeted benchmarks.
See you next Friday for another write-up.