Business Central on Linux? Here, hold my beer!

Sorry if this one is a bit long, but think of it more as a brain dump. I’ve been asked repeatedly how I managed to get Business Central running on Linux using Wine, so here’s the full-ish story.

Business Central doesn’t run on Linux. Everyone knows this. Microsoft built it for Windows, and that’s that.

So naturally, I had to try.

What started as curiosity turned into months of reverse engineering, debugging Wine internals, and learning more about Windows APIs than I ever wanted to know. Stefan Maron and I presented the results at Directions EMEA, and people kept asking for a proper write-up. Here it is. (Oh no, already over-promised.)

Why Even Try?

Because nobody had done it before. That’s it. That’s the reason.
Well, Microsoft probably has, but they also have the source code. They can target Linux and comment out the functionality that won’t work there. That kind of cheating isn’t available to the rest of us.

The cost savings and performance benefits we discovered later were a nice bonus. Windows runners on GitHub Actions cost twice as much as Linux runners. Builds run faster. Container startup is dramatically quicker. But none of that was the original motivation.

Sometimes you do things just to see if they’re possible.

The Native .NET Attempt

My first approach was simple. BC runs on .NET Core now, and .NET Core runs on Linux. Problem solved, right?

Not even close.

I copied the BC service tier files to a Linux machine and tried to start them. Immediately, it crashed.

The moment you try to start the BC service tier on Linux, it crashes while looking for Windows APIs. The code makes assumptions everywhere. It wants to know which Active Directory domain you’re in, I think to build the complete web service URLs. It assumes Windows authentication is available. These aren’t just preference checks that fail gracefully. The code is designed for a Windows environment.

I spent a few evenings trying different things, but it became clear this wasn’t going to work. BC has Windows baked into its DNA. So I had to try something else.

Enter Wine

If you can’t make the code Linux native, maybe you can make Linux look enough like Windows. That’s what Wine does. It’s a compatibility layer that translates Windows API calls to Linux equivalents.

Wine has been around forever. It runs thousands of applications. Mostly games and desktop software. Heck, Proton, which Steam uses to run Windows games on Linux, is based on Wine. The keyword there is “mostly.” When I checked Wine’s compatibility database, there were maybe 50 server applications listed, and 48 of them were game servers. And that was out of over 16,000 supported programs.

Server software is a different beast. It uses APIs that desktop applications never touch. HTTP.sys for web serving. Advanced authentication protocols. Service management. Wine’s developers understandably focused on what most people actually use.

But Wine is open source. If something is missing, you can add it. Well, if you can write C, which I last did in university more than 20 years ago. But I have something better than C skills: debugging skills, and a stubborn refusal to give up. Well, energy drinks, and AI. Lots of AI.

The Debug Loop

My approach was brute force. Start the BC service tier under Wine with full debug logging enabled. Watch it crash. Find out which API call failed. Implement or fix that API in Wine. Repeat.
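In practice that loop was little more than a wrapper around Wine’s debug channels. Here’s a minimal Python sketch of the idea: start the service tier with relay and exception logging enabled and skim the tail of the log for the last API calls before the crash. The paths, the executable name, and the channel list are examples, not my exact setup.

import os
import subprocess

# Hypothetical paths; adjust to wherever the service tier files live.
SERVICE_EXE = "/opt/bcserver/Microsoft.Dynamics.Nav.Server.exe"
LOG_FILE = "/tmp/bc-wine.log"

env = dict(os.environ)
# +relay logs every Win32 API call Wine forwards, +seh logs structured exceptions.
env["WINEDEBUG"] = "+relay,+seh,+tid"

with open(LOG_FILE, "w") as log:
    subprocess.run(["wine", SERVICE_EXE], env=env, stdout=log, stderr=subprocess.STDOUT)

# Crude triage: look at the last API calls and exceptions before the crash.
with open(LOG_FILE) as log:
    tail = log.readlines()[-500:]
for line in tail:
    if "Call " in line or "exception" in line.lower():
        print(line, end="")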

The first crash came immediately. Some localisation API wasn’t returning what BC expected. Easy fix. Then the next crash. And the next.

I kept two resources open at all times: Microsoft’s official API documentation and a decompiler targeting BC’s assemblies. The docs told me what an API was supposed to do. The decompiled code told me exactly how BC was using it. Just a matter of connecting the dots.

Some APIs were straightforward translations. Others required understanding subtle Windows behaviours that aren’t documented anywhere. Why does this particular call return data in this specific format? Because some Windows component, somewhere, expects it that way, and BC inherited that expectation.

Plus, it didn’t help that the Microsoft documentation is often incomplete and just includes placeholder info for some parameters and return values.

I even had to program my own Event Log because Wine doesn’t have one. So the entire task was just as much a tooling test as a programming one. I created loads of scripts to iterate over the logs and filter out just the entries I needed.

Getting It to Start

Before the service could even stay running, several hurdles arose that had nothing to do with Wine’s API coverage.

SQL Server encryption was an early roadblock. Not because it didn’t work, but because it was a hassle to set up. BC insists on encrypted database connections, but the PowerShell cmdlets that normally configure certificates and connection strings don’t run on Linux. I had to reverse engineer what the cmdlets actually do and replicate each step manually.
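To give an idea of what replicating the cmdlets means: Set-NAVServerConfiguration mostly just edits key/value pairs in the service tier’s CustomSettings.config. A rough Python sketch of that one small piece; the file path and the key names are written from memory, so treat them as assumptions:

import xml.etree.ElementTree as ET

# Hypothetical path under the Wine prefix; key names may vary by BC version.
CONFIG = "/root/.wine/drive_c/bcserver/CustomSettings.config"

settings = {
    "DatabaseServer": "sqlserver",           # the Linux SQL Server container
    "DatabaseName": "CRONUS",
    "EnableSqlConnectionEncryption": "true",
    "TrustSQLServerCertificate": "true",     # skip full certificate validation in dev
}

tree = ET.parse(CONFIG)
for node in tree.getroot().iter("add"):
    if node.get("key") in settings:
        node.set("value", settings[node.get("key")])
tree.write(CONFIG)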

The same problem hit user management. New-NavServerUser flat out refuses to work without Windows authentication. The cmdlet checks for valid Windows credentials before it does anything else. No Windows, no user creation.

My solution was pragmatic: bypass the cmdlets entirely. I wrote code that injects NavUserPassword users directly into the SQL database. BC stores passwords hashed exactly 100,001 times. Yes, that specific number. Finding that took longer than I’d like to admit.

Kerberos support in Wine was incomplete for the authentication modes BC wanted. Specifically, the SP800-108 CTR HMAC algorithm wasn’t implemented in Wine’s bcrypt. BC uses this for certain key derivation operations, so I had to add it. Again, it was just a matter of seeing in the logs what BC expected and making Wine do that.
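For reference, SP800-108 in counter mode is a short algorithm once it’s written down. Here’s a Python sketch of the HMAC counter-mode construction as the NIST spec defines it; this shows the algorithm itself, not Wine’s bcrypt code, and the 32-bit big-endian counter and length encoding reflects the common Windows usage as far as I can tell:

import hashlib
import hmac
import math
import struct

def sp800_108_ctr_hmac(key: bytes, label: bytes, context: bytes,
                       length: int, hash_name: str = "sha256") -> bytes:
    # K(i) = HMAC(key, [i]_32 || label || 0x00 || context || [L]_32)
    # Output is the first `length` bytes of K(1) || K(2) || ...
    h_len = hashlib.new(hash_name).digest_size
    l_bits = struct.pack(">I", length * 8)      # total output length in bits
    blocks = []
    for i in range(1, math.ceil(length / h_len) + 1):
        data = struct.pack(">I", i) + label + b"\x00" + context + l_bits
        blocks.append(hmac.new(key, data, hash_name).digest())
    return b"".join(blocks)[:length]

# Derive 32 bytes of key material from a shared secret.
derived = sp800_108_ctr_hmac(b"shared secret", b"label", b"context", 32)
print(derived.hex())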

When It “Worked”

After a week of this, something happened. The service started. It stayed running. I called an OData endpoint and got… HTTP 200. Success! Sort of.

The response body was empty. And after that first request, the service froze completely.

What was going on?

The HTTP.sys Rabbit Hole

BC uses Windows’ kernel-mode HTTP server (HTTP.sys) for its web endpoints. Wine had a partial implementation, but “partial” is generous. Looking at the httpapi.spec file, I counted 13 functions that were just stubs: HttpWaitForDisconnect, HttpWaitForDisconnectEx, HttpCancelHttpRequest, HttpShutdownRequestQueue, HttpControlService, HttpFlushResponseCache, HttpGetCounters, HttpQueryRequestQueueProperty, HttpQueryServerSessionProperty, HttpQueryUrlGroupProperty, HttpReadFragmentFromCache, HttpAddFragmentToCache, and HttpIsFeatureSupported. None of them actually did anything.

Wine’s HTTP.sys could accept connections and start processing requests. It just couldn’t reply with a body payload, finish them properly or clean up afterwards. The server literally didn’t know how to release a connection once it was established. That’s why it froze after the first request.

I had to implement actual connection lifecycle management: the IOCTL handlers for waiting on disconnects, cancelling requests, and properly sending response bodies with the is_body and more_data flags. Server software needs to close connections cleanly. Games don’t care about that, or they use different APIs.

I also had to resort to extensive Wireshark tracing to see what BC expected at the network level. Once I saw the raw HTTP traffic, it was easier to identify the missing pieces. So I compared traffic from a Windows BC instance to a Wine one and identified what was missing or malformed. Then went back to the Wine code and fixed it.

Actually Working

Once the HTTP.sys fixes were in, responses actually came back with content. The freezing stopped.

That first real API response with actual data felt like winning the lottery. Until I noticed the response was always the same, because I had put a fixed response in the handler to test things. It took me an hour to realise I was looking at my own test code’s output, not BC’s.

Once I removed my test code and let BC handle the responses properly, it actually worked. The web client isn’t functional yet, but that wasn’t the main goal. The core is there: compile extensions, publish apps, run tests. That’s what I was after. Heck, a Hello World which showed the code ran on Linux was enough for me at that point.

Directions EMEA 2025 Presentation

Last year, before Directions EMEA, Stefan Maron reached out. He had heard about my Wine experiments and wanted to collaborate on a presentation. We teamed up and put together a talk showing the journey, the technical details, and a live demo. Well, we skipped the live demo part since doing live demos of experimental software is a recipe for disaster.

Once I had something functional, Stefan and I measured it properly. Same test app, same pipeline, Windows versus Linux.

The first working build: Linux finished in 13.4 minutes versus 18.4 minutes on Windows. That’s 27% faster out of the gate. Not bad, not bad at all.

After optimisation (smaller base image, certain folders in RAM, no disk cleanup overhead), Linux dropped to 6.3 minutes. Windows stayed around 18 minutes. 65% faster. But all this was on GitHub’s hosted runners; what if we could optimise further?

With caching on a self-hosted runner: 2 minutes 39 seconds total. At that point, we’d shifted the bottleneck from infrastructure to BC itself. Pure service startup time, waiting for metadata to load, was the limiting factor.

The container setup phase showed the biggest difference. Wine plus our minimal Linux image pulled and started in about 5 minutes. The Windows container took nearly 16 minutes for the same operation.

What Didn’t Work

The web client doesn’t work yet. I haven’t put much effort into it since it wasn’t the main goal. Last time I tried, I had the web server running, but the NST and the web service just wouldn’t talk to each other. I stopped there as Directions was coming up and I wanted to focus on the service tier.

The management endpoints don’t function. We had to write custom AL code to run tests via OData instead.

Some extensions that use uncommon .NET APIs crash immediately. If your extension does something exotic with Windows interop, it won’t work here.

What’s Next

This was always a proof of concept. The goal was to answer “can it be done?” and the answer is yes, with caveats.

Big disclaimer: This is purely a “see if I could” project. It’s not ready for production use, and I wouldn’t even recommend it for automated testing pipelines in its current state. It’s an experiment.

The code is up on GitHub.
Mine
BC4Ubuntu is my first try. Don’t use it, as it is messy and unoptimized.
wine64-bc4ubuntu has the custom Wine build.

Stefan’s
BCOnLinuxBase is the optimised base image.
BCDevOnLinux is the actual Dockerfile for BC. This is the one to use. But be careful, with great power comes great responsibility.

I’ve also got the NST running on ARM hardware. Getting SQL Server to work on ARM is an entirely different project for another time.

Would I run production on this? Absolutely not. But that was never the point.

Sometimes you learn the most by doing things the “wrong” way. But it was a fun ride.

And can you keep a secret? More than 98% of the code was written by AI. If I had done it today, the last 2% would have been included as well.


Stefan Maron contributed significantly to the pipeline work. This was very much a joint effort.

How I Benchmark LLMs on AL Code

When I started evaluating LLMs for Business Central development, I ran into a problem. The standard code generation benchmarks like HumanEval and MBPP measure Python performance. They tell you nothing about whether a model can write AL, the language used in Microsoft Dynamics 365 Business Central.

So I built CentralGauge, an open source benchmark specifically for AL code generation. This post explains the methodology, the challenges I encountered, and what I learned about how different models perform on AL code.

The Challenge of Domain Specific Code Generation

AL is a niche language. Unlike Python or JavaScript, there are far fewer AL code samples in the training data of most LLMs. This creates an interesting test: can models generalise their programming knowledge to an unfamiliar domain?

AL has several characteristics that make it distinct:

Syntax differences: AL uses procedure instead of function, begin/end blocks instead of braces, and has triggers that fire on database operations. The object model is unique with tables, pages, codeunits, reports, and queries.

Business Central conventions: Every AL object needs a numeric ID. Tables require proper field captions, data classification settings, and key definitions. Pages must reference a source table and define layout in a specific structure.

Standard library patterns: Working with records, assertions, and test frameworks follows Business Central conventions. The Record type, Assert codeunit, and page testing patterns are specific to this ecosystem.

These quirks mean a model might be excellent at Python but struggle with AL. Generic benchmarks cannot reveal this gap.

How CentralGauge Works

The benchmark follows a straightforward flow:

Task YAML → LLM call → code extraction → BC compile → tests, with a retry loop on failure

I defined 56 tasks organized into three difficulty tiers: easy, medium, and hard. Each task lives in a YAML file that specifies what the model should create, what tests must pass, and how to score the result.

The actual compilation and testing happen inside a Business Central container. This is the same environment developers use, ensuring benchmark tests are real-world viable rather than syntactic approximations.

I run multiple models in parallel against the same tasks. This enables direct comparison under identical conditions.
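Stripped of error handling, the harness loop looks roughly like the sketch below. The helpers (call_llm, extract_code, compile_in_container, run_tests) are hypothetical stand-ins, not the real implementation:

import yaml  # pip install pyyaml

MAX_ATTEMPTS = 2

def run_task(task_path: str, model: str) -> dict:
    with open(task_path) as f:
        task = yaml.safe_load(f)
    feedback = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        # Attempt 1 gets only the description; attempt 2 also gets the error output
        # and is expected to answer with a unified diff.
        response = call_llm(model, task["description"], feedback)      # hypothetical
        code = extract_code(response)                                  # hypothetical
        ok, compile_log = compile_in_container(code)                   # hypothetical
        if ok:
            passed, test_log = run_tests(task["expected"]["testApp"])  # hypothetical
            if passed:
                return {"passed": True, "attempts": attempt}
            feedback = test_log
        else:
            feedback = compile_log
    return {"passed": False, "attempts": MAX_ATTEMPTS}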

Task Design Philosophy

Creating good benchmark tasks requires discipline. The goal is to test whether a model knows AL, not whether it can follow instructions.

Here is a real task from the benchmark:

id: CG-AL-E001
description: >-
  Create a simple AL table called "Product Category" with ID 70000.
  The table should have the following fields:
  - Code (Code[20], primary key)
  - Description (Text[100])
  - Active (Boolean, default true)
  - Created Date (Date)

  Include proper captions and data classification.
expected:
  compile: true
  testApp: tests/al/easy/CG-AL-E001.Test.al

Notice what the task does NOT say: it does not explain that AL uses InitValue for defaults, or how to define a primary key. The model should know this. If it does not, that is a valid test failure.

Verifiable through real tests: Every task includes a test codeunit that runs in the BC container. These tests check that the generated code actually works, not just that it compiles. For example, a table test verifies that default values persist correctly after insert operations.

Clear requirements without ambiguity: I specify exact field names, types, and behaviours. Vague specifications like “create a useful table” produce unmeasurable results.

The Two Attempt Mechanism

A single-attempt benchmark misses something important: the ability to self-correct. Real development involves fixing mistakes. I wanted to measure that capability.

Each task allows two attempts:

Attempt one: The model receives the task description and generates code from scratch.

Attempt two: If attempt one fails (compilation error or test failure), the model receives the error output and must provide a fix. The fix comes as a unified diff, forcing the model to reason about what specifically went wrong.

This mechanism reveals different model behaviours. Some models consistently pass on the first attempt. Others frequently fail initially but recover on attempt two, showing strong debugging capability. A few models fail both attempts, unable to either generate correct code or diagnose their mistakes.

The scoring reflects this: passing on attempt two incurs a 10-point penalty compared to passing on attempt one. The penalty is enough to differentiate first-try success from eventual success, but not so severe that self-correction becomes worthless.

Scoring Methodology

Fair comparison requires transparent scoring. I use a point-based system:

Scoring breakdown: 50 points compilation, 30 points tests, 10 points required patterns, 10 points forbidden patterns

A task passes when the score reaches 70 or higher with no critical failures. The attempt penalty subtracts 10 points per additional attempt needed.

This weighting reflects priorities. Compilation is foundational (50 points) because non-compiling code provides zero value. Test passage validates correctness (30 points). Pattern checks (20 points combined) catch specific issues like missing required attributes or the presence of deprecated constructs.

The pattern checks serve a specific purpose. Some tasks require the model to use specific AL features (such as setting Access = Public on a codeunit). Others forbid certain patterns (like using deprecated syntax). These checks ensure the model demonstrates knowledge beyond “code that happens to work.”
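Expressed as code, the scoring rules above boil down to something like this sketch (simplified; treating “no critical failures” as “it must at least compile” is my own shorthand here):

def score_task(compiled: bool, tests_passed: bool,
               required_patterns_ok: bool, forbidden_patterns_ok: bool,
               attempts: int) -> tuple[int, bool]:
    score = (50 if compiled else 0) \
          + (30 if tests_passed else 0) \
          + (10 if required_patterns_ok else 0) \
          + (10 if forbidden_patterns_ok else 0)
    score -= 10 * (attempts - 1)        # penalty per additional attempt
    passed = score >= 70 and compiled   # simplified "no critical failures" check
    return score, passed

# Example: compiles, tests pass, one pattern check fails, passed on attempt two
print(score_task(True, True, False, True, attempts=2))   # (80, True)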

Parallel Execution at Scale

Running benchmarks across multiple models introduces practical challenges.

Parallel execution: one task runs simultaneously across Opus, Sonnet, GPT, and Gemini models

Rate limiting varies by provider: Anthropic, OpenAI, and Google each have different quotas for requests per minute and tokens per minute. The benchmark respects these limits through a token bucket rate limiter that tracks usage per provider.

The BC container is a shared resource: Unlike LLM calls, which can run in parallel, compilation must be serialised. The container becomes unstable if multiple compilations run simultaneously. A FIFO queue ensures only one compilation happens at a time while parallel LLM calls continue.

Cost tracking enables comparison: Token usage and estimated costs are recorded per task per model. This reveals which models are cost effective for AL code generation versus which consume excessive tokens for marginal quality improvements.
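Here’s a rough sketch of the two coordination pieces, a per-provider token bucket and a single compile lock, assuming an asyncio-based harness. The quotas are illustrative, the lock stands in for the FIFO queue, and call_llm/compile_in_container are hypothetical:

import asyncio
import time

class TokenBucket:
    # Simple requests-per-minute limiter for one provider.
    def __init__(self, rpm: int):
        self.capacity = rpm
        self.tokens = float(rpm)
        self.updated = time.monotonic()

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.capacity / 60)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            await asyncio.sleep(1)

limiters = {"anthropic": TokenBucket(50), "openai": TokenBucket(60), "google": TokenBucket(30)}
compile_lock = asyncio.Lock()   # the BC container handles one compilation at a time

async def run_attempt(provider: str, prompt: str) -> bool:
    await limiters[provider].acquire()           # LLM calls run in parallel, within quota
    code = await call_llm(provider, prompt)      # hypothetical async LLM call
    async with compile_lock:                     # compilations are serialised
        return await compile_in_container(code)  # hypothetical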

What I Learned

Running benchmarks across multiple models revealed clear performance differences.

Model pass rates on AL code generation benchmark

Opus 4.5 leads at 66%: Claude’s largest model achieved the highest pass rate, successfully completing two thirds of the benchmark tasks. Gemini 3 Pro followed at 61%, with Sonnet 4.5 and GPT 5.2 in the mid 50s.

The gap between top and bottom is significant: From Opus at 66% to the budget models around 37%, the spread is nearly 30 percentage points. This matters for production use where reliability is critical.

Self-correction quality differs from generation quality: Some models generate mediocre first attempts but excel at debugging when given error feedback. Others produce good initial code but struggle to interpret compilation errors. The two-attempt mechanism exposed these differences.

Cost efficiency varies dramatically: Gemini 3 Pro used four times as many tokens as Opus but cost roughly a fifth as much ($0.50 vs $2.77). Token pricing differs so much between providers that the cheapest model per token is not necessarily the cheapest per task.

I publish the latest benchmark results at ai.sshadows.dk, including detailed breakdowns by task and model. The leaderboard updates as I test new models.

The Code and How to Contribute

CentralGauge is open source at github.com/SShadowS/CentralGuage. The repository includes all 56 benchmark tasks, the execution framework, and documentation for adding new tasks.

If you work with Business Central and want to contribute tasks, the format is straightforward. Define the task in YAML, write a test codeunit that validates the expected behavior, and submit a pull request. The benchmark improves as the task set grows to cover more AL patterns and edge cases.

Conclusion

Generic code benchmarks cannot tell you how an LLM will perform on your specific domain. AL code generation requires understanding Business Central conventions, object structures, and syntax patterns that differ from mainstream languages.

By building a dedicated benchmark with curated tasks, real compilation, and actual test execution, I can measure what matters: whether a model can produce working Business Central code. The two-attempt mechanism adds nuance by measuring self-correction alongside generation.

The results have been informative. Model ranking on AL tasks does not match generic benchmark rankings. Cost per completion and self-correction ability both vary in ways that affect practical utility.

If you are evaluating LLMs for Business Central development, or any niche domain, consider building similar targeted benchmarks.

See you next Friday for another write-up

Native AL Language Server Support in Claude Code

If you’re using Claude Code for Business Central development, you’ve probably noticed that while it’s great at writing AL code, it doesn’t truly understand your project structure. It can’t jump to definitions, find references, or see how your objects relate to each other.

Until now.

I’ve built native AL Language Server Protocol (LSP) integration for Claude Code. This means Claude now has the same code intelligence that VS Code has: symbol awareness, navigation, and structural understanding of your AL codebase.

Wait, didn’t you already do this?

Yes! A few months ago I contributed AL language support to Serena MCP, which brought symbol-aware code editing to Business Central development. Serena works with any MCP-compatible agent: Claude Desktop, Cursor, Cline, and others.

This native Claude Code integration is different. Instead of going through MCP, it hooks directly into Claude Code’s built-in language server support. The result is a more polished, seamless experience specifically for Claude Code users.

Serena MCP: Universal, works everywhere, requires MCP setup
Native LSP: Claude Code only, tighter integration, zero-config once installed

If you’re using Claude Code as your primary tool, the native integration is the way to go. If you switch between different AI coding assistants, Serena gives you AL support across all of them.

What is this?

The AL Language Server is the engine behind VS Code’s AL extension. It’s what powers “Go to Definition”, “Find All References”, symbol search, and all the other navigation features you use daily.

By integrating this directly into Claude Code, the AI assistant now has access to:

  • Document symbols: all tables, codeunits, pages, fields, procedures in a file
  • Workspace symbols: search across your entire project
  • Go to Definition: jump to where something is defined
  • Go to Implementation: jump to implementations
  • Find References: see everywhere something is used
  • Hover information: type information and documentation
  • Call hierarchy: see what calls what, incoming and outgoing
  • Multi-project support: workspaces with multiple AL apps work fully

This isn’t regex pattern matching. This is the actual Microsoft AL compiler understanding your code.

Why does this matter?

Without LSP, Claude Code treats your AL files as plain text. It can read them, but it doesn’t understand the relationships between objects. Ask it to find all places where Customer."No." is used and it has to grep through files hoping to find matches.

With LSP, Claude can ask the language server directly. It knows that Customer is a table, that "No." is a field of type Code[20], and it can find every reference instantly.

The difference is like asking someone to find a book in a library by reading every page versus using the catalog system.
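Under the hood this is plain Language Server Protocol: JSON-RPC messages with a Content-Length header, sent to the AL Language Server over stdio. A minimal Python sketch of what a documentSymbol request looks like on the wire (a real session also needs the initialize/initialized handshake and a didOpen notification first; the file URI is just an example):

import json

def lsp_frame(payload: dict) -> bytes:
    # LSP messages are JSON-RPC bodies prefixed with a Content-Length header.
    body = json.dumps(payload).encode("utf-8")
    return f"Content-Length: {len(body)}\r\n\r\n".encode("ascii") + body

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "textDocument/documentSymbol",
    "params": {"textDocument": {"uri": "file:///workspace/src/Customer.Table.al"}},
}
# In the wrapper this is written to the language server's stdin,
# e.g. proc.stdin.write(lsp_frame(request)), and the framed response is read back.
print(lsp_frame(request).decode("utf-8"))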

Real example

Here’s what Claude Code can do with LSP on a Customer table:

Go To Definition - On CustomerType enum reference at line 77:
→ Defined in CustomerType.Enum.al:1:12

Hover - Same position shows type info:
Enum CustomerType

Document Symbols - Full symbol tree for Customer.Table.al:
Table 50000 "TEST Customer" (Class) - Line 1
  fields (Class) - Line 6
    "No.": Code[20] (Field) - Line 8
      OnValidate() (Function) - Line 13
    Name: Text[100] (Field) - Line 22
    "Customer Type": Enum 50000 CustomerType (Field) - Line 77
    Balance: Decimal (Field) - Line 83
    ...
  keys (Class) - Line 131
    Key PK: "No." (Key) - Line 133
    ...
  OnInsert() (Function) - Line 158
  OnModify() (Function) - Line 168
  UpdateSearchName() (Function) - Line 190
  CheckCreditLimit() (Function) - Line 195
  GetDisplayName(): Text (Function) - Line 206

Every field with its type. Every key with its composition. Every procedure with its line number. Claude can now navigate your code like a developer would.

Requirements

  • Claude Code 2.1.0 or later. Earlier versions have a bug that prevents built-in LSPs from working.
  • VS Code with AL Language extension. The plugin uses Microsoft’s AL Language Server from your VS Code installation.
  • Python 3.10+ in your PATH
  • A Business Central project with standard AL project structure and app.json

Installation

Step 1: Enable LSP Tool

Set the environment variable before starting Claude Code. Even though LSPs are now supported, I don’t think they are production-ready in every setup yet, hence the explicit opt-in:

# PowerShell (current session)
$env:ENABLE_LSP_TOOL = "1"
claude

# PowerShell (permanent)
[Environment]::SetEnvironmentVariable("ENABLE_LSP_TOOL", "1", "User")

# Bash
export ENABLE_LSP_TOOL=1
claude

Step 2: Install the Plugin

  1. Run claude
  2. /plugin marketplace add SShadowS/claude-code-lsps
  3. Type /plugins
  4. Tab to Marketplaces
  5. Select claude-code-lsps
  6. Browse plugins
  7. Select al-language-server-python with spacebar
  8. Press “i” to install
  9. Restart Claude Code

That’s it. The plugin automatically finds the newest AL extension version in your VS Code extensions folder.

Repository: github.com/SShadowS/claude-code-lsps

What’s next?

The current wrapper is Python-based. A few things I’m looking at:

  • Go-compiled binaries for faster startup and no runtime dependencies
  • Better error handling for more graceful recovery when the language server hiccups
  • Testing on more setups with different VS Code versions and extension configurations

Try it out and feedback

If you’re doing BC development with Claude Code, give this a try. The difference in code navigation and understanding should be significant.

I’d love to hear your feedback. What works, what doesn’t.

If you open an issue on GitHub, please attach %TEMP%/al-lsp-wrapper.log, as it helps me a lot during debugging. This logging will be disabled in a few weeks; I just need it here in the beginning.

Repository: github.com/SShadowS/claude-code-lsps


This is part of my ongoing work on AI tooling for Business Central development. See also: CentralGauge for benchmarking LLMs on AL code, and my MCP servers for BC integration.