Look, most of the AI testing tools I cover on the TestGuild Automation Podcast have two things in common: they’re proprietary, and they only work with Playwright.

That’s fine if you’re starting a greenfield project in 2026. But what about the millions of lines of test code already running on Selenium? What about Appium mobile suites that teams have maintained for years?

That’s the gap Alumnium fills, and it’s the reason I think it’s one of the most practically useful open source projects in the testing space right now.

I sat down with Alex Rodionov, the creator of Alumnium, on the TestGuild Automation Podcast to dig into how it works, why he built it on top of existing frameworks instead of replacing them, and whether it’s actually ready for teams running real test suites.

Here’s what I learned.

What Makes Alumnium Different From Every Other AI Testing Tool?

Most AI-powered test automation tools in 2025 fall into one of two buckets:

Bucket 1: Vendor platforms — proprietary, subscription-based, Playwright only. They want you to migrate your entire workflow to their ecosystem. If the tool changes pricing or shuts down, you’re stuck.

Bucket 2: Code generators — AI writes Selenium/Playwright code for you upfront. Great for new projects, but once the UI changes, you’re back to maintaining the same brittle selectors you always had.

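Alumnium is neither. It’s open source, it works across Selenium, Playwright, and Appium, and it applies AI at runtime instead of generating code upfront, on mobile as well as web.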

That combination doesn’t exist anywhere else in the open source world right now. There are web agents. There are Playwright-specific AI tools. But cross-framework, open source, runtime AI that also covers mobile? Alex told me he believes Alumnium is the only open source project doing all of that together.

What Is Alumnium?

Alumnium (pronounced “al-oo-MIN-ee-um” — Alex will correct you) is an open source AI layer you add on top of Selenium, Playwright, or Appium.

Instead of replacing your existing framework, it plugs in and gives your tests AI-powered interaction and assertion capabilities.

The core idea: you tell Alumnium what to do in plain English, and it figures out how to do it at runtime using an LLM.

python
al.do("type 'selenium' into the search field, then press Enter")
al.check("search results contain selenium.dev")

That’s your test. No selectors. No page objects. No locators to maintain.

Alex spent a year and a half building this after living the pain of maintaining hundreds of tests at scale. His first requirement: anything new had to have a migration path. As he told me on the podcast:

“If somebody would come to me and say, hey, I built this new tool, I would say, OK, how do I start using it in my 500 tests? I need a path forward. I don’t want to lock into some tool that just came out and force me to rewrite a bunch of my tests.”

That constraint shaped everything about how Alumnium works.

How Does Alumnium Work?

Accessibility Tree, Not Screenshots

When you call al.do() or al.check(), Alumnium captures the accessibility tree snapshot of the current page, feeds it into an LLM via a structured prompt, and receives back a list of tool calls — type this text, click this element, and so on.

Why the accessibility tree instead of screenshots? Alex explained the problem clearly: vision models have to map instructions back to pixel coordinates, which can be inaccurate. You also can’t use just any LLM for coordinate-based computer use; specialized models (like Anthropic’s Computer Use) were built specifically for that. But text? Any capable LLM handles it precisely, and the result maps directly to the actual DOM elements you need to interact with.
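To make that flow concrete, here’s a minimal sketch of the idea, not Alumnium’s actual internals. The Node class, the serialize format, and the canned fake_llm response are all hypothetical stand-ins I made up for illustration. The point is only that a text tree with element ids lets the model’s answer map straight back to a specific element.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a simplified, hypothetical accessibility tree."""
    node_id: int
    role: str
    name: str = ""
    children: list = field(default_factory=list)


def serialize(node, depth=0):
    """Flatten the tree into indented text that a text-only LLM can read."""
    line = f'{"  " * depth}[{node.node_id}] {node.role} "{node.name}"'
    return "\n".join([line] + [serialize(child, depth + 1) for child in node.children])


def index_by_id(node):
    """Build an id -> node lookup so tool calls can be resolved to elements."""
    found = {node.node_id: node}
    for child in node.children:
        found.update(index_by_id(child))
    return found


def fake_llm(instruction, tree_text):
    """Stand-in for a real model: returns canned tool calls that reference ids."""
    return [
        {"tool": "type", "id": 2, "text": "selenium"},
        {"tool": "press_key", "id": 2, "key": "Enter"},
    ]


tree = Node(1, "search", "Site search", [
    Node(2, "textbox", "Search"),
    Node(3, "button", "Submit"),
])

tree_text = serialize(tree)
calls = fake_llm("type 'selenium' into the search field, then press Enter", tree_text)
nodes = index_by_id(tree)

print(tree_text)
for call in calls:
    target = nodes[call["id"]]
    print(call["tool"], "->", target.role, target.name)
```

No pixel coordinates appear anywhere in that exchange. Every tool call names an element id, which is why a general-purpose text model is enough.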

The Caching Layer

One of the smarter pieces of the architecture: Alumnium caches the elements it uses during execution. The next time it hits the same page and performs the same actions, it checks whether those cached elements still exist before making any LLM call. If they do, it skips the LLM entirely and runs at native Selenium/Playwright speed.

So the performance overhead people worry about? For repeated test runs, it largely goes away.
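Here’s a rough sketch of how a cache like that could behave, assuming elements can be identified by a stable key such as role plus name. StepCache, FakeLLM, and run_step are hypothetical names of mine, not Alumnium’s API, and the real implementation may well work differently.

```python
class FakeLLM:
    """Stand-in for a real model; counts how often it is asked to plan."""

    def __init__(self):
        self.calls = 0

    def plan(self, instruction, elements):
        self.calls += 1
        # Pretend the model picked the first button on the page.
        button = sorted(e for e in elements if e.startswith("button:"))[0]
        return [{"tool": "click", "element": button}]


class StepCache:
    """Remembers which elements a step used on a given page."""

    def __init__(self):
        self._plans = {}  # (page_key, instruction) -> list of tool calls

    def lookup(self, page_key, instruction, elements):
        plan = self._plans.get((page_key, instruction))
        # Reuse the plan only if every element it touched is still on the page.
        if plan is not None and all(call["element"] in elements for call in plan):
            return plan
        return None

    def store(self, page_key, instruction, plan):
        self._plans[(page_key, instruction)] = plan


def run_step(cache, llm, page_key, instruction, elements):
    plan = cache.lookup(page_key, instruction, elements)
    if plan is None:  # cache miss or stale entry: fall back to the LLM
        plan = llm.plan(instruction, elements)
        cache.store(page_key, instruction, plan)
    return plan


llm = FakeLLM()
cache = StepCache()
original_page = {"textbox:Search", "button:Submit"}
changed_page = {"textbox:Search", "button:Go"}

run_step(cache, llm, "/search", "submit the form", original_page)  # miss, LLM called
run_step(cache, llm, "/search", "submit the form", original_page)  # hit, LLM skipped
plan = run_step(cache, llm, "/search", "submit the form", changed_page)  # stale, LLM called
print(llm.calls, plan)
```

Three runs, two LLM calls: the unchanged page is served from the cache, and the model is only consulted again once the button it relied on disappears.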

Two Modes of Operation

Library mode: Install the package, instantiate the Alumni class inside your existing Playwright or Selenium test, and pass it the browser object. Call al.do() and al.check() wherever you want AI to take over. Everything else stays exactly as it was.

MCP mode: Run Alumnium as a lightweight MCP server, write your test as a plain markdown file (just English steps), and execute it via Claude Code from the CLI. Claude Code handles the high-level planning; Alumnium handles the actual browser interactions as a sub-agent.

Why Not Just Use Playwright’s Built-In AI Features?

Good question. Playwright has released AI-assisted tooling of its own, including a self-healing agent.

The key difference: those tools are about writing and updating tests, not running them.

If your UI changes and a selector breaks, you trigger the healer agent manually, it updates the test, and you commit.

Alumnium’s approach is different. There are no selectors to break in the first place. It resolves what to click at runtime, every time. Self-healing is baked into the execution, not a separate repair step.

The Context Rot Problem (And Why It Matters for AI Testing)

This was one of the most interesting parts of my conversation with Alex — something he calls “context rot.”

A lot of people think the path to AI-powered testing is just connecting Playwright MCP to Claude Code and telling it to test your app. It works great for simple, short scenarios. But here’s the problem: on every page navigation, Playwright MCP captures the full accessibility tree snapshot and sends it back to the LLM. Run a 100-step test and that context window fills up fast.

Claude’s context window is 200,000 tokens, but here’s what Alex told me directly about where things fall apart:

“Roughly at 40% [of the context window] it starts getting way worse at instruction following. It starts forgetting what you told it to do. It starts skipping instructions completely.”

That’s roughly 80,000 tokens — anything beyond that and you can’t reliably trust what the model does. The model starts forgetting cleanup tasks. It forgets to generate a report at the end of a test run. It skips steps entirely.
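The arithmetic is worth running once. The 200,000-token window and the 40% figure come from the conversation above; the 4,000 tokens per snapshot is purely my assumption for illustration, and real pages vary a lot.

```python
CONTEXT_WINDOW = 200_000    # tokens, as quoted above
DEGRADATION_PERCENT = 40    # where instruction following starts to slip, per Alex
SNAPSHOT_TOKENS = 4_000     # ASSUMPTION: size of one accessibility tree snapshot

# Integer math keeps the result exact.
budget = CONTEXT_WINDOW * DEGRADATION_PERCENT // 100


def steps_before_degradation(snapshot_tokens, token_budget=budget):
    """How many full snapshots fit before the context crosses the budget."""
    return token_budget // snapshot_tokens


safe_steps = steps_before_degradation(SNAPSHOT_TOKENS)
tokens_for_100_steps = 100 * SNAPSHOT_TOKENS

print(budget)                # reliable token budget
print(safe_steps)            # steps before quality drops, under the assumption
print(tokens_for_100_steps)  # what a 100-step test would try to send
```

Under that assumption you get about 20 steps of headroom, and a 100-step test would need twice the entire context window just for snapshots.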

The architectural answer: don’t try to do it all in one agent. Alumnium’s MCP mode splits the work. Claude Code handles planning and high-level thinking at the top level. Alumnium handles the actual browser interactions as a sub-agent running a cheaper, faster model. The result: a 610-task real-world run cost approximately $5. A naive single-agent approach using frontier models could cost 40–50x more.
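Here’s a toy model of why the split helps, under heavy simplification. The word-count tokenizer, the fixed snapshot, and the function names are all hypothetical. It compares how much context piles up when one agent sees every snapshot with what the planner holds when a sub-agent only reports back a one-line summary.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def single_agent(steps, snapshot):
    """One agent keeps every snapshot and instruction in the same context."""
    context = []
    for step in steps:
        context.append(snapshot)
        context.append(step)
    return sum(count_tokens(item) for item in context)


def sub_agent(step, snapshot):
    """Fresh context per step: only this instruction and this snapshot."""
    used = count_tokens(step) + count_tokens(snapshot)
    return f"done: {step}", used


def planner(steps, snapshot):
    """Top-level agent keeps only the short summaries the sub-agent returns."""
    history = []
    peak_sub_agent_context = 0
    for step in steps:
        summary, used = sub_agent(step, snapshot)
        history.append(summary)
        peak_sub_agent_context = max(peak_sub_agent_context, used)
    planner_context = sum(count_tokens(item) for item in history)
    return planner_context, peak_sub_agent_context


steps = [f"click item {i}" for i in range(100)]
snapshot = "node " * 500  # pretend every page snapshot is 500 tokens

single_total = single_agent(steps, snapshot)
planner_total, sub_peak = planner(steps, snapshot)
print(single_total, planner_total, sub_peak)
```

In this toy run the single agent ends up carrying 50,300 tokens, while the planner carries 400 and the sub-agent never sees more than 503 at once. The real numbers will differ, but the shape of the difference is the point.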

What Are the Real Benefits?

Alex summed up the three main reasons to add Alumnium to your tests:

  1. You write less code. No selectors, no page objects, no locator maintenance. You call al.do() with a plain English instruction.
  2. Tests are more resilient. The AI resolves elements at runtime. UI changes that would break selector-based tests just… don’t break Alumnium tests, because there was never a hard-coded locator to begin with.
  3. Cross-platform tests in plain English. With the markdown/MCP mode, you can write one test description and run it on Chrome, iOS, and Android — as long as Selenium/Playwright/Appium is available underneath. You never specify how to interact, so the same test instruction works across platforms.

Why Alumnium Won’t Blow Up Your LLM Bill

Token cost is the elephant in the room with every AI testing conversation. I’ve been running my own AI agents and the bills add up fast — hit your context limit, trigger a compaction, rinse and repeat across hundreds of tests and suddenly you’re wondering if AI testing is actually cost-effective at all.

Alumnium addresses this in three concrete ways:

1. It’s designed for low-tier models. Most of what Alumnium does — reading an accessibility tree, figuring out which element to click — doesn’t need a frontier model. Alex built it to work well with the cheapest, fastest models available. A single test run costs less than $0.01. That’s not a marketing claim, that’s the architecture working as intended.

2. The caching layer eliminates repeat LLM calls. The first time Alumnium interacts with a page, it stores which elements it used. On subsequent runs, it checks whether those cached elements are still present before making any LLM call at all. If they are, it skips the LLM entirely and runs at native Selenium or Playwright speed. You only pay for AI when you actually need it.

3. The dual-agent architecture keeps costs in check at scale. This is the part that really got my attention. Alex ran 610 real-world tasks for approximately $5. How? By splitting the work intelligently: Claude Code handles high-level planning as the main agent (used sparingly), while Alumnium acts as a sub-agent running a cheap model that costs 40–50x less than a frontier model. Compare that to a naive single-agent approach, dumping everything into one context with a top-tier model, and you could be looking at hundreds of dollars for the same workload.

The bigger insight here isn’t just cost. It’s that keeping the sub-agent’s context small and focused is what makes the whole system reliable. You’re not just saving money — you’re avoiding the context rot problem at the same time.

Cheap and more trustworthy. That’s a rare combination.
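If you want to sanity-check the cost figures above, the arithmetic is short. The inputs are the numbers Alex quoted; treating the 40–50x multiplier as applying to the whole run is my simplification, since the real split between planner and sub-agent spend wasn’t broken down.

```python
TOTAL_COST_USD = 5.00    # approximate cost of the run, as quoted
TASKS = 610              # real-world tasks in that run
MULTIPLIER_LOW = 40      # frontier model cost multiplier, low end
MULTIPLIER_HIGH = 50     # frontier model cost multiplier, high end

per_task = TOTAL_COST_USD / TASKS
naive_low = TOTAL_COST_USD * MULTIPLIER_LOW
naive_high = TOTAL_COST_USD * MULTIPLIER_HIGH

print(f"per task: ${per_task:.4f}")
print(f"naive single-agent estimate: ${naive_low:.0f} to ${naive_high:.0f}")
```

That works out to a bit under a cent per task, which lines up with the "less than $0.01" claim, and to roughly $200 to $250 for the naive approach, which is where "hundreds of dollars" comes from.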

Honest Limitations

Alex didn’t oversell this, and neither will I.

Non-determinism is real. LLMs aren’t deterministic. Give the same prompt 10 times and one run might behave differently. If your test suite depends on strict, zero-deviation regression behavior, you’ll need to do prompt engineering work to get Alumnium stable enough to trust.
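One practical way to find out whether a prompt is stable enough is to run the same step many times and look at the pass rate. This is a generic sketch, not an Alumnium feature: flaky_step is a hypothetical stand-in that fails at random, and in a real suite you would replace it with a function that runs your actual al.do() and al.check() calls and returns True or False. The 10% failure rate and the 98% threshold are arbitrary assumptions.

```python
import random


def flaky_step(rng, failure_rate=0.1):
    """Hypothetical stand-in for a non-deterministic AI-driven test step."""
    return rng.random() >= failure_rate


def pass_rate(step, runs):
    """Run the step repeatedly and return the fraction of passing runs."""
    passed = sum(1 for _ in range(runs) if step())
    return passed / runs


rng = random.Random(42)  # seeded so the experiment is repeatable
rate = pass_rate(lambda: flaky_step(rng), runs=200)
is_stable = rate >= 0.98  # ASSUMPTION: the threshold is a team decision

print(f"pass rate: {rate:.1%}, stable enough: {is_stable}")
```

If a reworded prompt moves the pass rate noticeably, that’s the signal to keep iterating on the wording before the test joins your regression suite.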

Performance overhead exists. Every action that isn’t cached requires an LLM call. Even with caching, initial runs are slower than native Playwright or Selenium execution. Whether that trade-off is worth it depends on what you’re testing.

No Java client library. If your team is Java-first, you can still use Alumnium in markdown/MCP mode, but the in-code library integration isn’t there yet.

Trust is a mindset shift. The most common pushback Alex hears is: “How do I know the AI actually did what I told it to do?” He’s honest that this takes some adjustment, and some engineering investment in prompt design, to get past. He shared a telling example from his own experience:

“I was surprised how changing a single word can completely get the model off track. You change a single word in the prompt, and now suddenly the model thinks about it differently.”

Prompt engineering for testing isn’t a solved problem. It’s a skill you build over time, and what works for Claude may not work the same way for GPT.

Is It Ready for Production?

Alex’s take: mature enough to start using in your existing test suite, not ready to replace thousands of stable regression tests overnight.

Companies are using it now in different ways — some writing purely markdown-based tests, some using it just to avoid writing Appium selectors, some building cross-platform test coverage they couldn’t achieve before. It’s stable enough to use actively, but it’s moving fast. The MCP mode didn’t even exist six months ago. Alex expects the tool to keep evolving quickly.

How to Get Started

Alumnium runs on your own LLM API key — OpenAI, Anthropic, Google Gemini, Meta Llama, DeepSeek, and others are all supported. A single test costs less than $0.01. It’s a separate cost from any Claude Code subscription.

Install it:

bash
pip install alumnium

Add it to your existing Playwright test:

python
from alumnium import Alumni

al = Alumni(page)  # pass your Playwright page object
al.do("fill in the login form with test credentials")
al.check("dashboard is visible")

Or set up the MCP server for Claude Code:

bash
claude mcp add alumnium --env OPENAI_API_KEY=... -- npx alumnium mcp

Then write your test as a markdown file and run it from the CLI.

Full documentation and the GitHub repo are at alumnium.ai. There’s also an Alumnium channel in the Selenium Slack if you want to connect with other users.

Frequently Asked Questions

What is Alumnium? Alumnium is an open source AI layer for test automation. It works with Selenium, Playwright, and Appium — adding AI-powered interactions and assertions at runtime without replacing your existing framework. Most AI testing tools only support Playwright and are proprietary; Alumnium supports all three major frameworks and is MIT licensed.

Does Alumnium replace Selenium or Playwright? No — and that’s the point. Alumnium is built on top of them. If you have 500 existing Selenium tests, you don’t rewrite them. You add Alumnium incrementally, test by test, and see how it performs before committing further.

Does Alumnium work with Selenium or only Playwright? It works with both, and with Appium too. This is one of its key differentiators: most AI testing tools in 2025 only support Playwright.

What LLMs does Alumnium support? Alumnium works with OpenAI, Anthropic Claude, Google Gemini, Meta Llama, DeepSeek, Mistral, and others. You supply your own API key.

Does Alumnium work on mobile? Yes. Alumnium supports Appium and has been successfully used for iOS and Android testing. The accessibility tree exists on both mobile platforms, making the same approach work across web and mobile.

How much does it cost to run Alumnium? A single test typically costs less than $0.01. Alumnium is designed to work with low-tier, cheaper models rather than expensive frontier models. Its internal caching skips LLM calls entirely on repeated runs when elements haven’t changed. At scale, a dual-agent approach — Claude Code for planning, Alumnium as a sub-agent with a cheap model — ran 610 real-world tasks for approximately $5. A naive single-agent approach using a frontier model for the same workload could cost 40–50x more.

Does Alumnium help reduce token consumption? Yes, in three ways: it uses low-tier models by design, it caches element interactions to avoid repeat LLM calls, and its sub-agent architecture keeps context windows small and focused. Smaller context = lower cost and better instruction-following reliability.

What is context rot in AI testing? Context rot is the degradation in LLM instruction-following that happens when a context window gets too full. In AI testing, sending large accessibility tree snapshots to an LLM on every step of a long test can saturate the context and cause the model to forget instructions, skip steps, or behave unpredictably.

Is Alumnium production-ready? As of 2025, Alumnium is stable enough for active use in test suites, particularly for new tests and incremental adoption. It’s still evolving quickly and isn’t a drop-in replacement for mature, large regression suites without engineering investment.

The Bottom Line

The reason I think Alumnium is worth your attention isn’t just the tech — it’s the philosophy. Alex built it with a migration path in mind, because that’s the real problem. Every other tool in this space seems to want you to start from zero. Alumnium asks: what if you didn’t have to?

If you have an existing Selenium or Playwright suite and you’ve been wondering where AI actually fits in without blowing everything up, this is a reasonable place to start experimenting.

Listen to my full conversation with Alex Rodionov on the TestGuild Automation Podcast to hear him walk through the context rot problem, the dual-agent cost architecture, and his read on where AI testing is heading.


Have you tried Alumnium? I’d love to hear how it worked for your team. Drop a comment below or join the conversation in the TestGuild Heartbeat community.
