Bottom Line: Kobiton is the first real device testing platform I’ve seen that makes AI-powered mobile testing feel like it belongs in your actual pipeline, not just a demo. The Claude MCP integration is genuinely useful. The catch: AI-generated tests still need a human who knows what they’re checking for, and the Kobiton team is the first to tell you that.

Best for: Mobile teams on CI/CD who want to run AI-assisted tests against real devices without leaving their IDE.

Not ideal for: Teams that need device-level risk analysis or long-term test history out of the box today.

I wasn’t expecting to be impressed.

I’ve been doing this long enough (25 years and 500+ interviews) to be skeptical when a vendor shows me an AI demo. Demos always run clean. The demo environment is never production. And “AI fixes your tests automatically” has been promised before.

Then during my conversation with Frank Moyer and Chris Faulhaber from Kobiton, something happened that I hadn’t seen before.

Chris was running a test through their Claude MCP plugin against a real device. The test failed on a small-screen device: the keyboard had popped up and covered some elements.

Claude looked at the failure, examined the application source code, checked the device screen dimensions, and, without any prompting, proposed the exact fix: insert driver.hide_keyboard() calls at the precise spots in the script. Subsequent runs passed.

Chris wasn’t directing it. He was watching it work.

That’s a different kind of demo.
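For readers who want to picture that fix, here is a minimal sketch of the pattern in Python. It assumes a driver that behaves like the Appium Python client's, which provides `find_element`, `is_keyboard_shown()`, and `hide_keyboard()`. The helper name `type_and_dismiss` and the `email_field` accessibility ID in the usage note are hypothetical, and this is not the script Claude edited in the demo.

```python
def type_and_dismiss(driver, accessibility_id: str, text: str) -> bool:
    """Type into a field, then hide the soft keyboard if it is showing.

    `driver` is expected to behave like an Appium Python client WebDriver.
    Returns True when a keyboard was dismissed, False otherwise.
    """
    # "accessibility id" is the locator strategy string behind
    # AppiumBy.ACCESSIBILITY_ID in the Appium Python client.
    field = driver.find_element("accessibility id", accessibility_id)
    field.send_keys(text)

    # On a small screen the keyboard can cover the next element the test
    # needs, so dismiss it before moving on.
    if driver.is_keyboard_shown():
        driver.hide_keyboard()
        return True
    return False
```

In a test, a call such as `type_and_dismiss(driver, "email_field", "user@example.com")` would sit just before the step that taps whatever lives below the field.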

What Kobiton Actually Is in 2026

Kobiton started as a real device testing platform, the kind that gives you actual Android and iOS hardware in a data center to run your Appium tests against, rather than simulators. That core business is still there and still valuable.

What’s new in 2026 is the AI layer. They’ve shipped a Claude MCP plugin that lets your AI coding agent (Claude Code, Cursor, GitHub Copilot, Codex, whatever you use) communicate directly with the Kobiton device cloud.

You write a test, or ask the agent to write one, and it runs on a real physical device. You stay in VS Code the whole time.

Frank framed the bigger picture clearly:

“You have to figure out how you’re going to use it and report back to me how it’s going to change the productivity within your team.”

That’s not vendor hype; it’s what he’s hearing from Fortune 100 executives right now. AI testing isn’t optional anymore. The question is which tools make it real.

Try Kobiton MCP Now

What It Does Well

Real Devices, Not Just Simulators

This is the thing most other AI coding agent integrations get wrong. Playwright MCP, for example, runs against browsers.

Most AI testing tools default to simulators. Simulators don’t show you what happens on a Samsung Galaxy with a cramped screen, or an older iOS version with a quirky keyboard behavior, or hardware that behaves differently under load.

Chris put it well:

“It’s almost a completely different world. It’s like the old days of PCs when there was lots of instability.”

Real devices catch the bugs that simulators don’t. That’s the whole point of device testing, and the Claude MCP plugin preserves it.

The demo showed a complete workflow: Claude checks the Kobiton app repository, verifies the test device is available, adjusts the script to target the real device, opens the browser to show the live session, runs the test, and reports back, all from within VS Code.
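The "adjusts the script to target the real device" step usually comes down to changing the capabilities an Appium session is started with. Below is a minimal sketch using only standard W3C-style Appium capability names. The function name `build_capabilities` and the example device values are hypothetical. Kobiton's own vendor-specific capabilities and server URL are deliberately left out because the conversation didn't cover them.

```python
def build_capabilities(platform_name: str, device_name: str,
                       platform_version: str, app: str) -> dict:
    """Return W3C-style Appium capabilities for one physical device."""
    # Standard Appium automation backends for each mobile platform.
    automation = {"android": "UiAutomator2", "ios": "XCUITest"}
    key = platform_name.lower()
    if key not in automation:
        raise ValueError(f"unsupported platform: {platform_name}")
    return {
        "platformName": platform_name,
        "appium:automationName": automation[key],
        "appium:deviceName": device_name,
        "appium:platformVersion": platform_version,
        "appium:app": app,
    }


# Hypothetical example values, not devices named in the demo.
caps = build_capabilities("Android", "Galaxy S22", "13", "demo-app.apk")
```

Keeping the device details in one function like this makes it easy for a person or an agent to retarget the same script at a different device without touching the test steps.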

The only time Chris left the IDE was to glance at the device session in the browser.

Natural Language Selectors for the Hard Cases

If you’ve ever had to test an embedded PDF, a canvas-rendered app, an infinite scroll list, or anything without real DOM locators, you know what a nightmare that is to automate.

Kobiton just shipped a feature that addresses this directly: you can inject a natural language description into your Appium script and the system finds the element based on that description.

Chris demonstrated it live, finding a “reset button” with a plain English description and no selector, and it worked. He mentioned this runs on a local model on his first-generation M1 MacBook at under a second per call and near-zero cost. For teams with a private model installation, the cost is essentially nothing.

Frank’s example: a customer trying to interact with an embedded PDF. No standard locators exist. With natural language selectors, they can describe what they want to interact with and the system figures it out. That’s a real problem solved.

They Engineered Out the Token Cost Problem

This one’s worth flagging because it’s a genuine differentiator. Most platforms that bolt AI onto testing pass the token costs through to you or charge a premium. Kobiton built IP on the back end specifically to optimize token usage, and their current pricing doesn’t add anything extra for the AI capabilities.

Frank was direct:

“Our competitors are charging extra and we figured out how to do it without token consumption.”

That’s not a feature on a roadmap. It’s already in the product.

Where It Has Real Limits

AI-Generated Tests Will Fool You If You’re Not Paying Attention

Frank said it straight: “AI slop isn’t just in the content you read, it’s in the code that gets generated.”

And Chris backed it up with specifics: “I’ve seen AI-generated tests generate tests that completely miss the point of the feature and just test truisms, like I assigned the value foo to five. Is foo five? Yes it is. Pat on the back. This works.” He added that reviewing those tests consumes a significant chunk of his time: they pass, look legitimate, and don’t actually verify anything meaningful.

This isn’t a Kobiton problem specifically; it’s a category problem.

But it’s worth saying clearly: you cannot turn AI-generated test authoring on, walk away, and trust the results. Someone with a testing mindset needs to review what the AI wrote and confirm it’s actually checking what matters.
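To make the truism problem concrete, here is a small Python sketch contrasting the kind of test Chris described with one that actually exercises a feature. The `apply_discount` function is a hypothetical feature invented for this illustration; it does not come from Kobiton or from the demo.

```python
def apply_discount(total_cents: int, percent: int) -> int:
    """Hypothetical feature under test: percentage discount in whole cents."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Integer arithmetic: the discount amount is rounded down to a whole cent.
    return total_cents - (total_cents * percent) // 100


def test_truism():
    # Passes no matter what apply_discount does, so it verifies nothing.
    foo = 5
    assert foo == 5


def test_discount_behavior():
    # Exercises the feature, including boundaries a reviewer would care about.
    assert apply_discount(1000, 10) == 900
    assert apply_discount(1000, 0) == 1000
    assert apply_discount(1000, 100) == 0
    try:
        apply_discount(1000, 101)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for percent > 100")
```

Both tests show up green in a report. Only the second one would turn red if someone broke the discount logic, and that is the difference a human reviewer is there to catch.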

Kobiton’s governance and remediation features (which Frank said have been built into the product since 2019) help, but they don’t replace that judgment.

Device-Level Risk Analysis Is Still Coming

When I asked whether the system could recommend which device/OS combinations to test first based on historical failure patterns (essentially risk-based testing at the device level), Frank said: “That’s one of the things we’re working on.” It’s a roadmap item, not a current feature.

If you’re managing a large device matrix and you need intelligent prioritization based on where failures historically cluster, that’s not in the product today.
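Until something like that ships, a team can approximate the idea from its own run history. The sketch below ranks device/OS combinations by historical failure rate. It is a simple stand-in heuristic, not Kobiton's planned feature, and the `RunResult` shape, function name, and sample devices are all assumptions made for illustration.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, NamedTuple


class RunResult(NamedTuple):
    device: str
    os_version: str
    passed: bool


def rank_by_failure_rate(
    history: Iterable[RunResult],
) -> list[tuple[tuple[str, str], float]]:
    """Return (device, os_version) pairs, highest failure rate first."""
    totals: dict[tuple[str, str], int] = defaultdict(int)
    failures: dict[tuple[str, str], int] = defaultdict(int)
    for run in history:
        key = (run.device, run.os_version)
        totals[key] += 1
        if not run.passed:
            failures[key] += 1

    rates = {key: failures[key] / totals[key] for key in totals}
    # Sort by failure rate (descending), then by number of runs (descending,
    # since more runs means more evidence), then by name for stable output.
    return sorted(
        rates.items(),
        key=lambda item: (-item[1], -totals[item[0]], item[0]),
    )


# Hypothetical history for illustration only.
history = [
    RunResult("Galaxy A03", "13", False),
    RunResult("Galaxy A03", "13", False),
    RunResult("Galaxy A03", "13", True),
    RunResult("iPhone SE", "17", False),
    RunResult("iPhone SE", "17", True),
    RunResult("Pixel 8", "14", True),
    RunResult("Pixel 8", "14", True),
]
ranked = rank_by_failure_rate(history)
```

A raw failure rate ignores recency and flaky tests, so treat the ordering as a starting point for deciding which devices run first, not as a verdict.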

Long-Term Test Memory Is a Work in Progress

I pushed on something that I think is going to matter more over time: can the system remember that a specific test failed on a specific device combination two weeks ago? Can it build institutional memory across test runs?

Frank acknowledged this is an active area of work: “We have and continue to work on that long-term memory challenge.” The system of record that tracks test history and performance over time exists, but the richer AI memory layer, the ability to reason across many past runs, is still evolving.

Who It’s For

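This makes sense if you’re a mobile team on CI/CD that wants to run AI-assisted tests against real devices without leaving the IDE, or if you’re automating apps with hard-to-locate elements such as embedded PDFs, canvas-rendered screens, or infinite scroll.

It’s probably not the right fit if you need device-level risk analysis or long-term test history out of the box today, or if you expect to ship AI-generated tests without a human reviewing them.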

Joe’s Take

I’ve heard “AI is changing testing” so many times that I’ve learned to ignore the noun and listen for the verb. What’s actually changing, specifically, and for whom?

What Frank and Chris showed me is specific: a Fortune 100 company went from shipping 1,000 code changes a month to 10,000. The testing team can’t manually keep up with that rate of change. Without tooling like this, you either slow down engineering or you ship untested code. That’s not a hypothetical — it’s happening now.

The Claude MCP plugin is the most honest integration between an AI coding agent and real-device testing I’ve seen. It doesn’t pretend the problem is solved. Chris showed me it failing and recovering. Frank told me directly that AI code generation produces garbage tests you need to catch. That kind of honesty, from a vendor, is more valuable to me than a flawless demo.

The limits are real. Review your AI-generated tests. Don’t expect device risk intelligence today. Know that long-term memory is still being built. But if your team is drowning in the gap between how fast developers are shipping and how fast your test coverage can keep up, this is worth a serious look.

FAQ

Does Kobiton work with AI coding agents besides Claude? Yes. The MCP plugin works with any tool that supports the MCP specification — that includes Cursor, GitHub Copilot, Codex, Gemini, and others. The demo used Claude Code, but Chris confirmed it’s not Claude-specific.

How does Kobiton handle the cost of running AI features? They’ve built token optimization into the backend and are not currently charging extra for AI capabilities — it’s included in running Appium scripts on the platform. Chris also demonstrated a natural language selector running on a local model (first-gen M1 MacBook, under a second per call) as a near-zero-cost option for teams with private model deployments.

Is AI-generated mobile testing ready for production CI pipelines? The honest answer from both Frank and Chris: yes, with supervision. The tool can generate and run tests on real devices autonomously. But AI-generated tests frequently test truisms rather than meaningful behavior, and someone with testing knowledge needs to review what the AI wrote. The platform has governance features, but human judgment is still required.

What kinds of apps is Kobiton especially useful for? The natural language selector feature makes it particularly useful for apps that are historically hard to automate: embedded PDFs, canvas-rendered applications, infinite scroll, and any UI with dynamic or missing locators. Traditional Appium locator-based approaches struggle with these; the natural language layer handles them.

How does running tests on real devices differ from simulators for AI testing? AI can find bugs on real devices that simulators never surface: different screen sizes, OS-level quirks, hardware behavior, keyboard interactions, and performance differences. Chris’s example was instructive: a test failed on a small-screen device because the keyboard covered some elements, and Claude diagnosed and fixed it without being prompted. That kind of failure only shows up on specific hardware, which is the value of real device coverage that simulators can’t replicate.

Hear the full conversation on the TestGuild Automation Podcast — Claude AI Mobile Testing, Run Real Device Tests with AI EP 586

Disclosure: Kobiton is a TestGuild sponsor. As always, opinions are my own and I only work with tools I’d actually recommend.
