
Autonomous Test Discovery: How It Actually Works

AI agents now explore apps on their own to find untested flows. Here's how autonomous test discovery works, where it wins, and what it still can't do.

Adithya Aggarwal
CTO & Co-founder at Pie

AI is generating code faster than engineers can specify what to test. Cursor, Claude Code, and GitHub Copilot are real. The scope of what needs testing expands with every feature. The team writing tests stays the same size.

Tools claiming “AI-powered discovery” still require humans to define test boundaries first. True discovery is different. You give agents eyes and intuition. They look at a screen, infer what each element is for, and decide what to explore next, not because someone mapped it for them, but because they can read the interface.

In this guide, you’ll learn:

  • Why mobile testing tools plateau at 30% coverage, and what breaks past it
  • What autonomous test discovery actually means vs. marketing claims
  • Why vision-based agents outperform DOM-based tools for discovery
  • How the multi-model pipeline behind vision testing works
  • What autonomous discovery still misses, and how to close those gaps
  • How Fi cut release validation from days to hours using autonomous discovery

The 30% Coverage Ceiling

Mobile UI testing tools plateau at roughly 30% activity coverage. That is the central finding of a January 2026 arXiv paper that benchmarked state-of-the-art automation across real apps. Seventy percent of flows go untested.

The number is specific to mobile, but the structural problem applies anywhere. Tools find the obvious, scripted paths fast. Conditional UI, auth-gated features, multi-step sequences, and locale-specific behavior require curiosity and context. Historically, only humans brought that to discovery.

The ceiling isn’t a bug in any particular tool. It is a consequence of how discovery has always worked. You start with what you know. You script what you found. You skip what you didn’t think to check. The flows you never discovered are the ones you don’t know you missed.

Autonomous discovery is built to solve that problem. Not faster execution of known tests. Actual discovery of flows nobody documented.

What Is Autonomous Discovery?

Autonomous discovery is the process by which AI agents explore an application on their own, identify what can be tested, and generate test cases automatically, without scripts, recorded sessions, or guided paths. The agents see the app the way a user does, infer intent from the interface, and produce semantic test scenarios as output.

Three mechanisms make it work, and the tools that claim autonomous discovery without all three are stretching the word.

1. Exploration

Agents navigate your application the way a user would. They click buttons, fill forms, follow links, handle popups. There is no predefined path. Each agent explores different routes through your app, building a map of possible user journeys.

2. Observation

As agents explore, they observe everything. Which elements appear on each screen, how the UI responds to interactions, what state changes follow each action. These observations become the raw material for test cases. An agent that notices clicking “Add to Cart” increased the cart count by one has just documented a testable assertion.

3. Generation

Structured test scenarios emerge from exploration and observation. The output isn’t “click at coordinates (340, 220).” It’s “Add product to cart and verify cart count increases.” Test cases are semantic, readable, and tied to actual user behavior rather than brittle implementation details.
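
The three mechanisms can be sketched on a toy model. This is a minimal illustration, not Pie's implementation: `TOY_APP`, `discover`, and the screen and action names are all hypothetical, and a real agent would read rendered screens rather than a dictionary. The sketch explores breadth-first, records any action that changes visible state, and emits a readable scenario instead of coordinates.

```python
from collections import deque

# Hypothetical toy app: each screen maps a visible action label to
# (next_screen, state_change). None of these names come from a real product.
TOY_APP = {
    "home": {"Open product": ("product", None)},
    "product": {
        "Add to Cart": ("product", ("cart_count", +1)),
        "View cart": ("cart", None),
    },
    "cart": {"Checkout": ("checkout", None)},
    "checkout": {},
}


def discover(app, start="home"):
    """Breadth-first exploration that records observed state changes."""
    scenarios = []
    seen = {start}
    queue = deque([(start, [])])  # (screen, actions taken to reach it)
    while queue:
        screen, path = queue.popleft()
        for action, (next_screen, change) in app[screen].items():
            steps = path + [action]
            if change is not None:
                # Observation: the action changed visible state, so it
                # becomes a semantic assertion rather than a coordinate.
                field, delta = change
                direction = "increases" if delta > 0 else "decreases"
                scenarios.append(
                    f"{' > '.join(steps)}: verify {field} {direction} by {abs(delta)}"
                )
            if next_screen not in seen:
                seen.add(next_screen)
                queue.append((next_screen, steps))
    return seen, scenarios


screens, scenarios = discover(TOY_APP)
print(sorted(screens))
print(scenarios)
```

The output lists every screen the explorer reached plus one scenario describing the cart assertion in plain language.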

Why Most AI Testing Tools Aren’t Truly Autonomous

Strip away the marketing, and the dominant approach still reads the DOM. HTML elements, CSS selectors, JavaScript state. The agent finds a button by its class name. Rename the class? Test breaks. Restructure components? Tests break. Switch frameworks? Start over.

Three failure modes are baked into the DOM-based approach.

1. Selector fragility is built into every test

When discovery depends on selectors, fragility lives inside every test. An agent finds a button with id="submit-btn". A developer renames it id="checkout-submit". Discovery becomes rediscovery becomes maintenance. A React refactor that changes nothing functional can break hundreds of tests overnight.
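
A small sketch shows the rename problem. The two ids come from the example above; the "Place order" label, the dictionary shape, and both helper functions are assumptions made for illustration. The lookup keyed on the id fails after the rename, while a lookup keyed on what the user sees still succeeds.

```python
# Two hypothetical snapshots of the same button, before and after a rename.
BEFORE = [{"id": "submit-btn", "label": "Place order", "role": "button"}]
AFTER = [{"id": "checkout-submit", "label": "Place order", "role": "button"}]


def find_by_id(elements, element_id):
    """Selector-style lookup: depends on an implementation detail."""
    return next((e for e in elements if e["id"] == element_id), None)


def find_by_label(elements, label):
    """User-facing lookup: depends on what is visible on screen."""
    return next(
        (e for e in elements if e["role"] == "button" and e["label"] == label),
        None,
    )


print(find_by_id(BEFORE, "submit-btn") is not None)    # found before the rename
print(find_by_id(AFTER, "submit-btn") is not None)     # lost after the rename
print(find_by_label(AFTER, "Place order") is not None) # still found by label
```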

2. Crawlers see pages, not user journeys

Crawling URLs generates impressive coverage reports, but pages aren’t behavior. A checkout flow spans five pages. Login-protected features require authentication state. Multi-step forms need sequential completion. Crawlers see structure. They miss what the user actually does.

3. State blindness breaks anything past the login screen

Authenticated flows, conditional UI, and multi-step processes all require memory. What happened before affects what happens next. Tools that see pages in isolation have no understanding of session state or user context. State is one of the genuinely hard problems in autonomous discovery, and most tools don’t acknowledge it.
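
The difference between a stateless crawl and stateful exploration can be sketched with a small graph. Everything here is hypothetical: the `LINKS` map, the requirement labels, and the `reachable` helper are stand-ins, and a real app gates screens in far more ways. The point is only that an explorer that cannot carry a session never sees anything behind the login.

```python
# Hypothetical app graph: each link may require session state to follow.
LINKS = {
    "landing": [("login", None), ("pricing", None)],
    "login": [("dashboard", "submit_credentials")],
    "dashboard": [("billing", "requires_session")],
    "pricing": [],
    "billing": [],
}


def reachable(start, carries_state):
    """Return the set of screens reachable from start.

    carries_state=False models a crawler that fetches pages in isolation;
    carries_state=True models an agent that logs in and keeps its session.
    """
    seen, stack = {start}, [start]
    while stack:
        page = stack.pop()
        for target, requirement in LINKS[page]:
            if requirement is not None and not carries_state:
                continue  # the stateless crawler cannot get past this gate
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


print(sorted(reachable("landing", carries_state=False)))
print(sorted(reachable("landing", carries_state=True)))
```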

How Pie Does It Differently

Humans look at a screen. They see a “Sign In” button. They know it’s a button because it looks like one and because the context makes the intent obvious. They click it without knowing its class name.

So we gave tests the same thing. Eyes.

Our agents see applications the way users do. Computer vision, not DOM inspection. They look at rendered screens, identify elements visually, and understand context from what’s displayed. Vision-based discovery changes what’s possible:

  1. Framework-agnostic. React, Vue, Angular, Rails, that legacy app nobody wants to touch. If it renders, we can explore it.
  2. Handles the unexpected. Cookie banners, chat widgets, promo popups. Agents do what humans do. See it. Handle it. Keep going.
  3. Parallel exploration at scale. Humans explore one path at a time. Pie deploys hundreds of parallel agents. On a complex consumer app, a single session can involve over a thousand discrete agent actions. Days of manual exploration happen in 30 minutes.
  4. Self-healing. UI changes, buttons move, fields get reordered. Visual understanding persists. Self-healing tests adapt automatically because they never encoded a locator to break.

The Pipeline Behind Vision Testing

Vision gets you to the door. The pipeline is what gets you inside CI.

The naive approach is one model end to end. Take a screenshot, send it to a vision model, get back coordinates, tap. Latency: 300ms per step. At 15-20 steps per test case, that’s 4.5-6 seconds of model latency before a single assertion runs. At scale, unacceptable.

1. A specialized pipeline, not a single model

Element location is one problem. Action determination is another. Deciding whether the screen is ready for the next step is a third. We built a pipeline where each stage uses a specialized model optimized for that task. The outputs combine before any action is taken. The pipeline is why it’s accurate. The specialization is why it’s fast enough to live inside a CI run.
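
One way to picture the split is three small functions whose outputs are combined into a single decision. This is a sketch under heavy assumptions: the stage functions are stubs standing in for models, the `Decision` type and the dictionary-shaped screen are invented for the example, and running the stages concurrently is an assumption rather than something described above.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Decision:
    ready: bool
    action: str
    target: Optional[Tuple[int, int]]


# Stub stages standing in for specialized models. Real stages would take a
# screenshot; these take a plain dict so the sketch stays runnable.
def locate_element(screen):
    return screen["elements"].get(screen["goal"])


def choose_action(screen):
    return "tap" if screen["goal"] in screen["elements"] else "scroll"


def is_ready(screen):
    return not screen["loading"]


def decide(screen):
    """Run the three stages concurrently, then combine before acting."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        location = pool.submit(locate_element, screen)
        action = pool.submit(choose_action, screen)
        ready = pool.submit(is_ready, screen)
        return Decision(ready.result(), action.result(), location.result())


screen = {"goal": "Sign In", "elements": {"Sign In": (340, 220)}, "loading": False}
print(decide(screen))
```

Keeping the stages separate means each one can be swapped or tuned without touching the others, and nothing is tapped until all three have answered.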

2. LLM judgment replaces hardcoded waits

Most frameworks handle timing with sleep() calls. Wait two seconds after submitting a form, hope the confirmation appears. We replaced that with LLM judgment. After every action, the model looks at the current screenshot and decides whether the app is ready for the next step. Splash screen still animating? Wait. API response not populated? Wait. Alert appeared unexpectedly? Handle it. No hardcoded delays, no timing-based flakiness.
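
The readiness check can be sketched as a loop around a judge function. This is an illustration, not Pie's code: `wait_until_ready`, `take_screenshot`, and `judge` are hypothetical names, and in a real system `judge` would be a model call on a screenshot rather than a string comparison.

```python
import time


def wait_until_ready(take_screenshot, judge, timeout_s=10.0, poll_s=0.25):
    """Poll until judge(screenshot) returns "ready", or give up at timeout.

    take_screenshot and judge are caller-supplied callables; both are
    hypothetical stand-ins for a device driver and a model call.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if judge(take_screenshot()) == "ready":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)  # short poll interval, not a fixed wait for the app


# Usage with stubs: the app shows a splash screen twice, then the home screen.
frames = iter(["splash", "splash", "home"])
ok = wait_until_ready(
    take_screenshot=lambda: next(frames),
    judge=lambda shot: "ready" if shot == "home" else "wait",
    timeout_s=1.0,
    poll_s=0.0,
)
print(ok)
```

The loop returns as soon as the judge reports readiness, so a fast screen costs one check and a slow one costs only as long as it actually takes. The timeout bounds the worst case instead of defining the normal case.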

3. Session state runs in isolation per agent

A checkout flow requires an agent to stay logged in, remember what’s in the cart, and carry payment context across five screens. Parallel agents running simultaneously can’t share session state without interfering with each other. Our architecture gives each agent its own isolated session context.

We got this wrong early. An early session persistence feature caused test failures from state bleed between parallel runs. The failure taught us the right architecture.
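
A minimal sketch of per-agent isolation, assuming a thread pool and a hypothetical `SessionContext` type (neither is confirmed as Pie's architecture). Each agent builds its own context inside its own call, so nothing one agent writes can appear in another agent's cart.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class SessionContext:
    """Per-agent state. Nothing here is shared between agents."""
    agent_id: int
    logged_in: bool = False
    cart: list = field(default_factory=list)  # fresh list per instance


def run_agent(agent_id, items):
    # The context is created inside the agent and never handed to another one.
    session = SessionContext(agent_id)
    session.logged_in = True
    for item in items:
        session.cart.append(item)
    return session


# Hypothetical plans: agent id mapped to the items that agent adds.
plans = {1: ["collar"], 2: ["leash", "tag"], 3: []}
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda kv: run_agent(*kv), plans.items()))

for session in results:
    print(session.agent_id, session.cart)
```

A single module-level cart shared by all three agents would reproduce the kind of state bleed described above, because every agent would append to the same list.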

How Fi Cut Release Validation from Days to Hours

Fi makes smart dog collars. GPS tracking, activity monitoring, escape alerts. Millions of dogs depend on their platform working. Reliability isn’t optional.

When Fi deployed autonomous discovery, agents explored across iOS, Android, and web. They found flows nobody had documented. Collar disconnects mid-walk. Language settings changing between sessions. Edge cases that existed but weren’t in any test plan.

Customer Result

“Release validation went from two to three days to just a few hours. The way Pie set up allowed Fi to work alongside development without changing processes.” — Philip Hubert, Director of Mobile Engineering, Fi

Philip’s team didn’t write new scripts. They didn’t map flows by hand. Agents explored the app, found what mattered, and Fi shipped faster without adding headcount.

In Pie’s onboarding data across consumer mobile apps, first discovery typically reaches 60–80% coverage within 30 minutes. That benchmark is a starting point, not a ceiling. Discovery expands over time as agents encounter more of the app.

See Discovery on Your App

Upload your APK or point us at your staging URL. Agents map your app in 30 minutes.


Approaches Compared

Manual exploration, script-based automation, and autonomous discovery solve different problems. Comparing them along the same axes makes the trade-offs visible.

| Aspect | Manual Testing | Script-Based Automation | Autonomous Discovery |
| --- | --- | --- | --- |
| Discovery Method | Human exploration | Human writes scripts | AI agents explore independently |
| Coverage | Limited by time | Limited to scripted paths | 60–80% on day one, expands over time |
| UI Change Impact | Re-explore manually | Fix broken selectors | Self-corrects via vision |
| Maintenance | N/A | 30–40% of sprint time | Zero |
| Edge Cases | Depends on tester curiosity | Only if scripted | Discovered automatically |
| Time to First Coverage | Days to weeks | Weeks to months | 30 minutes |

The maintenance row lands hardest in practice. Script-based automation consumes 30-40% of sprint time keeping tests aligned with the UI. Not expanding coverage. Just maintaining what already exists. Vision-based agents adapt when the UI changes, and that overhead disappears.

What Discovery Still Misses

Autonomous discovery is the most thorough exploration approach available. It still has real limits, and understanding them makes it more effective.

  • Complex multi-factor authentication. Agents handle standard login and many OTP flows. Flows that require physical device confirmation, like hardware security keys or specific biometric gestures on real hardware, can’t be fully automated. Agents will reach the auth wall and stop.
  • Sensor-dependent gestures. Flows triggered by gyroscope interactions, ARKit/ARCore features, or pressure-sensitive touch are hardware-dependent. Agents can navigate taps and swipes, but can’t trigger sensor-specific events.
  • Account-state-gated flows. Some flows only appear after 90 days on the platform, with a specific subscription tier, or after a prior onboarding step. Agents start fresh and will miss these unless you pre-seed the state in your test environment.
  • Real payment flows. Flows that require live payment processor responses or real-money subscription states need special test account setup. Discovery maps what it can navigate.

The honest answer is that agents surface what they found and flag where their confidence gaps are. Teams can review the discovery map, identify low-coverage areas, and point agents at missed flows with the right context. Specific account states, credentials, seed data. Targeted rediscovery with better setup closes most gaps.
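
The review step can be sketched as a simple triage over the discovery map. All names below are hypothetical (`DiscoveredArea`, `SEED_HINTS`, `plan_rediscovery`, the confidence scores and blocker labels); the sketch only illustrates separating gaps that better setup can close from gaps that need real hardware or a human.

```python
from dataclasses import dataclass


@dataclass
class DiscoveredArea:
    name: str
    confidence: float     # 0.0 to 1.0, as reported by a hypothetical agent
    blocked_by: str = ""  # e.g. "hardware_key", "subscription_tier"


# Hypothetical mapping from a blocker to the setup that could unblock it.
SEED_HINTS = {
    "subscription_tier": "seed a test account on the required tier",
    "account_age": "seed an account with a backdated creation date",
    "payment": "configure sandbox payment credentials",
}


def plan_rediscovery(areas, threshold=0.5):
    """Split low-confidence areas into retryable and manual-only."""
    retry, manual = [], []
    for area in areas:
        if area.confidence >= threshold:
            continue  # covered well enough, nothing to do
        hint = SEED_HINTS.get(area.blocked_by)
        if hint:
            retry.append((area.name, hint))
        else:
            manual.append(area.name)
    return retry, manual


areas = [
    DiscoveredArea("checkout", 0.9),
    DiscoveredArea("loyalty rewards", 0.2, "account_age"),
    DiscoveredArea("security key login", 0.1, "hardware_key"),
]
retry, manual = plan_rediscovery(areas)
print(retry)
print(manual)
```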

Discovery Is the Layer Worth Fixing

The gap widening in 2026 isn’t execution speed. CI pipelines handle that. The gap is discovery velocity. AI is generating code faster than humans can specify what to test, and every team still relying on manual test writing will feel it.

The 30% ceiling is a research finding, but it shows up in every team that has tried to scale script-based testing. Closing the gap requires agents that map what exists, not just run what engineers happened to document.

We built our autonomous testing platform because we hit every wall in this post. Selectors broke. Replays went stale. Crawlers found pages, not flows. Vision was slow until it wasn’t. The architecture that works today came from understanding what failed first.

If your test suite is more liability than asset, discovery is the layer worth fixing.

Try Autonomous Discovery

No test scripts. No selector maintenance. SOC 2 Type II certified.


Frequently Asked Questions

How much coverage does discovery reach, and how quickly?
In Pie's onboarding data across consumer mobile apps, first discovery typically reaches 60–80% coverage within 30 minutes. Subsequent runs are faster because agents accumulate knowledge of your app's patterns. Getting to 90%+ takes curation: you review what agents found and point them at flows they missed.

Do I need to write test scripts?
No. Upload your APK, zipped app bundle, or staging URL. Agents explore autonomously and generate test cases in plain English. You review and approve. Scripts exist for state setup (seeding data, setting subscription tiers) but are optional. Most teams start without them.

How do agents handle login and authentication?
Agents handle authentication. Provide credentials once, and they navigate authenticated flows just like a real user would. Multi-step auth (OTP, biometrics bypass in test mode, session handoffs) is handled via App Instructions configured per app.

Does it work on mobile apps?
Same approach as web. Vision-based agents work on iOS and Android. We use Appium as the device driver to take screenshots and send taps, but no Appium scripts are written. The intelligence layer works entirely from visual coordinates. Flutter, React Native, native Swift, native Kotlin. All of it.

Does discovery get faster over time?
Yes. Agents accumulate knowledge of your app's specific patterns with every run. Subsequent runs are faster because the system has already mapped common paths and can focus on what's changed. Teams that have run Pie for a few months see noticeably faster cycles than teams in week one.

How does Pie deal with flaky tests?
Timing is actually the biggest source of test flakiness, not selector changes. After every action, Pie checks whether the app is ready for the next step before proceeding. No hardcoded waits, no arbitrary sleep() calls. The agent waits for actual readiness. That eliminates an entire class of failures that selector-based tools handle by guessing.

How are false positives handled?
Our QA team reviews every flagged finding before it reaches you. Real issues only. You can also train Pie by rejecting findings. Each rejection teaches the system that behavior is expected, preventing the same false positive from recurring.

How is Pie different from Selenium or Playwright?
Selenium and Playwright are execution frameworks. You write the scripts, they run them. Discovery is still entirely manual. Pie discovers what to test autonomously, generates the test cases, and executes them. No scripts to write. No selectors to maintain. Playwright is the underlying driver Pie uses for web execution, but it never surfaces to you.


Adithya Aggarwal
CTO & Co-founder at Pie

Eight years building search and delivery systems at Amazon. The kind of scale where flaky tests block billion-dollar releases. Now CTO at Pie, building AI agents that adapt when your UI changes.