Why Mobile E2E Tests Flake (and How to Stop the Cycle)

Mobile E2E tests fail for different reasons than web tests: emulator drift, gesture timing, system dialogs, and device fragmentation. Learn why script-based fixes compound the problem and what actually stops the cycle.

Dhaval Shreyas

CEO & Co-founder at Pie

Posted May 27, 2026

Retries are not a fix. Neither are longer timeouts, quarantine lists, or that shared Slack channel where your team posts “known flaky, ignore.”

You already know this. What you may not know is why mobile E2E tests flake in ways that web tests simply don’t, and why every solution borrowed from the web playbook makes the problem structurally worse over time.

I spent 13 years building mobile infrastructure at Square, Facebook, and Instacart. In every engineering org, the same pattern played out: a team inherits an end-to-end test suite, adds retries when tests start failing, extends timeouts when retries aren’t enough, and eventually stops trusting the suite entirely. When CI goes red, nobody investigates. Nobody knows if it’s a real failure or just Tuesday.

The problem isn’t discipline. The problem is that mobile E2E tests have four failure modes that don’t exist on the web, and every web-first fix imports the wrong mental model.

According to the Bitrise Mobile Insights 2025 report, the proportion of mobile development teams experiencing test flakiness grew from 10% in 2022 to 26% in 2025, across more than 10 million builds over three years. Flakiness isn’t getting better as tooling matures. It’s getting worse as pipelines get more complex.

Understanding why requires starting with what’s actually different about the mobile environment.

What you’ll learn

Why mobile E2E flakiness is structurally different from web flakiness
Four failure modes that only exist in mobile test environments
Why script-level fixes (retries, timeouts, quarantine) compound the problem
How self-healing tests address the architectural root cause

Why Mobile Flakiness Is a Different Problem

Web automation runs inside a browser. The browser is a sandboxed, largely deterministic environment: the DOM is queryable, animations can be disabled, network requests can be intercepted, and the rendering engine behaves predictably across runs. When a web test flakes, the causes are well understood: async wait issues, test order dependency, shared state pollution. They’re covered in every testing guide on the internet.

Mobile automation runs on top of a full operating system, and that difference changes every assumption borrowed from web testing.

A mobile app test has to contend with OS-level interruptions your test script never triggered. It has to deal with a gesture model where a 10-millisecond difference in tap timing can produce different results. It has to survive CI emulators that behave differently on run 47 than they did on run 1. And it has to produce consistent results across a device ecosystem where Android alone runs across thousands of manufacturers and dozens of active OS versions.

Borrowing web fixes for mobile flakiness is like using a brake repair manual to diagnose an aircraft hydraulics problem. The vocabulary overlaps. The physics don’t.

The fixes that work for web flakiness (standardize async wait patterns, isolate test state, enforce test order) are correct for the web causes. On mobile, they address maybe half the problem. The other half has different root causes that require different solutions.

Four Failure Modes Web Tests Never See

1. Emulator Behavioral Drift

Most CI pipelines run tests on emulators or simulators rather than real devices. Emulators are software simulations of hardware. Like any software running in a resource-constrained CI environment, their behavior can degrade over the course of a single test run.

Early in a suite, the emulator is responsive. GPU acceleration is warm, memory is clean, the virtual CPU has headroom. Forty test cases later, the same emulator may be running under memory pressure, starved of CPU because the host is throttled or oversubscribed, or dropping gesture events because the event queue backed up. Your test code is identical. The environment has drifted.

Web CI environments don’t have this problem at the same scale because a browser instance is comparatively lightweight and cheap to restart with a clean profile. An emulator is running a full mobile OS: camera drivers, Bluetooth stacks, network simulators, GPS emulation. All of it, even when your test only touches a login screen.

The symptom: tests that pass reliably at position 1-15 in the suite start failing inconsistently at position 30-50. Developers add retries. The retries occasionally work because the second attempt catches the emulator in a less-degraded state. The root cause (that the emulator is changing behavior mid-run) never gets addressed.
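
One way to check whether a suite shows this pattern is to group historical results by where each test ran in the suite. The sketch below is illustrative only: the `RunRecord` type, the bucket size of 15, and the sample history are assumptions made up for the example, not output from any particular CI system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    position: int  # 1-based order in which the test ran within the suite
    passed: bool


def failure_rate_by_bucket(records, bucket_size=15):
    """Group results by suite position and return {bucket_start: failure_rate}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        start = ((record.position - 1) // bucket_size) * bucket_size + 1
        totals[start] += 1
        if not record.passed:
            failures[start] += 1
    return {start: failures[start] / totals[start] for start in sorted(totals)}


# Hypothetical history: three runs of a 45-test suite in which only tests
# that run after position 30 ever fail, and not the same ones each run.
history = [
    RunRecord(position=p, passed=not (p > 30 and (p + run) % 3 == 0))
    for run in range(3)
    for p in range(1, 46)
]

rates = failure_rate_by_bucket(history)
for start, rate in rates.items():
    print(f"positions {start}-{start + 14}: {rate:.0%} failed")
```

If the failure rate climbs with suite position while the failing tests themselves keep changing, the environment is a more likely suspect than any individual test.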

2. Gesture-Timing Sensitivity

Mobile interaction is fundamentally gesture-based. A tap is not a click. A swipe is not a scroll. The difference matters more than it sounds.

Browser clicks fire a discrete, well-defined sequence of events (mousedown, mouseup, click) that the DOM handles predictably. Mobile gestures are streams of touch events with timing properties that the OS interprets contextually. A 100-millisecond hold is a tap. A hold of around 500 milliseconds triggers a long-press context menu. A swipe that starts 5 pixels to the left of where the test expected to start may register as a scroll rather than a swipe on a specific device or OS version.

Frameworks like Appium expose gesture APIs, but the underlying timing behavior varies between iOS and Android, between OS versions, and between emulators and real devices. A gesture command that works reliably on your local simulator may fail on a CI emulator running on a different host machine with different virtualization characteristics.
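
The timing dependence can be sketched as a toy classifier. The thresholds below (`LONG_PRESS_MS`, `MOVE_SLOP_PX`) are placeholder values chosen for illustration; real values differ by platform, OS version, and device, and real gesture recognizers are considerably more involved.

```python
from dataclasses import dataclass

# Illustrative thresholds only, not taken from any specific OS.
LONG_PRESS_MS = 500
MOVE_SLOP_PX = 8


@dataclass(frozen=True)
class Touch:
    hold_ms: int     # time between touch-down and touch-up as the OS saw it
    moved_px: float  # distance travelled while the finger was down


def classify(touch: Touch) -> str:
    """Return the gesture a simplified recognizer would report."""
    if touch.moved_px > MOVE_SLOP_PX:
        return "swipe"
    if touch.hold_ms >= LONG_PRESS_MS:
        return "long_press"
    return "tap"


# The test asks for a 100 ms tap in both cases. On a backed-up emulator the
# touch-up event is delivered 420 ms late, so the OS observes a 520 ms hold.
intended = Touch(hold_ms=100, moved_px=0)
delayed = Touch(hold_ms=100 + 420, moved_px=0)

print(classify(intended))  # tap
print(classify(delayed))   # long_press
```

The test command is the same in both cases. Only the delivery delay differs, and that delay is a property of the environment rather than of the script.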

Web tests don’t have this problem. The click event is abstracted at the browser layer. Mobile developers have no equivalent abstraction. They’re working directly with an OS gesture model that varies by implementation.

3. System Permission Dialogs

iOS and Android surface OS-level permission dialogs during app flows: camera, microphone, location, notifications, contacts, Bluetooth. Your test script doesn’t ask for them. They appear when the app requests a permission and the operating system decides, based on state the test doesn’t control, that the user has to be prompted.

When a permission dialog appears mid-test, it covers the UI. Your test script tries to tap an element that’s now obscured. The tap fails, the element can’t be found, and the test reports a failure that has nothing to do with the code under test.

The timing of these dialogs is not fully deterministic. A clean emulator state may not show the notification permission dialog on run 1, but show it on run 3 when the app has been installed long enough for the OS to surface it. Teams handle this by adding dialog-dismissal logic to test setup. But OS behavior changes across iOS and Android versions, and setup scripts that work on one OS version fail silently on another.

Flaky tests in this category are particularly expensive to debug because the failure report shows the wrong element (the one the script was trying to tap), rather than the dialog that was blocking it.
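
A minimal sketch of how a test helper could report the blocker instead of the blocked element. Everything here is hypothetical: `FakeScreen` stands in for whatever driver a suite uses, and `guarded_tap` is an illustrative helper, not part of Appium or any other framework.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


class BlockedByDialog(Exception):
    """The target may exist, but a system dialog is covering the UI."""


@dataclass
class FakeScreen:
    elements: Set[str]
    dialog: Optional[str] = None  # title of a system dialog on top, if any
    tapped: List[str] = field(default_factory=list)

    def raw_tap(self, element_id: str) -> None:
        # Mimics the misleading failure: a covered element looks "not found".
        if self.dialog is not None or element_id not in self.elements:
            raise LookupError(f"element not found: {element_id}")
        self.tapped.append(element_id)

    def dismiss_dialog(self) -> None:
        self.dialog = None


def guarded_tap(screen: FakeScreen, element_id: str, dismissable=frozenset()) -> None:
    """Check for a blocking dialog first, so the failure names the real cause."""
    if screen.dialog is not None:
        if screen.dialog not in dismissable:
            raise BlockedByDialog(
                f"'{element_id}' is covered by system dialog: {screen.dialog}"
            )
        screen.dismiss_dialog()
    screen.raw_tap(element_id)


screen = FakeScreen(elements={"login_button"}, dialog="Allow notifications?")
try:
    guarded_tap(screen, "login_button")
except BlockedByDialog as err:
    print(err)  # the report now points at the dialog, not the button
```

This only improves the failure message. The list of dialogs that are safe to dismiss still has to be maintained per OS version, which is the maintenance cost described above.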

4. Network State and Device Fragmentation

Mobile apps operate across network conditions that web apps largely don’t deal with: WiFi to cellular handoffs, backgrounding during network requests, carrier-specific DNS behavior, and variable latency that changes with signal quality.

A test that passes on a WiFi-only emulator may fail on a real device that switches to cellular mid-test. A test that handles a 200ms API response may fail on a CI environment with different network simulation settings. These aren’t async wait problems. They’re environment fidelity problems: the test environment doesn’t match the real operating conditions.

Device fragmentation compounds this further. Android’s ecosystem spans budget devices on Android 9 running on 1GB of RAM to current flagship hardware. A test designed against an emulator running Android 14 on a virtual high-spec machine will encounter render timing differences, memory management differences, and garbage collection behavior differences on real budget hardware. Screen density differences alone can cause tap coordinates that work on one device to miss their target on another.
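
The screen density point can be made concrete with Android’s density-independent pixel conversion, px = dp × (dpi / 160). The button position and the two device densities below are made-up values for illustration, and the sketch assumes the layout is identical in dp on both devices, which real layouts often are not.

```python
def dp_to_px(dp: float, dpi: int) -> int:
    """Convert density-independent pixels to physical pixels (160 dpi baseline)."""
    return round(dp * dpi / 160)


def px_to_dp(px: float, dpi: int) -> float:
    """Convert physical pixels back to density-independent pixels."""
    return px * 160 / dpi


# Hypothetical button centred at (180dp, 600dp) in the layout.
button_dp = (180, 600)

# Pixel coordinates recorded against a 320 dpi emulator.
recorded_px = tuple(dp_to_px(v, 320) for v in button_dp)
print(recorded_px)  # (360, 1200)

# Replaying those same pixels on a 480 dpi device lands somewhere else.
landed_dp = tuple(px_to_dp(v, 480) for v in recorded_px)
print(landed_dp)  # (120.0, 400.0)
```

The recorded tap lands 60dp to the left of the button and 200dp above it on the denser screen, even though nothing about the app changed.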

Comprehensive mobile app testing requires accounting for this variance. Selector-based test suites do not. They assume the view hierarchy is stable across devices, which it often isn’t.

Why Script-Based Fixes Make It Worse

The standard response to mobile flakiness follows a predictable escalation. Retries first, because they’re the easiest lever. Then timeout increases, because extra time usually lets tests hit by emulator drift or gesture-timing issues pass eventually. Then quarantine, because some tests are so reliably unreliable that the team stops expecting them to pass.

Each of these has a real cost that doesn’t show up in the CI dashboard.

Retries mask real failures. A test that passed on the third retry may be a flaky test that got lucky, or it may be a real regression that your retry logic covered up. Google’s foundational 2016 research found that 84% of the pass-to-fail transitions in their suite involved a flaky test. The rest did not, so a team that reads every red-then-green result as “just flakiness” will sooner or later wave a real regression through. The confidence signal the test suite is supposed to provide stops being trustworthy.
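
The masking effect is easy to quantify under a simplifying assumption. Suppose a real regression only shows up intermittently, so each attempt passes with probability p, and suppose attempts are independent (on a drifting emulator they usually are not). The chance that at least one of n attempts passes is 1 − (1 − p)^n. The function name and the sample probabilities below are illustrative.

```python
def pass_probability_with_retries(p_pass: float, max_attempts: int) -> float:
    """Chance that at least one attempt passes, assuming independent attempts."""
    if not 0.0 <= p_pass <= 1.0:
        raise ValueError("p_pass must be between 0 and 1")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return 1 - (1 - p_pass) ** max_attempts


# A regression that breaks the flow half the time still goes green
# on most builds once three attempts are allowed.
print(pass_probability_with_retries(0.5, 1))  # 0.5
print(pass_probability_with_retries(0.5, 3))  # 0.875
```

Under these assumptions, a bug that fails the test on every other attempt would be reported as a pass on seven builds out of eight.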

Timeout increases make suites slower without fixing root causes. A gesture timeout extended from 3 seconds to 8 seconds helps in emulator drift situations where the emulator needed more time to process the gesture event. But it also adds up to 5 seconds to every run of that test in which the wait runs to its limit. The buffers add up: a suite of 200 tests that each pick up 3 seconds of extra waiting takes 10 more minutes to run. Longer feedback cycles mean developers wait longer to discover failures, which means they’ve already moved on to the next task when the failure surfaces.
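
The arithmetic can be sketched in a few lines. The model below is a simplification with hypothetical function names: it distinguishes a fixed sleep, which always costs the full timeout, from a polling wait, which costs the full timeout only when the UI is slow or never becomes ready.

```python
def wait_cost_s(ready_after_s: float, timeout_s: float, polling: bool = True) -> float:
    """Seconds one wait step costs.

    A polling wait returns as soon as the UI is ready, so the full timeout is
    only paid when the UI is late. A fixed sleep always pays the full amount.
    """
    return min(ready_after_s, timeout_s) if polling else timeout_s


def extra_suite_minutes(test_count: int, extra_seconds_per_test: float) -> float:
    """Minutes added to a suite when every test picks up the same extra wait."""
    return test_count * extra_seconds_per_test / 60


# UI ready after 1 second, timeout raised from 3 to 8 seconds.
print(wait_cost_s(1.0, 8.0))                  # 1.0 (polling: no extra cost)
print(wait_cost_s(1.0, 8.0, polling=False))   # 8.0 (fixed sleep: always paid)
print(wait_cost_s(20.0, 8.0))                 # 8.0 (degraded emulator: paid in full)

# 200 tests that each gain 3 seconds of waiting.
print(extra_suite_minutes(200, 3))            # 10.0
```

On a healthy environment a polling wait hides the cost of a longer timeout, which is part of why the slowdown tends to surface late, on exactly the degraded runs where feedback was already slow.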

Quarantine defers the problem to an indefinite future. “We’ll fix the quarantined tests in the next sprint” almost never happens, because fixing them requires understanding the root cause, and understanding the root cause requires environment investigation that takes longer than writing new tests. Quarantine lists grow. The quarantined tests represent real coverage gaps, flows that aren’t being validated in CI. Those gaps are where regressions hide.

The underlying issue with all three approaches is the same: they’re treating a structural problem as a maintenance problem. The flakiness isn’t happening because your team isn’t thorough enough. It’s happening because selector-based testing couples your tests to implementation details of the UI, details that change as the app evolves, as the OS updates, and as the CI environment drifts.

Atlassian’s engineering team found that flakiness affected 15-21% of their builds despite significant investment in detection and monitoring tooling. Detection and monitoring tell you which tests are broken. They don’t tell you why mobile-specific failure modes keep generating new ones.

At some point, the ratio inverts. A QA team that started out spending roughly 90% of its time creating tests and 10% maintaining them ends up, as the debt accumulates, at 10% creating and 90% maintaining. That inversion is the script-fix treadmill at team scale.

Self-Healing Tests: Addressing the Root Cause

The selector-based model (identifying UI elements by CSS selectors, XPath strings, accessibility IDs, or resource IDs) is what makes mobile tests brittle. When those attributes change, the test breaks. When the OS changes how it structures the view hierarchy, the test breaks. When a layout update moves an element 10 pixels and changes its bounds, a tap-coordinate-based test breaks.

Self-healing tests take a different approach: they identify elements by what they look like and what they do, not by implementation attributes. A “Submit” button is recognizable because it says “Submit,” appears in a specific location relative to the form, and responds to a tap. When the button gets redesigned, moves to a new position, or gets a different accessibility ID, a vision-based test still recognizes it the same way a human tester would.

The connection to mobile specifically is that all four failure modes above have roots in the selector model:

Emulator drift causes gesture events to land on wrong elements because the element’s position shifted in the degraded emulator state. Vision-based interaction identifies the element by appearance, not coordinates.
Gesture-timing issues are amplified when tests use hardcoded timing constants. Self-healing systems observe the actual rendered state before interacting, eliminating timing assumptions.
System permission dialogs interrupt selector-based flows because the test script is looking for a specific element path that’s now obscured. Vision-based systems can identify and dismiss dialogs as visual objects without requiring script updates for each OS version’s dialog format.
Device fragmentation breaks selector-based tests when layout renders differently across device sizes. Vision-based tests see what the user sees, a button labeled “Sign In” at a certain position on screen, and adapt to layout differences automatically.
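
To make the contrast concrete, here is a deliberately small sketch of locating a tap target by what is visible rather than by an internal ID. It is not how Pie or any specific product works. `VisibleElement` stands in for the output of some detection step (OCR or a vision model, not implemented here), and all names and coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class VisibleElement:
    label: str        # text as rendered on screen
    x: int            # left edge in pixels
    y: int            # top edge in pixels
    width: int
    height: int
    internal_id: str  # implementation detail, deliberately never consulted


def find_by_label(elements: Iterable[VisibleElement], label: str) -> Optional[Tuple[int, int]]:
    """Return the centre of the first element whose visible text matches."""
    wanted = label.strip().casefold()
    for element in elements:
        if element.label.strip().casefold() == wanted:
            return (element.x + element.width // 2, element.y + element.height // 2)
    return None


# Same button on two layouts: different position, size, casing, and internal ID.
phone = [VisibleElement("Sign In", 40, 900, 280, 60, "btn_login")]
tablet = [VisibleElement("Sign in", 500, 400, 300, 80, "auth_submit_v2")]

print(find_by_label(phone, "Sign In"))   # (180, 930)
print(find_by_label(tablet, "Sign In"))  # (650, 440)
```

The lookup key is the one thing the user also sees. A renamed ID or a moved button changes the returned coordinates, not whether the element is found.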

Self-healing isn’t a feature you bolt onto an existing Appium or Detox suite. It’s a different test execution architecture. The tests still validate the same user flows. They just stop relying on the implementation details that make those flows hard to maintain.

For teams evaluating this transition, the relevant question isn’t “will self-healing tests eliminate all flakiness?” The answer to that is: no. Network failures and genuine environment problems still produce test failures. The question is whether flakiness from UI changes, device fragmentation, and selector brittleness (the majority of mobile-specific flakiness) becomes a thing you fix once at the architectural level rather than a maintenance tax you pay every sprint.

Autonomous QA platforms built on vision-based execution extend this further: tests don’t require manual scripting, so the maintenance surface shrinks. When a new screen is added to a flow, the test suite discovers and covers it without a developer needing to update selectors. When an existing flow changes, the test adapts rather than breaking.

The agentic AI testing model takes this to its logical conclusion: a test agent that understands your app as a user would, interacts with it the way a user would, and detects failures the way a user would. No selectors to maintain. No timing constants to calibrate. No quarantine list to pretend you’ll address someday.

Stop Patching. Change the Foundation.

The cycle is predictable: flakiness shows up, retries get added, timeouts get extended, the suite slows down, trust erodes, and eventually the team stops treating CI red as a signal worth investigating. That’s not a discipline failure. It’s what happens when you apply web testing assumptions to mobile testing problems.

Mobile E2E tests fail for reasons that don’t exist in web automation. Emulator drift, gesture timing, system dialogs, and device fragmentation are mobile-specific problems. They require mobile-specific structural fixes, not more layers of script-level workarounds.

The structural fix is removing the selector dependency that makes tests brittle by design. Vision-based, self-healing tests stop breaking when UI changes. They don’t require updated accessibility IDs when a screen gets redesigned. They don’t need recalibrated timing constants when a new device profile enters your matrix. They adapt, because they see what users see rather than querying what developers implemented.

The maintenance treadmill doesn’t have a sustainable pace. It has an exit.

Frequently Asked Questions

Mobile tests face failure modes that don't exist in web automation: emulator behavioral drift under resource pressure, gesture-timing sensitivity, OS-level permission dialogs interrupting test flows, and network state changes between WiFi and cellular. Web tests run in a controlled browser environment. Mobile tests run on top of a full operating system with unpredictable state.

Retries reduce visible failure counts but don't fix the underlying problem. A test that passes on the third retry is still a flaky test. You've just moved the failure from your CI dashboard to your maintenance backlog. Retries also mask real bugs, increasing the chance a genuine regression slips through.

Emulator drift happens when a CI emulator changes behavior over the course of a run due to resource starvation, accumulated state, or throttling on the host machine. An emulator that was responsive for your first ten test cases may start dropping gesture events or timing out on animations mid-suite. Unlike real device inconsistency, drift is invisible. The test code is identical; the environment isn't.

iOS and Android surface OS-level dialogs during test runs: permission requests for camera, location, notifications, and Bluetooth. None were triggered by your test script. The dialog blocks the UI, your test tries to tap an element that's now hidden, and the run fails. Web automation largely avoids this problem because browser permission prompts can usually be granted or suppressed up front through the automation tool's configuration.

Self-healing mobile test automation uses vision-based interaction, identifying elements by how they look and what they do, rather than selectors or XPath strings. When a button moves, gets renamed, or a screen gets redesigned, a self-healing test adapts rather than failing. It removes the architecture dependency that makes selector-based tests brittle.

Selector-based testing identifies elements by internal attributes: accessibility IDs, XPath expressions, or resource IDs. When any of those attributes change, even without a functional change to the app, the test breaks. Vision-based testing identifies elements by what they look like and where they are on screen, the same way a human tester would. UI changes that don't change the user's experience don't break vision-based tests.

Self-healing tests also hold up across device sizes and OS versions. Because they interact with what's visible on screen rather than internal DOM or view hierarchy attributes, they adapt to layout differences between devices. A button that renders at a different position on a tablet vs. a phone is still recognizable. The test doesn't need to be rewritten for each device profile.

Quarantine removes a flaky test from the main suite until someone fixes it. It's a deferral strategy, not a fix. Self-healing changes the underlying mechanism so tests don't become flaky in the first place when the UI changes. Quarantine manages the symptom. Self-healing addresses the cause.

Dhaval Shreyas

CEO & Co-founder at Pie

13 years building mobile infrastructure at Square, Facebook, and Instacart. Now building the QA platform he wished existed the whole time.