Guide

Flaky Tests Explained: Causes, Costs, and How Vision-Based Testing Helps

Flaky tests waste 20+ hours weekly for most teams. Learn why traditional fixes fail and how vision-based testing eliminates flakiness at its source.

Dhaval Shreyas
Co-founder & CEO at Pie
14 min read

What you’ll learn

  • The four root causes of test flakiness and why each one matters
  • Why retries, quarantines, and timeouts make the problem worse over time
  • How vision-based testing decouples tests from implementation details
  • A practical framework to measure your team’s flake rate

A test fails. You check the logs. Nothing in the codebase changed. You hit re-run, and it passes.

The World Quality Report found that organizations dedicate 30-50% of their testing resources to maintaining and updating test scripts. For a team of five QA engineers, that’s the equivalent of losing two full-time engineers to test babysitting.

The instinct is to add retries, increase timeouts, or quarantine the worst offenders. None of it works long-term because the problem runs deeper than individual tests. The architecture of selector-based testing is brittle.

This guide breaks down why tests become flaky, why common fixes backfire, and how vision-based testing solves the problem at its source.

What Makes a Test Flaky

A flaky test produces inconsistent results without any changes to the code being tested. Run it ten times, it passes eight. The two failures have no explanation that makes sense.

The causes fall into four categories:

1. Timing Dependencies

Tests assume operations complete within fixed windows. Network latency spikes. Database queries slow down. CI servers get busy.

A 2024 IEEE study found that 46.5% of flaky tests are “Resource-Affected,” meaning they pass or fail depending on computational resources available during execution. Same code, different machine specs, different results.

2. Selector Fragility

UI tests locate elements using CSS selectors or XPath expressions. When a developer renames a class or restructures a component, selectors break. The application works fine. The test doesn’t.

This is the most common source of flakiness in E2E testing, and it’s built into the architecture of every traditional automation framework.

3. Environment Inconsistency

Tests written on a developer’s machine behave differently in CI. Browser versions differ. Screen resolutions vary. System resources fluctuate. Failures appear only in specific contexts, making them nearly impossible to reproduce locally.

4. Shared State Pollution

Tests that don’t properly clean up after themselves contaminate subsequent tests. Run them in isolation and they pass. Run the full suite and they fail. Test order shouldn’t matter, but it does.

These four causes create a constant stream of false failures. Left unchecked, they quietly erode your release velocity.

The Hidden Costs

Every engineering team knows flaky tests are a problem. Few have quantified what they actually cost. When you break it down, the damage compounds across four dimensions.

1. Compute Waste

If your pipeline takes 20 minutes and engineers re-run twice daily due to flaky failures, that’s 40 minutes of wasted compute per developer per day. For a team of 10, that’s nearly 7 hours of CI time. Every day. Use our test maintenance calculator to see what this costs your organization annually.
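
The back-of-envelope math, with the article's figures as inputs:

```python
# All inputs are the assumptions stated above, not measured data.
pipeline_minutes = 20
reruns_per_dev_per_day = 2
team_size = 10

wasted_per_dev_minutes = pipeline_minutes * reruns_per_dev_per_day  # 40 min/day
wasted_team_hours = wasted_per_dev_minutes * team_size / 60         # ~6.7 h/day
```

Swap in your own pipeline length and re-run rate to get your team's number.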

2. Context Switching

Every flaky failure pulls a developer out of their work. They stop coding, investigate the failure, determine it’s noise, re-run the pipeline, and try to get back to where they were. The mental context they’d built up is gone.

3. Trust Erosion

When tests fail randomly, engineers stop trusting the suite. Real bugs slip through because “it’s probably just flaky.” Dismissing failures becomes the default response, and actual regressions get lost in the noise.

4. Morale Drain

Nothing kills energy faster than debugging tests that aren’t actually broken. Engineers want to build features, not babysit infrastructure.

📊 The Hidden Cost

A 2025 empirical study found that 56% of software practitioners encounter flaky tests daily, weekly, or monthly. The same research cites industrial data showing developers spend 1.28% of their time repairing flaky tests, costing roughly $2,250 per developer per month.

Why Common Fixes Backfire

Faced with these costs, teams fight back. They add retries, quarantine bad actors, dedicate sprints to cleanup. Most of these fixes make the problem worse.

1. Quarantine Queues

Remove flaky tests from the critical path. Fix them later.

In practice, “later” never comes. The quarantine grows. Coverage shrinks. Eventually you’re running half your test suite and calling it good enough.

2. Increased Timeouts

Give everything more time. Maybe the flakiness goes away.

The fragility remains. Now your test suite takes 3x longer to run. You’ve traded one problem for another.

3. Retry Logic

Re-run failed tests automatically until they pass.

You’re paying for extra compute to run broken tests until they accidentally succeed. The underlying issues compound. With 1,000 tests at even 0.1% flake rate each, most of your PRs will still hit a flaky failure.
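
The arithmetic behind that claim: with independent tests, the probability that at least one flakes in a run is 1 − (1 − p)ⁿ.

```python
def p_any_flake(n_tests, flake_rate):
    """Probability that at least one of n independent tests flakes in a run."""
    return 1 - (1 - flake_rate) ** n_tests

p = p_any_flake(1000, 0.001)  # 1,000 tests, 0.1% flake rate each: ~63%
```

At 1,000 tests and a 0.1% per-test flake rate, roughly 63% of pipeline runs hit at least one flaky failure, assuming independence.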

4. Maintenance Sprints

Dedicate a sprint to “test hygiene.” Put engineers on fixing duty.

Tests keep breaking faster than humans can fix them. Engineers rotate onto flaky test duty, fix a dozen, and watch a dozen more appear next sprint. You’re bailing water from a sinking ship.

Curious how vision-based testing works?

Drop your staging URL. We'll show you tests that don't break on every deploy.

Book a Demo

No credit card required

Why Traditional Frameworks Create Flakiness

Retries, quarantines, and timeouts all treat flakiness as a test-level problem. The real issue runs deeper: how traditional frameworks identify elements on the page.

Selenium, Cypress, and Playwright locate elements using selectors like CSS paths, XPaths, test IDs, and data attributes. These selectors couple your tests directly to implementation details. The test doesn’t ask “is this button visible?” It asks “does div.container > button.primary-cta exist?”

When a developer renames a CSS class, restructures a component, updates a UI library, or refactors page layout, the selector breaks. The application works fine and users see the same button, but the test fails because the underlying HTML changed.
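
A toy model of the coupling problem. The "DOM" here is a list of dicts rather than a real browser, and both lookup functions are illustrative sketches, not any framework's API:

```python
# Before and after a refactor: the class was renamed, the button is unchanged.
before = [{"tag": "button", "class": "primary-cta", "text": "Submit", "visible": True}]
after  = [{"tag": "button", "class": "btn-submit",  "text": "Submit", "visible": True}]

def find_by_selector(dom, css_class):
    """Selector-style lookup: coupled to the class name."""
    return any(el["class"] == css_class for el in dom)

def find_by_appearance(dom, text):
    """Vision-style lookup: matches what the user actually sees."""
    return any(el["text"] == text and el["visible"] for el in dom)

selector_before = find_by_selector(before, "primary-cta")  # found
selector_after  = find_by_selector(after, "primary-cta")   # broken by the rename
visual_after    = find_by_appearance(after, "Submit")      # still found
```

The refactor didn't change what users see, yet the selector-based check fails while the appearance-based one keeps passing.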

Fast-shipping teams pay the highest price. Every deployment is a chance for UI code to diverge from test selectors. Teams deploying daily accumulate selector drift faster than teams deploying monthly, which means the most productive engineering orgs face the steepest flakiness tax. Our test maintenance cost calculator can help you put a number on it.

How Vision-Based Testing Eliminates Flakiness

If selectors are the problem, the solution is obvious: stop using them. Vision-based testing identifies elements the way humans do, by looking at the screen rather than parsing HTML.

Instead of searching for button#submit-form.primary-cta, a vision-based system finds “the blue Submit button in the bottom-right corner.” The button can be renamed, restyled, or moved. The test keeps working.

1. No Selector Dependencies

Tests don’t break when developers refactor components, rename classes, or update the DOM structure. The button still looks like a submit button, so the test passes. Your frontend team can ship design system updates, component library migrations, or full framework changes without touching a single test file.

2. Adaptive Waiting

Instead of fixed timeouts that guess how long operations take, vision-based systems wait until elements actually appear on screen. The test watches for the loading spinner to disappear and the content to render, exactly like a human would. No more arbitrary sleep(3000) calls that slow down fast environments and fail in slow ones.
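
The polling idea can be sketched generically. This is an illustrative helper, not any vendor's API, and the 50 ms "render delay" is made up:

```python
import time

def wait_until(condition, timeout_s=5.0, poll_s=0.01):
    """Poll until the condition holds instead of sleeping a fixed amount."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_s)
    return False

# Hypothetical page state: content "renders" 50 ms after load starts.
ready_at = time.monotonic() + 0.05
appeared = wait_until(lambda: time.monotonic() >= ready_at)
```

On a fast machine this returns almost immediately; on a slow one it simply waits longer, up to the timeout, instead of failing at an arbitrary cutoff.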

3. Environment Resilience

Tests evaluate what’s rendered, not how it’s implemented. Browser version differences, viewport variations, and backend latency fluctuations matter less when the test asks “can I see the checkout button?” rather than “does this XPath resolve?” The same test runs reliably on a developer laptop, in CI, and across staging environments.

4. Self-Healing by Default

When UI changes, the system adapts automatically. Button moved from the sidebar to the header? The test finds it in the new location. Icon replaced with text? Still recognized as a submit action. Self-healing test automation isn’t a feature bolted on after the fact. It’s the natural consequence of testing what users see instead of testing implementation details.

The Bottom Line

These four capabilities compound. When you remove selector dependencies, adaptive waiting becomes possible. When tests evaluate rendered output, environment resilience follows naturally. The result is a fundamentally different maintenance profile:

| Aspect | Selector-Based | Vision-Based |
| --- | --- | --- |
| Element identification | CSS/XPath selectors | Visual recognition |
| UI refactor impact | Tests break | Tests adapt |
| Framework migration | Rewrite entire suite | No changes needed |
| Weekly maintenance | 20+ hours | Near-zero |

What This Looks Like in Practice

Fi, the pet tech company behind GPS collars for dogs, hit the classic scaling wall. Their test suite couldn’t keep pace with release velocity. Flaky tests became a daily friction point.

After switching to vision-based testing with Pie:

  • Test creation dropped from days to hours
  • Maintenance burden fell by over 70%
  • Release validation went from 2-4 days to same-day cycles

📊 Customer Result

“Release validation went from two to three days to just a few hours. The way Pie set up allowed Fi to work alongside development without changing processes.” — Philip Hubert, Director of Mobile Engineering, Fi

Read the full case study →

Measure Your Flake Problem

Before fixing flakiness, you need to measure it. Most teams don’t track this systematically, which means they’re flying blind when prioritizing fixes. Here’s a framework to quantify your flake problem.

1. Flake Rate

What percentage of test failures are actual bugs vs. flaky failures? Track failed runs that pass on re-run without code changes.

How to measure: Flag every test failure. If the same test passes on re-run with zero code changes, mark it as flaky. Calculate: (flaky failures / total failures) × 100.

Thresholds: Below 5% is manageable. 5-10% needs dedicated attention. Above 10% is actively blocking your release velocity.
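
The calculation from the steps above, on a made-up failure log (the record fields are illustrative):

```python
# One record per failure, noting whether a re-run with zero code changes passed.
failures = [
    {"test": "checkout", "passed_on_rerun": True},   # flaky
    {"test": "login",    "passed_on_rerun": False},  # a real bug
    {"test": "search",   "passed_on_rerun": True},   # flaky
    {"test": "signup",   "passed_on_rerun": True},   # flaky
]

flaky = sum(1 for f in failures if f["passed_on_rerun"])
flake_rate = flaky / len(failures) * 100  # (flaky failures / total failures) x 100
```

Three of four failures vanished on re-run, giving a 75% flake rate, far past the 10% threshold.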

2. Re-run Frequency

How often do engineers retry failed pipelines? This is the clearest signal of how much flakiness disrupts daily work.

How to measure: Pull CI logs for the past 30 days. Count pipeline runs where the same commit was run multiple times. Divide by total commits merged.

Thresholds: One re-run per week is normal. Daily re-runs per engineer mean you’re burning significant time and compute on noise.
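
A sketch of the CI-log counting step, using made-up commit SHAs (in practice you'd pull these from your CI provider's API):

```python
from collections import Counter

# One entry per pipeline run, keyed by commit SHA.
runs = ["a1", "a1", "b2", "c3", "c3", "c3", "d4"]

per_commit = Counter(runs)
reruns = sum(count - 1 for count in per_commit.values())  # runs beyond the first
commits = len(per_commit)
reruns_per_commit = reruns / commits  # 3 re-runs across 4 commits = 0.75
```

Every run beyond the first for the same commit is a re-run; dividing by merged commits gives the frequency to track month over month.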

3. Investigation Time

How many hours weekly do engineers spend on failures that turn out to be flaky? This is often the largest hidden cost because it’s invisible in sprint tracking.

How to measure: Survey your team for one week. Ask them to log every test failure investigation and whether it turned out to be real or flaky. Most teams are shocked by the number.

4. Quarantine Size

How many tests are currently disabled, skipped, or marked as “flaky-allowed”? A growing quarantine signals architectural problems.

How to measure: Grep your test suite for skip annotations, disabled flags, or retry-until-pass wrappers. Track this number monthly.

What to watch: If quarantine grows faster than your test suite, you’re hiding flakiness rather than fixing it. Coverage is shrinking in disguise.
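
A minimal version of the grep step. The marker list is illustrative; extend it for your framework's skip annotations:

```python
import re

# Common skip/quarantine markers (illustrative, not exhaustive).
MARKERS = [
    r"@pytest\.mark\.skip",  # pytest
    r"@unittest\.skip",      # unittest
    r"\.skip\(",             # e.g. it.skip / describe.skip in JS runners
    r"retry_until_pass",     # hypothetical retry wrapper
]
pattern = re.compile("|".join(MARKERS))

def count_quarantined(sources):
    """Count marker occurrences across test file contents.

    Takes strings here for simplicity; in practice, read each file
    under your test directory and pass in its contents."""
    return sum(len(pattern.findall(src)) for src in sources)

sample = [
    "@pytest.mark.skip(reason='flaky')\ndef test_a(): ...",
    "def test_b(): ...",
    "it.skip('renders chart', () => {})",
]
quarantined = count_quarantined(sample)  # 2 of 3 files carry a skip marker
```

Run it monthly and chart the number next to total test count; the ratio is what matters.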

5. Coverage Trend

Is effective coverage growing, stable, or shrinking? If teams delete more tests than they add because maintenance hurts too much, coverage regresses even if test count stays flat.

How to measure: Track tests added vs. tests deleted per sprint. A healthy suite should grow with your codebase. If the ratio flips negative, maintenance burden is winning.

Our autonomous discovery breaks this tradeoff by expanding coverage without adding to maintenance load.

Fix the Architecture, Not the Symptoms

Flaky tests aren’t inevitable. They’re a symptom of test architecture that couples tests to implementation details.

The fix isn’t more maintenance, more retries, or more timeouts. It’s removing the coupling entirely.

Vision-based testing does this by shifting from “find this selector” to “find this element.” When tests evaluate what users actually see, they stop breaking every time developers touch the codebase.

Pie is an autonomous testing platform built on vision-based AI. If your team is ready to stop babysitting tests, book a demo. We’ll show you what selector-free testing looks like on your actual app.

See it in action

Watch AI agents test your app the way users actually use it. No scripts, no selectors.

Book a Demo

SOC 2 Type II certified · No source code access

Frequently Asked Questions

What is a flaky test?

A flaky test produces different results without code changes. Same test, same code, different outcomes. Root causes include timing dependencies, selector fragility, environment inconsistencies, and shared state pollution.

How do I measure my flake rate?

Track test failures that pass on re-run without code changes. Divide by total failures. Anything above 5% needs attention. Above 10% is actively hurting your release velocity.

Why don't retries fix flaky tests?

Retries mask the problem without solving it. You're paying for compute to run broken tests until they accidentally pass. The underlying fragility remains and gets worse as your suite grows.

Why do selectors make tests flaky?

Selectors couple tests to implementation details. When a developer renames a CSS class or restructures a component, selectors break even though the app works fine. The test is testing the code structure, not the behavior.

How is vision-based testing different from Selenium?

Selenium finds elements by code attributes like CSS selectors and XPaths. Vision-based testing finds elements by how they look and behave, the same way a user would. When UI changes, Selenium breaks. Vision-based tests adapt.

Does vision-based testing work with any tech stack?

Yes. It operates at the rendered UI layer, not the code layer. React, Vue, Angular, Rails, Django, legacy jQuery apps. If users can see it and interact with it, vision-based testing can test it.

Does Pie need access to our source code?

No. Pie tests at the UI layer. Your codebase stays on your systems. We only interact with what's rendered on screen, the same way your users do.

How quickly can we get test coverage?

Most teams see 80% coverage within the first hour. AI agents explore your app autonomously and generate tests for the flows they discover. No scripting required.


Dhaval Shreyas
Co-founder & CEO at Pie

13 years building mobile infrastructure at Square, Facebook, and Instacart. Payment systems, video platforms, the works. Now building the QA platform he wished existed the whole time. LinkedIn →