
How to Fix Flaky Tests: A Practical Guide for QA Teams

Stop letting flaky tests block your deploys. Learn the triage framework, code patterns, and team practices that got Slack and GitHub to single-digit flake rates.

Dhaval Shreyas
Co-founder & CEO at Pie
10 min read

What you’ll learn

  • A triage framework for prioritizing flaky test fixes
  • Code patterns that eliminate timing and state issues
  • When to quarantine vs. delete vs. fix immediately
  • Team workflows that prevent flaky test accumulation

Slack reduced its flaky test failure rate from 56.76% to 3.85%. GitHub cut flaky-related build failures by 18x. Neither company achieved these numbers by ignoring the problem or piling on retries. They built systematic approaches to identifying, fixing, and preventing flakiness.

You don’t need a dedicated infrastructure team to get there. The patterns that work are well-documented and implementable this week.

The goal isn’t zero flaky tests. That’s unrealistic in any sufficiently complex codebase. The goal is a test suite your team trusts enough to act on. When a test fails, engineers should investigate the code, not the test infrastructure.

The Triage Framework

Your team has limited time and a growing backlog of flaky tests. Fixing them all with equal urgency is a recipe for burnout. The tests blocking your checkout flow deserve immediate attention. The tests failing intermittently on a deprecated admin page can wait. Triage first.

Severity Matrix

| Impact | Frequency | Action |
| --- | --- | --- |
| Critical flow (checkout, auth) | High (>10% failure rate) | Fix immediately, block deployment |
| Critical flow | Low (<5% failure rate) | Fix within sprint |
| Non-critical | High | Quarantine, schedule fix |
| Non-critical | Low | Quarantine, review monthly |

Data Collection First

Before triage, you need data. Track these metrics for every test:

  • Failure rate: Failures / total runs over 30 days
  • Last code change: When was the test or its target code modified?
  • Coverage scope: What user flows does this test protect?
  • Fix complexity: Estimated hours based on root cause

Without this data, you’re guessing. Most CI platforms provide historical failure rates. If yours doesn’t, instrument it.
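The failure-rate metric maps straight onto the severity matrix. Here's a minimal sketch, assuming run history arrives as an array of `{ passed }` records and that you tag critical flows yourself; the matrix leaves the 5-10% band to judgment, so this sketch splits at 10%:

```javascript
// Bucket a test using the severity matrix.
// `runs`: array of { passed: boolean } from the last 30 days (assumed shape).
// `isCriticalFlow`: your own tagging of checkout/auth/etc. flows.
function triage(runs, isCriticalFlow) {
  const failures = runs.filter(r => !r.passed).length;
  const failureRate = failures / runs.length;

  if (isCriticalFlow) {
    return failureRate > 0.10
      ? 'fix immediately, block deployment'
      : 'fix within sprint';
  }
  return failureRate > 0.10
    ? 'quarantine, schedule fix'
    : 'quarantine, review monthly';
}
```

Run this over every test in your history export and you have a prioritized worklist instead of a vague backlog.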

Fixing Async Wait Issues

Async waits are the most common source of flaky tests. Research from ICSE 2021 found that in UI-based test suites, nearly half of all flakiness stems from tests that don’t properly wait for asynchronous operations. The fix is always the same pattern: replace implicit timing with explicit conditions.

The Problem Pattern

// Flaky: assumes 2 seconds is enough
await page.click('#submit');
await sleep(2000);
const text = await page.textContent('#result');
expect(text).toBe('Order confirmed');

This fails when:

  • Server response takes 2.5 seconds under load
  • Animation delays element visibility
  • Network latency varies between environments

The Fix Pattern

Instead of guessing how long to wait, tell the test exactly what condition must be true before proceeding. The waitForSelector call below doesn’t continue until the result element is actually visible in the DOM. No timing assumptions. No guessing. The test waits for reality to match expectations.

// Stable: waits for actual condition
await page.click('#submit');
await page.waitForSelector('#result', { state: 'visible' });
const text = await page.textContent('#result');
expect(text).toBe('Order confirmed');

Advanced: Compound Conditions

Some operations require multiple things to happen in sequence. A form submission might need both the API response to return AND the UI to update. Using Promise.all lets you wait for the network response while the click triggers the request. The test only proceeds once both the API call completes and the confirmation text appears.

// Wait for loading to complete AND element to appear
await Promise.all([
  page.waitForResponse(resp => resp.url().includes('/api/orders')),
  page.click('#submit')
]);
await page.waitForSelector('#result:has-text("confirmed")');

Framework-Specific Patterns

Each test framework has its own idioms for stable async handling. For Playwright users dealing with more complex flakiness scenarios, we’ve written a dedicated Playwright flaky tests guide.

Playwright builds assertions directly into its expect API. The toHaveText assertion automatically retries until the condition is met or the timeout expires. No explicit wait call needed.

await expect(page.locator('#result')).toHaveText('Order confirmed', {
  timeout: 10000
});

Cypress takes a similar approach with its built-in retry mechanism. The should assertion keeps checking until the element contains the expected text or 10 seconds pass.

cy.get('#result', { timeout: 10000 }).should('contain', 'Order confirmed');

Selenium requires more explicit wait construction through WebDriverWait. You define the condition using Expected Conditions, and the driver polls until that condition returns true.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

WebDriverWait(driver, 10).until(
    EC.text_to_be_present_in_element((By.ID, 'result'), 'Order confirmed')
)

Async Waits and Self-Healing

Self-healing test automation handles timing variations automatically by observing actual element states rather than relying on hardcoded waits. This eliminates the most common source of flakiness without manual intervention.

Eliminating Shared State

When Test A writes to a database and Test B reads from it, you’ve created an invisible dependency. Test B passes when it runs after A, fails when it runs before. Multiply this across hundreds of tests and you get a suite where order matters more than code quality. The path out is complete test isolation.

Database Isolation

Option 1: Transaction Rollback

Wrap each test in a transaction, rollback after. Every INSERT, UPDATE, and DELETE gets undone when the test finishes. The next test starts with a pristine database.

import pytest

@pytest.fixture(autouse=True)
def transactional_test(db_connection):
    transaction = db_connection.begin()
    yield
    transaction.rollback()

This approach is fast because there’s no actual data to clean up. The rollback is instant. The limitation is that it doesn’t work when your test code opens multiple database connections, since each connection has its own transaction scope.

Option 2: Unique Data Per Test

Generate unique identifiers for all test data. Each test creates its own users, orders, and records with UUIDs or timestamps baked into the identifiers. Tests never collide because they’re operating on completely separate data.

const testId = `test_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`;
const user = await createUser({ email: `${testId}@test.com` });

This works with any architecture, including microservices and distributed systems. The tradeoff is that test data accumulates over time. You’ll need a periodic cleanup job to purge old test records, or your test database grows indefinitely.
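That cleanup job can be simple. A sketch, assuming test records carry the `test_` prefix in their email and a `createdAt` timestamp as in the snippet above; the `db` client with `findUsers`/`deleteUser` methods is a placeholder for whatever your stack uses:

```javascript
// Purge test records older than 30 days. The record shape and db
// client API are assumptions; adapt to your own data layer.
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

function isStaleTestRecord(record, now = Date.now()) {
  return record.email.startsWith('test_')
    && now - record.createdAt.getTime() > THIRTY_DAYS_MS;
}

async function purgeStaleTestData(db, now = Date.now()) {
  const users = await db.findUsers();                      // placeholder API
  const stale = users.filter(u => isStaleTestRecord(u, now));
  await Promise.all(stale.map(u => db.deleteUser(u.id)));  // placeholder API
  return stale.length; // how many records were purged
}
```

Schedule it nightly and the unique-data approach stays sustainable indefinitely.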

Option 3: Dedicated Database Per Test Suite

Spin up isolated databases for parallel test execution. Each test worker gets its own Postgres instance. Zero possibility of cross-contamination.

# docker-compose.test.yml
services:
  db-suite-1:
    image: postgres:14
    environment:
      POSTGRES_DB: test_suite_1
  db-suite-2:
    image: postgres:14
    environment:
      POSTGRES_DB: test_suite_2

True isolation comes at a cost. You need infrastructure to spin up and tear down databases, and each instance takes time to initialize. For large test suites running in parallel, the isolation is worth the setup overhead.

Cache Isolation

Redis, Memcached, in-memory caches. They all persist data between test runs unless you explicitly clear them. A test that passes locally fails in CI because someone else’s test populated the cache with stale data.

Option 1: Flush before each test

The nuclear option. Wipe everything before each test starts. Simple and reliable, but slow if your cache is large.

beforeEach(async () => {
  await redis.flushall();
  await memcached.flush();
});

Option 2: Test-specific key prefixes

Each test worker gets its own namespace. Tests running in parallel never read each other’s cached values because they’re looking at different keys entirely.

const cachePrefix = `test_${process.env.TEST_WORKER_ID}_`;
cache.get(`${cachePrefix}${key}`);

This approach is faster than flushing because you’re not clearing data that doesn’t affect your test. The tradeoff is that you need to consistently apply the prefix across all cache operations in your test code.
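One way to make the prefix impossible to forget is a thin wrapper that applies it on every operation. A sketch, assuming an underlying `cache` client with `get`/`set`/`del` methods (the method names are assumptions about your client):

```javascript
// Wrap a cache client so every key is transparently namespaced to the
// current worker. Tests can't collide because they can't skip the prefix.
function prefixedCache(cache, workerId) {
  const prefix = `test_${workerId}_`;
  return {
    get: key => cache.get(prefix + key),
    set: (key, value) => cache.set(prefix + key, value),
    del: key => cache.del(prefix + key),
  };
}
```

Hand tests this wrapper instead of the raw client and the namespace is enforced by construction rather than by convention.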

Browser State Isolation

Login sessions, shopping carts, form data. Browser state accumulates across tests unless you explicitly reset it. A test that expects a logged-out state fails because a previous test logged in and never logged out.

Creating a fresh browser context for each test gives you a clean slate. No cookies, no localStorage, no session data carried over from previous tests. The newContext() call below creates an isolated browser environment that behaves like a fresh incognito window.

// Playwright
test.beforeEach(async ({ browser }) => {
  const context = await browser.newContext(); // no cookies, storage, or session
  const page = await context.newPage();
  // Drive the test through this page; close the context in afterEach
});

Stop Chasing Flaky Test Failures

See how vision-based testing eliminates timing-based failures automatically.

Book a Demo

Breaking Order Dependencies

Run your test suite twice with different orderings. If results differ, you have order dependencies. These are some of the trickiest flaky tests to debug because they only surface when test execution order changes, which happens unpredictably in parallel CI runs.

Detection

Order dependencies surface frequently in suites that grew organically without strict isolation. The tests passed for years because they happened to run in alphabetical order. Then someone adds parallelization and suddenly half the suite fails.

Randomizing test execution order forces these hidden dependencies to surface. When Test B fails only when it runs before Test A, you’ve found an order dependency. The flags below tell each framework to shuffle the test order using a seed value you can reproduce later for debugging.

# Pytest (requires the pytest-randomly plugin)
pytest --randomly-seed=12345

# Jest
jest --randomize

# RSpec
rspec --order random

Run your full suite three times with different seeds. If any test fails inconsistently, investigate whether it depends on state created by another test.
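Seeded randomization only pays off if a failing order can be replayed exactly. The sketch below shows the idea behind those flags: a tiny seeded PRNG (mulberry32, a well-known public-domain generator, standing in for whatever your runner uses) driving a Fisher-Yates shuffle over test names:

```javascript
// Deterministic PRNG: the same seed always yields the same sequence.
function mulberry32(seed) {
  return function () {
    seed |= 0; seed = (seed + 0x6D2B79F5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle using the seeded PRNG, so a failing order can
// be reproduced locally from the seed printed in CI.
function shuffledOrder(tests, seed) {
  const rand = mulberry32(seed);
  const order = [...tests];
  for (let i = order.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [order[i], order[j]] = [order[j], order[i]];
  }
  return order;
}
```

When a seeded run fails, rerun locally with the same seed to get the identical ordering, then bisect which earlier test poisons the state.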

Common Patterns and Fixes

Pattern 1: Test A creates data, Test B expects it

This is the most common order dependency. Test A creates a database record. Test B queries for that exact record by ID. When A runs first, B passes. When B runs first, it fails because the record doesn’t exist yet. The fix is making each test responsible for creating and cleaning up its own data.

// Before (dependent)
test('A: create user', async () => {
  await createUser({ id: 1, name: 'Test' });
});

test('B: fetch user', async () => {
  const user = await getUser(1); // Assumes A ran first
  expect(user.name).toBe('Test');
});

// After (independent)
test('A: create user', async () => {
  await createUser({ id: 1, name: 'Test' });
  // Cleanup
  await deleteUser(1);
});

test('B: fetch user', async () => {
  // Create its own data
  await createUser({ id: 2, name: 'Test' });
  const user = await getUser(2);
  expect(user.name).toBe('Test');
  await deleteUser(2);
});

Pattern 2: Global state modification

Feature flags, configuration objects, singleton instances. When one test modifies global state and doesn’t restore it, every subsequent test runs in a polluted environment. The fix is capturing the original state before each test and restoring it afterward, regardless of whether the test passes or fails.

// Before (modifies global)
test('A: enable feature', () => {
  featureFlags.enable('new_checkout');
  // Test with flag enabled
});

test('B: expects default state', () => {
  // Fails if A ran first and didn't reset
  expect(featureFlags.isEnabled('new_checkout')).toBe(false);
});

// After (isolated)
let originalFlags;

beforeEach(() => {
  originalFlags = featureFlags.getAll();
});

afterEach(() => {
  featureFlags.setAll(originalFlags);
});

Quarantine Strategy

Not every flaky test can be fixed immediately. Quarantine provides a middle ground: don’t block CI, but don’t lose the test coverage intent.

Implementation

Step 1: Tag flaky tests

Mark tests as flaky in your code so tooling can identify them. A test.skip with the reason in the title tells your test runner, and anyone reading the suite, that this test is known to be unreliable. Some teams prefer moving flaky tests to a separate directory entirely, which makes filtering easier in CI.

// Playwright: test.skip(title, body) declares a skipped test
test.skip('known flaky: checkout flow', async () => {
  // Test code
});

The directory approach gives you physical separation. Everything in stable/ runs in the blocking CI job. Everything in quarantine/ runs separately.

tests/
  stable/
  quarantine/

Step 2: Run quarantined tests separately

Configure your CI pipeline to run stable and quarantined tests as separate jobs. The stable job blocks merges on failure. The quarantine job runs the same tests but with allow_failure: true, so flaky failures don’t block deploys while you work on fixes.

# CI pipeline
jobs:
  stable-tests:
    script: npm test -- --ignore 'quarantine/**'

  quarantine-tests:
    script: npm test -- quarantine/**
    allow_failure: true

Step 3: Monitor quarantine metrics

Quarantine without visibility becomes a dumping ground. Track these numbers weekly and display them on a dashboard your team sees daily.

  • Number of tests in quarantine - Is it growing or shrinking?
  • Days in quarantine per test - Which tests have been stuck the longest?
  • Quarantine entry/exit rate - Are you fixing faster than you’re adding?

Step 4: Enforce quarantine limits

Set a policy: tests in quarantine for more than 30 days must be fixed or deleted. No exceptions. Schedule weekly quarantine review where someone is accountable for reviewing each quarantined test and either committing to a fix date or deleting it.
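The 30-day limit is easy to automate. A sketch, assuming you keep a quarantine manifest as an array of `{ name, quarantinedAt }` entries (the manifest shape is an assumption; adapt it to however you tag tests):

```javascript
// Return quarantined tests past the deadline, so a weekly CI job can
// print the offenders and fail until someone fixes or deletes them.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function overdueQuarantined(manifest, maxDays = 30, now = Date.now()) {
  return manifest.filter(
    e => (now - e.quarantinedAt.getTime()) / MS_PER_DAY > maxDays
  );
}
```

In the review job: if `overdueQuarantined(manifest).length > 0`, log the names and exit nonzero. That turns the policy from a wiki page into an enforced constraint.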

Preventing Future Flakiness

Every flaky test you fix today could have been prevented yesterday. The patterns that produce flakiness are predictable. Hardcoded sleeps, shared database records, implicit dependencies on test execution order. Baking prevention into your test architecture costs less than debugging intermittent failures at 2 AM.

Test Design Principles

Principle 1: No implicit waits

Ban sleep() calls at the linter level. The ESLint rules below flag any test that uses an empty waitFor callback or queries elements with getBy instead of findBy. These rules turn implicit timing assumptions into CI failures before they become flaky tests in production.

// eslint-plugin-testing-library enforces this
rules: {
  'testing-library/no-wait-for-empty-callback': 'error',
  'testing-library/prefer-find-by': 'error'
}

Principle 2: Factory functions for data

Every test should create its own data using factory functions that generate unique identifiers. The factory below creates users with UUID-based emails so two tests can never collide on the same record. Overrides let individual tests customize specific fields while keeping everything else random.

// test-utils/factories.js
import { v4 as uuid } from 'uuid';

export function createTestUser(overrides = {}) {
  return {
    id: uuid(),
    email: `${uuid()}@test.example`,
    createdAt: new Date(),
    ...overrides
  };
}

Principle 3: Environment parity

Tests that pass locally but fail in CI usually rely on environmental differences. Running your test database and cache in Docker makes your local environment match CI exactly. The compose file below spins up the same Postgres and Redis versions your CI uses.

services:
  app:
    build: .
    depends_on:
      - db
      - redis
  db:
    image: postgres:14
  redis:
    image: redis:7

CI Pipeline Hardening

These patterns focus on build-time prevention. For comprehensive CI/CD pipeline strategies including quarantine workflows and parallel test execution, see our flaky tests in CI/CD guide.

Run new tests multiple times before merge

Catch flaky tests before they merge by running new or changed tests multiple times in CI. If a test passes once but fails on runs two through five, you’ve caught flakiness before it pollutes your main branch. The workflow below runs only the tests affected by the PR, five times in a row.

on: pull_request

jobs:
  stability-check:
    steps:
      - name: Run new tests 5 times
        run: |
          for i in {1..5}; do
            npm test -- --changed
          done

Block merge on flaky detection

If your test framework outputs a flaky rate metric, gate merges on it. The script below reads the flaky rate from your test results and fails the build if more than 5% of test runs were inconsistent. This prevents PRs that introduce new flakiness from merging.

- name: Check flaky rate
  run: |
    if [[ $(cat test-results/flaky-rate.txt) -gt 5 ]]; then
      echo "Flaky rate above 5%, failing build"
      exit 1
    fi

Team Practices That Help

You can have perfect test architecture and still accumulate flaky tests if your team treats them as someone else’s problem. The engineering teams that maintain healthy test suites share common practices. They make flakiness visible, assign clear ownership, and treat test reliability as a first-class concern.

1. The “Boy Scout Rule” for Tests

Leave the test suite cleaner than you found it. If you encounter a flaky test during unrelated work:

  • Spend 15 minutes investigating
  • If fixable quickly, fix it
  • If not, document what you learned and add to the backlog

2. Flaky Test Friday

Dedicate one hour weekly to flaky test triage. Rotate ownership. Track:

  • Tests fixed
  • Tests quarantined
  • Tests deleted

Make it a game. Celebrate reductions.

3. Blame-Free Flaky Test Creation

Flaky tests happen. Don’t penalize engineers for creating them. Penalize for leaving them unfixed. The CI failure that identifies flakiness is doing its job.

Measuring Progress

You can’t improve what you don’t measure. Without metrics, flaky test discussions become finger-pointing sessions. With metrics, they become engineering conversations about trade-offs and priorities. Track these monthly and display them where your team can see them.

| Metric | Target | How to Measure |
| --- | --- | --- |
| Flaky test rate | <2% | Failures without code change / total runs |
| Quarantine size | <5% of suite | Quarantined tests / total tests |
| Mean time to fix | <48 hours | Time from flaky detection to merge |
| CI false positive rate | <1% | Builds failed for non-code reasons |

Put these on a dashboard your team sees every day. When the numbers are visible, they become part of the conversation.
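The flaky test rate is the one most teams get wrong, so here's the calculation spelled out. A sketch, assuming your CI history exposes run records shaped like `{ failed, codeChanged }` (the record shape is an assumption about your CI API):

```javascript
// Flaky test rate: failures that happened WITHOUT a code change,
// divided by total runs. Failures after a code change are candidate
// real bugs and are excluded from the numerator.
function flakyTestRate(runs) {
  const flakyFailures = runs.filter(r => r.failed && !r.codeChanged).length;
  return flakyFailures / runs.length;
}
```

A rate above your 2% target on any single test is the trigger to pull it into the triage matrix.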

Real Results

Teams using autonomous testing platforms report 60-80% reduction in test maintenance burden. The architecture eliminates the manual timing adjustments that cause most flakiness by design.

Build the Infrastructure, Then the Tests

Every flaky test that blocks a deploy costs more than the time to investigate it. It costs the trust that makes continuous deployment work. When engineers stop trusting test failures, they start ignoring them. That’s when real bugs ship.

The patterns in this guide work. Explicit waits eliminate timing randomness. Isolation removes state interference. Quarantine keeps flaky tests from blocking real work while you fix them. These aren’t novel techniques. They’re the same approaches that got Slack and GitHub to single-digit flake rates.

But fixing individual tests only solves today’s problem. The underlying architecture that produced those tests will produce more. Real stability comes from making flakiness structurally harder to introduce: factory functions that generate unique data, CI gates that catch instability before merge, team practices that treat flaky tests as defects rather than annoyances.

The investment is real. The payoff is a test suite that serves its purpose. When tests fail, engineers investigate the code, not the infrastructure. That’s when your test suite becomes the safety net it was supposed to be.

Less Maintenance. More Shipping.

See how teams are making the shift to zero-maintenance testing.

Book a Demo

Frequently Asked Questions

How long does it take to fix a flaky test?

Simple timing issues take 30 minutes. Shared state problems take 2-4 hours. Environmental dependencies can take days. Triage first: fix the quick wins, quarantine the complex ones.

Should we fix flaky tests as we find them or batch the work?

Neither exclusively. Fix high-impact flaky tests immediately when found (they're blocking CI). Schedule quarterly cleanup days for accumulated technical debt. Waiting for 'later' means never.

Is fixing flaky tests worth the investment?

Slack reduced flaky test failures from 56.76% to 3.85%. GitHub cut flaky-related build failures by 18x. The ROI is not just time saved. It is trust restored in your entire test suite.

Can tooling fix flaky tests automatically?

Yes. Self-healing test platforms handle timing variations automatically. But you still need proper architecture. No automation fixes tests that share mutable state without isolation.

How do I tell a flaky test from a real bug?

Run the test 10 times without changing code. If it fails inconsistently, it's flaky. If it fails consistently, it's a bug. Also check if the failure correlates with unrelated code changes. Real bugs fail after specific commits.

How do we keep flaky tests from blocking deploys?

Quarantine flaky tests to a separate job that doesn't block merges. Track quarantine size weekly. Set a policy that tests must be fixed or deleted within 30 days of quarantine. Never just add retries and ignore the root cause.

Why do tests pass locally but fail in CI?

CI environments have fewer resources, different timing, and no cached state from previous runs. Tests that rely on fast responses, specific database records, or browser cookies often pass locally but fail in CI. The fix is proper test isolation and explicit waits.

Which flaky tests should we fix first?

Use a severity matrix. Fix tests covering critical flows (checkout, auth) with high failure rates immediately. Quarantine non-critical tests with low failure rates. Delete tests that have been quarantined for months with no plan to fix.


Dhaval Shreyas
Co-founder & CEO at Pie

13 years building mobile infrastructure at Square, Facebook, and Instacart. Payment systems, video platforms, the works. Now building the QA platform he wished existed the whole time. LinkedIn →