How to Fix Flaky Tests in CI/CD: Detection, Quarantine, and Pipeline Hardening
Flaky tests caused 21% of Atlassian's master branch failures. Learn pipeline-specific detection, smart retry strategies, and hardening patterns that restore trust.
What you’ll learn
- How flaky tests compound CI/CD costs beyond build time
- Pipeline patterns that detect flakiness before merge
- Retry and quarantine strategies specific to CI/CD
- Metrics that make flaky test costs visible to leadership
Your CI pipeline should be a confidence machine. Code passes tests, gets deployed, ships to users. Simple.
Instead, it’s a lottery. The same commit fails on the first run, passes on the second. Engineers learn to re-run builds reflexively. “It’s probably flaky” becomes the default response to red builds. And that’s when real bugs start slipping through.
Atlassian’s engineering team quantified what this costs: flaky tests caused 21% of their master branch build failures and wasted an estimated 150,000 developer hours annually. Google’s research found similar patterns: 16% of their tests exhibit flaky behavior, and 84% of pass-to-fail transitions come from flakiness rather than real bugs.
This guide focuses specifically on the CI/CD context. You’ll learn how to detect flakiness before it enters your pipeline, implement retry strategies that don’t mask real failures, and harden your pipeline against the slow creep of test rot.
The CI/CD Impact
Flaky tests cost more in CI/CD than anywhere else. The damage compounds across two dimensions. Direct resource costs are easy to measure. Trust erosion is harder to quantify but more expensive.
Direct Costs
- Build time waste — Each flaky test run consumes compute resources. A 30-minute test suite with a 10% flaky failure rate means roughly 3 minutes of wasted time per build, multiplied across dozens of daily builds.
- Retry multiplication — When your policy is “re-run failed tests,” each flaky test doubles its execution cost. Some teams report 20-30% of CI compute spent on retries alone.
- Queue delays — Failed builds block deployment queues. While engineers investigate or wait for re-runs, other commits pile up. Merge conflicts increase. Context switching costs compound.
Indirect Costs
- Trust erosion — When engineers expect false positives, they stop investigating failures. Real bugs slip through because “it’s probably flaky” becomes the default assumption.
- Alert fatigue — Teams disable notifications for known-flaky tests. Then they disable notifications for tests near known-flaky tests. Eventually, real failures get lost in the noise.
- Velocity drag — Flaky tests consistently rank among the top causes of developer “bad days” in productivity research. Engineers spend time debugging infrastructure instead of shipping features.
A CI pipeline that lies 10% of the time is not 90% trustworthy. It’s 0% trustworthy. Engineers can’t afford to investigate every failure, so they investigate none. That’s when real bugs ship.
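The compounding is easy to see with a toy calculation (the numbers below are illustrative, not from the studies cited above):

```javascript
// Probability that a suite run shows at least one false failure, given
// N independent tests that each flake with probability p on any run.
function falseFailureRate(testCount, perTestFlakeRate) {
  return 1 - Math.pow(1 - perTestFlakeRate, testCount);
}

// 500 tests, each "only" 0.1% flaky: roughly 39% of builds fail for no reason
console.log(falseFailureRate(500, 0.001));
```

Independence between tests is a simplification, but the direction holds: per-test flake rates that look negligible compound into suite-level noise.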
Three Ways to Detect Flaky Tests Before They Merge
You can’t fix flakiness you don’t measure. These three detection patterns work well together, catching flaky tests at different stages of their lifecycle.
1. Repeated Execution on New Tests
Catch flakiness before merge by running new or modified tests multiple times. If any run fails, that test deserves investigation before it enters your main branch.
```yaml
# GitHub Actions example
jobs:
  stability-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history, so origin/main is available to diff against
      - name: Get changed test files
        id: changed
        run: |
          echo "files=$(git diff --name-only origin/main...HEAD | grep '\.test\.' | tr '\n' ' ')" >> $GITHUB_OUTPUT
      - name: Run changed tests 5 times
        if: steps.changed.outputs.files != ''
        run: |
          for i in {1..5}; do
            echo "Run $i of 5"
            npm test -- ${{ steps.changed.outputs.files }}
          done
```
Any failure across runs indicates potential flakiness worth investigating before merge.
2. Historical Failure Tracking
Track pass/fail rates across commits. Most CI platforms provide this data, and the pattern reveals which tests drift into flakiness over time. The insight isn’t just which tests are flaky now, but which tests are becoming flaky. A test that passed 100% last month but is at 95% this week is a leading indicator worth investigating.
```shell
# CircleCI Insights API: top flaky tests for a project
# (replace gh/ORG/REPO with your project slug; field names may vary by API version)
curl -s -H "Circle-Token: $CIRCLECI_TOKEN" \
  "https://circleci.com/api/v2/insights/gh/ORG/REPO/flaky-tests" \
  | jq '."flaky-tests" | sort_by(."times-flaked") | reverse | .[:10]'
```

This query surfaces your top offenders by how often they’ve flaked recently. Run it weekly to spot trends before they become emergencies.
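A sketch of that drift check, assuming you persist per-test pass rates for two measurement windows (the storage shape here is hypothetical):

```javascript
// Flag tests whose pass rate dropped between two windows, even if the
// current rate still looks "mostly green" (e.g. 1.00 -> 0.95).
function driftingTests(lastMonth, thisWeek, dropThreshold = 0.03) {
  return Object.keys(thisWeek).filter((name) => {
    const before = lastMonth[name];
    return before !== undefined && before - thisWeek[name] >= dropThreshold;
  });
}

const lastMonth = { 'checkout calculates tax': 1.0, 'tooltip fades': 0.97 };
const thisWeek = { 'checkout calculates tax': 0.95, 'tooltip fades': 0.97 };
console.log(driftingTests(lastMonth, thisWeek)); // ['checkout calculates tax']
```

Tests that are new this week are skipped; they belong to the pre-merge stability check instead.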
Build alerts for tests crossing flakiness thresholds:

```yaml
- name: Check flaky rate
  run: |
    FLAKY_RATE=$(jq '.flakyRate' test-results/flaky-report.json)
    if (( $(echo "$FLAKY_RATE > 0.05" | bc -l) )); then
      echo "::warning::Flaky rate ${FLAKY_RATE} exceeds 5% threshold"
    fi
```
3. Pass-on-Retry Flagging
When a test fails then passes on retry, flag it. This pattern catches tests that weren’t flaky when introduced but became flaky over time.
```javascript
// jest-circus custom reporter; assumes retries are enabled via jest.retryTimes()
class FlakyReporter {
  onTestCaseResult(test, testCaseResult) {
    // Passed, but only after at least one retry: the pass-on-retry signature
    if (testCaseResult.status === 'passed' && testCaseResult.invocations > 1) {
      this.flagFlaky(test.path, testCaseResult.fullName);
    }
  }
  flagFlaky(path, name) {
    // Persist wherever your flake tracking lives (file, DB, ticket system)
    console.log(`FLAKY: ${name} (${path})`);
  }
}
module.exports = FlakyReporter;
```
Accumulate these flags. Tests repeatedly flagged across commits are systematically flaky.
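A minimal accumulator for those flags, assuming each pass-on-retry event is recorded with the test name and commit (the record shape is made up for illustration):

```javascript
// Tests flagged on several distinct commits are systematically flaky,
// not one-off environmental noise.
function systematicallyFlaky(flags, minDistinctCommits = 3) {
  const commitsByTest = new Map();
  for (const { testName, commit } of flags) {
    if (!commitsByTest.has(testName)) commitsByTest.set(testName, new Set());
    commitsByTest.get(testName).add(commit);
  }
  return [...commitsByTest]
    .filter(([, commits]) => commits.size >= minDistinctCommits)
    .map(([testName]) => testName);
}
```

Counting distinct commits rather than raw flags avoids over-weighting a single bad day on one branch.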
Four Retry Strategies That Reduce False Positives
Detection tells you which tests are flaky. Now you need a response. Retries are a double-edged sword. Used wrong, they mask real failures. Used right, they reduce false positives.
1. Immediate Retry (Same Job)
Re-run failed tests within the same CI job. This is the simplest approach:
```javascript
// Jest: enable retries via jest-circus, e.g. in a setup file —
// there is no "retries" key in jest.config
jest.retryTimes(2);
```

```shell
# Pytest: requires the pytest-rerunfailures plugin
pytest --reruns 2 --reruns-delay 1
```
Pros:
- Fast execution, no extra job overhead
- Simple configuration
Cons:
- Same environment conditions persist
- If flakiness is environmental, retry won’t help
2. Delayed Retry (New Job)
When same-environment retries don’t help, spin up a fresh environment:
```yaml
jobs:
  test:
    # ... test steps ...
  retry-failed:
    needs: test
    if: ${{ failure() }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Re-run failed tests only
        # Jest's --onlyFailures reads its cache; share the cache directory
        # between jobs (e.g. with actions/cache) for this to work cross-job
        run: npm test -- --onlyFailures
```
Pros:
- Fresh environment isolates environmental flakiness
- More reliable signal
Cons:
- Longer feedback loop
- More compute cost
3. The Three-Strike Rule
A practical balance: retry twice before declaring failure.
```yaml
# GitLab CI: retry the job on failure, three attempts total
test:
  retry:
    max: 2
    when: script_failure
  script:
    - npm test
```
Why three attempts total?
- First failure: Could be real or flaky
- Second failure: Less likely to be coincidence
- Third failure: Almost certainly a real issue
More than three retries hides real problems. If a test needs four attempts to pass, fix it. Don’t retry it.
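The arithmetic behind the rule, under the simplifying assumption that flaky failures are independent with a fixed per-run probability:

```javascript
// Chance that flakiness alone produces k consecutive failures.
function consecutiveFailureProb(flakyFailRate, attempts) {
  return Math.pow(flakyFailRate, attempts);
}

// A test that falsely fails 10% of runs:
consecutiveFailureProb(0.1, 1); // 0.1     — one red run is weak evidence
consecutiveFailureProb(0.1, 3); // ≈ 0.001 — three in a row is almost certainly real
```

Past three attempts the extra certainty is negligible, while the odds of papering over a genuine intermittent bug keep growing.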
4. Selective Retry
Not all tests deserve retries. Retrying a critical path test masks real failures. Retrying a fragile animation test saves investigation time. The difference matters.
Apply retries based on test characteristics, not blanket policies:
```javascript
// Tag tests by priority (Cypress-style per-suite retry configuration)
describe('Checkout flow', { retries: 0 }, () => {
  // Critical tests fail immediately
});

describe('Tooltip animations', { retries: 2 }, () => {
  // Known-fragile tests get retries
});
```
This distinction preserves signal quality where it matters most. A flaky checkout test should trigger immediate investigation. A flaky tooltip test can retry twice before escalating.
When to Quarantine Flaky Tests
Retries help, but some tests are too flaky to remain in the critical path. Quarantine removes them from blocking deployments while preserving visibility for repair.
1. Tag Flaky Tests
Tagging requires discipline but pays off in visibility. When a test enters quarantine, link it to a tracking ticket. This creates accountability and prevents “quarantine and forget” patterns.
```javascript
// Use a custom tag (exact syntax varies by framework); link the tracking ticket
test.skip('checkout calculates tax', { flaky: 'JIRA-1234' }, () => {
  // Test code
});
```
Alternatively, use directory structure for teams that prefer filesystem-based organization:
```text
tests/
  stable/
  quarantine/
    checkout-tax.test.js  # Moved here when flagged
```
2. Separate Pipeline Stages
The key insight here is decoupling. Quarantined tests still run, so you maintain visibility into their behavior. But they don’t block deployments, so engineers aren’t forced to ignore failures or rerun builds.
```yaml
# GitLab CI example
stages:
  - stable-tests
  - quarantine-tests

stable-tests:
  stage: stable-tests
  script: npm test -- stable/
  # Blocks deployment

quarantine-tests:
  stage: quarantine-tests
  script: npm test -- quarantine/
  allow_failure: true
  # Runs but doesn't block
```
3. Monitor Quarantine Metrics
Quarantine without monitoring becomes a graveyard. Tests enter and never leave. The quarantine percentage creeps up until it represents a significant portion of your suite.
Track these three signals to prevent quarantine from becoming a dumping ground:
- Quarantine size over time — Is it growing or shrinking? Growth means you’re adding flaky tests faster than you’re fixing them.
- Days in quarantine per test — Set a maximum (30 days is reasonable). Tests older than that need a decision: fix or delete.
- Entry/exit rate — Healthy quarantine has balanced flow. Tests enter, get fixed, and exit. One-way flow into quarantine is a warning sign.
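A sketch of the entry/exit bookkeeping, assuming you log an event each time a test enters or leaves quarantine (the event shape is illustrative):

```javascript
// Summarize quarantine flow for a period. A persistently positive net
// means tests are entering faster than they're being fixed.
function quarantineFlow(events) {
  let entries = 0;
  let exits = 0;
  for (const event of events) {
    if (event.type === 'enter') entries += 1;
    else if (event.type === 'exit') exits += 1;
  }
  return { entries, exits, net: entries - exits };
}
```

Run it per sprint or per month; the trend in `net` matters more than any single value.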
4. Prevent Quarantine Rot
Enforce quarantine limits in CI. When the quarantine grows too large, make it visible by failing builds. This creates pressure to fix or delete rather than accumulate.
```yaml
- name: Quarantine size check
  run: |
    COUNT=$(find tests/quarantine -name "*.test.js" | wc -l)
    TOTAL=$(find tests -name "*.test.js" | wc -l)
    PERCENTAGE=$((COUNT * 100 / TOTAL))
    if [ $PERCENTAGE -gt 10 ]; then
      echo "::error::Quarantine exceeds 10% of test suite"
      exit 1
    fi
```
Ten percent is a reasonable ceiling. Beyond that, quarantine stops being a temporary holding area and starts being a second test suite that nobody trusts.
How Pipeline Hardening Prevents Flakiness
Detection catches flakiness after it exists. Hardening prevents it from entering the pipeline at all.
1. Pre-Merge Stability Gates
Require new tests to pass multiple times:
```yaml
stability-gate:
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    # "|| true" keeps the job alive when no test files changed
    - NEW_TESTS=$(git diff --name-only origin/main | grep '\.test\.' || true)
    - |
      for test in $NEW_TESTS; do
        for i in $(seq 1 5); do
          npm test -- $test || exit 1
        done
      done
```
2. Environment Consistency
Version drift is a silent flakiness generator. Pin everything:
```yaml
# Pin everything
services:
  db:
    image: postgres:14.8   # Not postgres:14 or postgres:latest
  redis:
    image: redis:7.0.12
env:
  NODE_VERSION: 18.17.1
  BROWSER_VERSION: chrome-115
```
Use containers for browser tests:
```yaml
browser-tests:
  image: mcr.microsoft.com/playwright:v1.38.0
  # Identical environment everywhere
```
3. Resource Isolation
Tests competing for CPU, memory, or ports create timing-dependent failures:
```yaml
# Limit parallelism to prevent resource contention
parallel:
  matrix:
    - SHARD: [1, 2, 3, 4]
max-concurrent: 2  # Don't starve the machine
```
Allocate dedicated resources per shard:
```yaml
services:
  - name: postgres
    alias: db-$SHARD  # Each shard gets its own DB
```
Why Parallel Execution Creates Flakiness
Parallelism speeds up CI but introduces new flakiness vectors. These three show up most often.
1. Race Conditions
Tests running simultaneously can interfere:
```javascript
// Both tests hit the same user record
// Whichever finishes first wins
test('update user name', async () => {
  await db.update('users', { id: 1, name: 'Test A' });
});

test('update user email', async () => {
  await db.update('users', { id: 1, email: '[email protected]' });
});
```
Fix: Use unique test data with proper test isolation:
```javascript
test('update user name', async () => {
  const userId = await createTestUser(); // Unique row per test
  await db.update('users', { id: userId, name: 'Test A' });
});
```
2. Port Conflicts
Multiple test shards try to bind the same port. This is especially common with integration tests that spin up servers.
```text
Error: EADDRINUSE: port 3000 already in use
```
The error is intermittent because it depends on execution order. Sometimes one shard finishes before another starts. Sometimes they collide.
Fix: Dynamic port allocation. Let the OS assign an available port instead of hardcoding:
```javascript
const server = app.listen(0); // Port 0: let the OS assign a free port
const port = server.address().port;
```
For tests that need predictable ports, use shard-based ranges:
```shell
# Compute the per-shard range in a shell step
# (YAML env blocks don't evaluate arithmetic)
export TEST_PORT=$((3000 + SHARD * 100))
```
3. Shared State Across Workers
When workers share databases, caches, or file systems, one worker’s setup can corrupt another worker’s assertions. Test A creates a user, Test B deletes all users, Test A’s assertion fails. The test isn’t flaky. The isolation is broken.
Fix: Worker-specific isolation. Give each worker its own sandbox:
```javascript
// Each worker gets its own database schema
const schema = `test_worker_${process.env.JEST_WORKER_ID}`;

beforeAll(() => db.createSchema(schema));
afterAll(() => db.dropSchema(schema));
```
Measuring Flaky Test Costs
What gets measured gets managed. These metrics make the cost impossible to ignore.
1. Key Metrics to Track
| Metric | How to Calculate | Target |
|---|---|---|
| Flaky rate | Tests with mixed pass/fail same commit / total tests | <2% |
| CI false positive rate | Failed builds without code issues / total builds | <1% |
| Mean time to investigate | Time from failure to resolution | <15 min |
| Quarantine percentage | Quarantined tests / total tests | <5% |
| Retry compute cost | Retry job minutes / total job minutes | <10% |
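The flaky-rate row can be computed straight from per-run results. A sketch, assuming each run is recorded with test name, commit, and outcome (the field names are made up):

```javascript
// A test is flaky if it both passed and failed on the same commit.
function flakyRate(runs) {
  const outcomes = new Map(); // "test@commit" -> Set of pass/fail outcomes
  const tests = new Set();
  for (const { test, commit, passed } of runs) {
    tests.add(test);
    const key = `${test}@${commit}`;
    if (!outcomes.has(key)) outcomes.set(key, new Set());
    outcomes.get(key).add(passed);
  }
  const flakyTests = new Set();
  for (const [key, results] of outcomes) {
    if (results.size === 2) flakyTests.add(key.split('@')[0]);
  }
  return flakyTests.size / tests.size;
}
```

Keying on the commit is what separates flakiness from regressions: a test that fails only after a code change is doing its job.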
2. Dashboard Components
Build dashboards with three essential views:
- Flaky test leaderboard — Top 10 flakiest tests, sortable by failure rate, days in repo, and cost (retries × build time)
- Trend graphs — Flaky rate over time. Are you getting better or worse?
- Developer impact — Hours lost to flaky test investigation per week. This number gets leadership attention.
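To turn the developer-impact view into dollars, a rough model (every input is an assumption to replace with your own numbers):

```javascript
// Weekly cost: false failures × investigation time × loaded hourly rate.
function weeklyFlakyCost({ falseFailures, minutesPerInvestigation, hourlyRate }) {
  const hoursLost = (falseFailures * minutesPerInvestigation) / 60;
  return { hoursLost, dollars: hoursLost * hourlyRate };
}

// e.g. 40 false failures/week, 30 min each, $100/hr loaded cost:
weeklyFlakyCost({ falseFailures: 40, minutesPerInvestigation: 30, hourlyRate: 100 });
// → { hoursLost: 20, dollars: 2000 }
```

A weekly dollar figure on the dashboard lands harder with leadership than an abstract flakiness percentage.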
3. Sample Datadog/Grafana Query
```text
avg:ci.test.flaky_rate{env:production, team:*} by {team}
```
Display prominently. Teams start paying attention when their name appears on a public dashboard.
Once you have this visibility, patterns emerge. Some teams discover their flaky rate is manageable with better retry policies. Others realize they’re spending more on test maintenance than feature development. That clarity is the point. You can’t fix what you can’t see, and you can’t prioritize what you can’t quantify.
Teams hitting the latter wall often find that autonomous testing changes the equation entirely. When tests adapt to timing and UI changes automatically, the flakiness vectors covered in this guide stop being daily fires.
Your Pipeline Should Be a Confidence Machine
Every false positive erodes the value of every real signal. That’s the math of flaky tests in CI/CD.
When your pipeline says a test failed, engineers should investigate the code, not the infrastructure. That’s the bar. Anything less means real bugs slip through while teams chase ghosts.
Zero flaky tests isn’t realistic with scripted automation. Some timing sensitivity is inevitable when tests encode implementation details. But teams using Pie are experiencing exactly that: pipelines where every failure is a real bug, not infrastructure noise.
If you’ve implemented everything here and test maintenance still dominates your sprint cycles, the problem isn’t configuration. It’s the scripted testing model itself.
Frequently Asked Questions
Should flaky tests block deployments?
Not unless they cover critical flows. Blocking on known-flaky tests trains engineers to ignore failures. Better to quarantine and track separately than to cry wolf on every build.

How many retries are too many?
Three is the practical maximum. More retries mask real failures. If a test needs more than three attempts to pass consistently, fix the root cause instead of hiding it with retries.

Can flaky tests run in a separate pipeline?
Yes, and you should. Stable tests gate deployments; flaky tests run in parallel for monitoring. This preserves signal quality without losing coverage intent.

How much do flaky tests actually cost?
Atlassian found flaky tests caused 21% of their master build failures and wasted 150,000 developer hours annually. Each false positive adds 20-60 minutes of investigation time.

What percentage of quarantined tests is acceptable?
Keep quarantine under 5% of your total test suite. Higher than 10% signals a systemic problem with test design, not individual test flakiness. If quarantine keeps growing, stop adding tests and fix the root causes.

How do I make the cost visible to leadership?
Track retry compute costs and developer investigation time. Convert hours to dollars using average engineering salary. A dashboard showing weekly cost in dollars gets attention faster than abstract flakiness percentages.

Should I fix flaky tests or delete them?
Depends on what they cover. If the test validates critical functionality, fix it. If it tests edge cases that rarely matter in production, delete it. A smaller, reliable suite beats a large, flaky one.

Why do tests pass locally but fail in CI?
Usually environmental differences. Local machines have more resources, consistent timing, and no parallelism pressure. CI environments share resources, run tests concurrently, and have variable network latency. Pin versions, isolate resources, and run tests in containers to match environments.
References
- Atlassian Engineering (2024). “Reducing Flaky Test Impact on Developer Productivity”
- Google Engineering (2023). “Flaky Tests at Google and How We Mitigate Them”
- Microsoft Engineering (2024). “Improving Developer Productivity via Flaky Test Management”
- CircleCI (2025). “How to Reduce Flaky Test Failures in CI/CD”
- Trunk.io (2025). “The Ultimate Guide to Flaky Tests in CI Pipelines”
13 years building mobile infrastructure at Square, Facebook, and Instacart. Payment systems, video platforms, the works. Now building the QA platform he wished existed the whole time.