How to Fix Flaky Tests in CI/CD: Detection, Quarantine, and Pipeline Hardening
Flaky tests caused 21% of Atlassian's master branch failures. Learn pipeline-specific detection, smart retry strategies, and hardening patterns that restore trust.
What you’ll learn
- How flaky tests compound CI/CD costs beyond build time
- Pipeline patterns that detect flakiness before merge
- Retry and quarantine strategies specific to CI/CD
- Metrics that make flaky test costs visible to leadership
Your CI pipeline should be a confidence machine. Code passes tests, gets deployed, ships to users. Simple.
Instead, it’s a lottery. The same commit fails on the first run, passes on the second. Engineers learn to re-run builds reflexively. “It’s probably flaky” becomes the default response to red builds. And that’s when real bugs start slipping through.
Atlassian’s engineering team quantified what this costs: flaky tests caused 21% of their master branch build failures and wasted an estimated 150,000 developer hours annually. Google’s research found similar patterns: 16% of their tests exhibit flaky behavior, and 84% of pass-to-fail transitions come from flakiness rather than real bugs.
This guide focuses specifically on the CI/CD context. You’ll learn how to detect flakiness before it enters your pipeline, implement retry strategies that don’t mask real failures, and harden your pipeline against the slow creep of test rot.
The CI/CD Impact
Flaky tests cost more in CI/CD than anywhere else. The damage compounds across two dimensions. Direct resource costs are easy to measure. Trust erosion is harder to quantify but more expensive.
Direct Costs
- Build time waste — Each flaky test run consumes compute resources. A 30-minute test suite with a 10% flaky failure rate means roughly 3 minutes of wasted time per build, multiplied across dozens of daily builds.
- Retry multiplication — When your policy is “re-run failed tests,” each flaky test doubles its execution cost. Some teams report 20-30% of CI compute spent on retries alone.
- Queue delays — Failed builds block deployment queues. While engineers investigate or wait for re-runs, other commits pile up. Merge conflicts increase. Context switching costs compound.
Indirect Costs
- Trust erosion — When engineers expect false positives, they stop investigating failures. Real bugs slip through because “it’s probably flaky” becomes the default assumption.
- Alert fatigue — Teams disable notifications for known-flaky tests. Then they disable notifications for tests near known-flaky tests. Eventually, real failures get lost in the noise.
- Velocity drag — Flaky tests consistently rank among the top causes of developer “bad days” in productivity research. Engineers spend time debugging infrastructure instead of shipping features.
A CI pipeline that lies 10% of the time is not 90% trustworthy. It’s 0% trustworthy. Engineers can’t afford to investigate every failure, so they investigate none. That’s when real bugs ship.
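The compounding is easy to see with a toy calculation (the numbers below are illustrative, not from the studies cited above):

```javascript
// Probability that a suite run shows at least one false failure, given
// N independent tests that each flake with probability p on any run.
function falseFailureRate(testCount, perTestFlakeRate) {
  return 1 - Math.pow(1 - perTestFlakeRate, testCount);
}

// 500 tests, each "only" 0.1% flaky: roughly 39% of builds fail for no reason
console.log(falseFailureRate(500, 0.001));
```

Independence between tests is a simplification, but the direction holds: per-test flake rates that look negligible compound into suite-level noise.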
Three Ways to Detect Flaky Tests Before They Merge
You can’t fix flakiness you don’t measure. These three detection patterns work well together, catching flaky tests at different stages of their lifecycle.
1. Repeated Execution on New Tests
Catch flakiness before merge by running new or modified tests multiple times. If any run fails, that test deserves investigation before it enters your main branch.
```yaml
# GitHub Actions example
jobs:
  stability-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history, so origin/main is available to diff against
      - name: Get changed test files
        id: changed
        run: |
          echo "files=$(git diff --name-only origin/main...HEAD | grep '\.test\.' | tr '\n' ' ')" >> $GITHUB_OUTPUT
      - name: Run changed tests 5 times
        if: steps.changed.outputs.files != ''
        run: |
          for i in {1..5}; do
            echo "Run $i of 5"
            npm test -- ${{ steps.changed.outputs.files }}
          done
```
Any failure across runs indicates potential flakiness worth investigating before merge.
2. Historical Failure Tracking
Track pass/fail rates across commits. Most CI platforms provide this data, and the pattern reveals which tests drift into flakiness over time. The insight isn’t just which tests are flaky now, but which tests are becoming flaky. A test that passed 100% last month but is at 95% this week is a leading indicator worth investigating.
```shell
# CircleCI Insights API: top flaky tests for a project
# (replace gh/ORG/REPO with your project slug; field names may vary by API version)
curl -s -H "Circle-Token: $CIRCLECI_TOKEN" \
  "https://circleci.com/api/v2/insights/gh/ORG/REPO/flaky-tests" \
  | jq '."flaky-tests" | sort_by(."times-flaked") | reverse | .[:10]'
```

This query surfaces your top offenders by how often they’ve flaked recently. Run it weekly to spot trends before they become emergencies.
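A sketch of that drift check, assuming you persist per-test pass rates for two measurement windows (the storage shape here is hypothetical):

```javascript
// Flag tests whose pass rate dropped between two windows, even if the
// current rate still looks "mostly green" (e.g. 1.00 -> 0.95).
function driftingTests(lastMonth, thisWeek, dropThreshold = 0.03) {
  return Object.keys(thisWeek).filter((name) => {
    const before = lastMonth[name];
    return before !== undefined && before - thisWeek[name] >= dropThreshold;
  });
}

const lastMonth = { 'checkout calculates tax': 1.0, 'tooltip fades': 0.97 };
const thisWeek = { 'checkout calculates tax': 0.95, 'tooltip fades': 0.97 };
console.log(driftingTests(lastMonth, thisWeek)); // ['checkout calculates tax']
```

Tests that are new this week are skipped; they belong to the pre-merge stability check instead.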
Build alerts for tests crossing flakiness thresholds:

```yaml
- name: Check flaky rate
  run: |
    FLAKY_RATE=$(jq '.flakyRate' test-results/flaky-report.json)
    if (( $(echo "$FLAKY_RATE > 0.05" | bc -l) )); then
      echo "::warning::Flaky rate ${FLAKY_RATE} exceeds 5% threshold"
    fi
```
3. Pass-on-Retry Flagging
When a test fails then passes on retry, flag it. This pattern catches tests that weren’t flaky when introduced but became flaky over time.
```javascript
// jest-circus custom reporter; assumes retries are enabled via jest.retryTimes()
class FlakyReporter {
  onTestCaseResult(test, testCaseResult) {
    // Passed, but only after at least one retry: the pass-on-retry signature
    if (testCaseResult.status === 'passed' && testCaseResult.invocations > 1) {
      this.flagFlaky(test.path, testCaseResult.fullName);
    }
  }
  flagFlaky(path, name) {
    // Persist wherever your flake tracking lives (file, DB, ticket system)
    console.log(`FLAKY: ${name} (${path})`);
  }
}
module.exports = FlakyReporter;
```
Accumulate these flags. Tests repeatedly flagged across commits are systematically flaky.
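A minimal accumulator for those flags, assuming each pass-on-retry event is recorded with the test name and commit (the record shape is made up for illustration):

```javascript
// Tests flagged on several distinct commits are systematically flaky,
// not one-off environmental noise.
function systematicallyFlaky(flags, minDistinctCommits = 3) {
  const commitsByTest = new Map();
  for (const { testName, commit } of flags) {
    if (!commitsByTest.has(testName)) commitsByTest.set(testName, new Set());
    commitsByTest.get(testName).add(commit);
  }
  return [...commitsByTest]
    .filter(([, commits]) => commits.size >= minDistinctCommits)
    .map(([testName]) => testName);
}
```

Counting distinct commits rather than raw flags avoids over-weighting a single bad day on one branch.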
Four Retry Strategies That Reduce False Positives
Detection tells you which tests are flaky. Now you need a response. Retries are a double-edged sword. Used wrong, they mask real failures. Used right, they reduce false positives.
1. Immediate Retry (Same Job)
Re-run failed tests within the same CI job. This is the simplest approach:
```javascript
// Jest: enable retries via jest-circus, e.g. in a setup file —
// there is no "retries" key in jest.config
jest.retryTimes(2);
```

```shell
# Pytest: requires the pytest-rerunfailures plugin
pytest --reruns 2 --reruns-delay 1
```
Pros:
- Fast execution, no extra job overhead
- Simple configuration
Cons:
- Same environment conditions persist
- If flakiness is environmental, retry won’t help
2. Delayed Retry (New Job)
When same-environment retries don’t help, spin up a fresh environment:
```yaml
jobs:
  test:
    # ... test steps ...
  retry-failed:
    needs: test
    if: ${{ failure() }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Re-run failed tests only
        # Jest's --onlyFailures reads its cache; share the cache directory
        # between jobs (e.g. with actions/cache) for this to work cross-job
        run: npm test -- --onlyFailures
```
Pros:
- Fresh environment isolates environmental flakiness
- More reliable signal
Cons:
- Longer feedback loop
- More compute cost
3. The Three-Strike Rule
A practical balance: retry twice before declaring failure.
```yaml
# GitLab CI: retry the job on failure, three attempts total
test:
  retry:
    max: 2
    when: script_failure
  script:
    - npm test
```
Why three attempts total?
- First failure: Could be real or flaky
- Second failure: Less likely to be coincidence
- Third failure: Almost certainly a real issue
More than three retries hides real problems. If a test needs four attempts to pass, fix it. Don’t retry it.
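The arithmetic behind the rule, under the simplifying assumption that flaky failures are independent with a fixed per-run probability:

```javascript
// Chance that flakiness alone produces k consecutive failures.
function consecutiveFailureProb(flakyFailRate, attempts) {
  return Math.pow(flakyFailRate, attempts);
}

// A test that falsely fails 10% of runs:
consecutiveFailureProb(0.1, 1); // 0.1     — one red run is weak evidence
consecutiveFailureProb(0.1, 3); // ≈ 0.001 — three in a row is almost certainly real
```

Past three attempts the extra certainty is negligible, while the odds of papering over a genuine intermittent bug keep growing.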
4. Selective Retry
Not all tests deserve retries. Retrying a critical path test masks real failures. Retrying a fragile animation test saves investigation time. The difference matters.
Apply retries based on test characteristics, not blanket policies:
```javascript
// Tag tests by priority (Cypress-style per-suite retry configuration)
describe('Checkout flow', { retries: 0 }, () => {
  // Critical tests fail immediately
});

describe('Tooltip animations', { retries: 2 }, () => {
  // Known-fragile tests get retries
});
```
This distinction preserves signal quality where it matters most. A flaky checkout test should trigger immediate investigation. A flaky tooltip test can retry twice before escalating.
When to Quarantine Flaky Tests
Retries help, but some tests are too flaky to remain in the critical path. Quarantine removes them from blocking deployments while preserving visibility for repair.
1. Tag Flaky Tests
Tagging requires discipline but pays off in visibility. When a test enters quarantine, link it to a tracking ticket. This creates accountability and prevents “quarantine and forget” patterns.
```javascript
// Use a custom tag (exact syntax varies by framework); link the tracking ticket
test.skip('checkout calculates tax', { flaky: 'JIRA-1234' }, () => {
  // Test code
});
```
Alternatively, use directory structure for teams that prefer filesystem-based organization:
```text
tests/
  stable/
  quarantine/
    checkout-tax.test.js  # Moved here when flagged
```
2. Separate Pipeline Stages
The key insight here is decoupling. Quarantined tests still run, so you maintain visibility into their behavior. But they don’t block deployments, so engineers aren’t forced to ignore failures or rerun builds.
```yaml
# GitLab CI example
stages:
  - stable-tests
  - quarantine-tests

stable-tests:
  stage: stable-tests
  script: npm test -- stable/
  # Blocks deployment

quarantine-tests:
  stage: quarantine-tests
  script: npm test -- quarantine/
  allow_failure: true
  # Runs but doesn't block
```
3. Monitor Quarantine Metrics
Quarantine without monitoring becomes a graveyard. Tests enter and never leave. The quarantine percentage creeps up until it represents a significant portion of your suite.
Track these three signals to prevent quarantine from becoming a dumping ground:
- Quarantine size over time — Is it growing or shrinking? Growth means you’re adding flaky tests faster than you’re fixing them.
- Days in quarantine per test — Set a maximum (30 days is reasonable). Tests older than that need a decision: fix or delete.
- Entry/exit rate — Healthy quarantine has balanced flow. Tests enter, get fixed, and exit. One-way flow into quarantine is a warning sign.
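A sketch of the entry/exit bookkeeping, assuming you log an event each time a test enters or leaves quarantine (the event shape is illustrative):

```javascript
// Summarize quarantine flow for a period. A persistently positive net
// means tests are entering faster than they're being fixed.
function quarantineFlow(events) {
  let entries = 0;
  let exits = 0;
  for (const event of events) {
    if (event.type === 'enter') entries += 1;
    else if (event.type === 'exit') exits += 1;
  }
  return { entries, exits, net: entries - exits };
}
```

Run it per sprint or per month; the trend in `net` matters more than any single value.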
4. Prevent Quarantine Rot
Enforce quarantine limits in CI. When the quarantine grows too large, make it visible by failing builds. This creates pressure to fix or delete rather than accumulate.
```yaml
- name: Quarantine size check
  run: |
    COUNT=$(find tests/quarantine -name "*.test.js" | wc -l)
    TOTAL=$(find tests -name "*.test.js" | wc -l)
    PERCENTAGE=$((COUNT * 100 / TOTAL))
    if [ $PERCENTAGE -gt 10 ]; then
      echo "::error::Quarantine exceeds 10% of test suite"
      exit 1
    fi
```
Ten percent is a reasonable ceiling. Beyond that, quarantine stops being a temporary holding area and starts being a second test suite that nobody trusts.
How Pipeline Hardening Prevents Flakiness
Detection catches flakiness after it exists. Hardening prevents it from entering the pipeline at all.
1. Pre-Merge Stability Gates
Require new tests to pass multiple times:
```yaml
stability-gate:
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    # "|| true" keeps the job alive when no test files changed
    - NEW_TESTS=$(git diff --name-only origin/main | grep '\.test\.' || true)
    - |
      for test in $NEW_TESTS; do
        for i in $(seq 1 5); do
          npm test -- $test || exit 1
        done
      done
```
2. Environment Consistency
Version drift is a silent flakiness generator. Pin everything:
```yaml
# Pin everything
services:
  db:
    image: postgres:14.8   # Not postgres:14 or postgres:latest
  redis:
    image: redis:7.0.12
env:
  NODE_VERSION: 18.17.1
  BROWSER_VERSION: chrome-115
```
Use containers for browser tests:
```yaml
browser-tests:
  image: mcr.microsoft.com/playwright:v1.38.0
  # Identical environment everywhere
```
3. Resource Isolation
Tests competing for CPU, memory, or ports create timing-dependent failures:
```yaml
# Limit parallelism to prevent resource contention
parallel:
  matrix:
    - SHARD: [1, 2, 3, 4]
max-concurrent: 2  # Don't starve the machine
```
Allocate dedicated resources per shard:
```yaml
services:
  - name: postgres
    alias: db-$SHARD  # Each shard gets its own DB
```
Why Parallel Execution Creates Flakiness
Parallelism speeds up CI but introduces new flakiness vectors. These three show up most often.
1. Race Conditions
Tests running simultaneously can interfere:
```javascript
// Both tests hit the same user record
// Whichever finishes first wins
test('update user name', async () => {
  await db.update('users', { id: 1, name: 'Test A' });
});

test('update user email', async () => {
  await db.update('users', { id: 1, email: '[email protected]' });
});
```
Fix: Use unique test data with proper test isolation:
```javascript
test('update user name', async () => {
  const userId = await createTestUser(); // Unique row per test
  await db.update('users', { id: userId, name: 'Test A' });
});
```
2. Port Conflicts
Multiple test shards try to bind the same port. This is especially common with integration tests that spin up servers.
```text
Error: EADDRINUSE: port 3000 already in use
```
The error is intermittent because it depends on execution order. Sometimes one shard finishes before another starts. Sometimes they collide.
Fix: Dynamic port allocation. Let the OS assign an available port instead of hardcoding:
```javascript
const server = app.listen(0); // Port 0: let the OS assign a free port
const port = server.address().port;
```
For tests that need predictable ports, use shard-based ranges:
```shell
# Compute the per-shard range in a shell step
# (YAML env blocks don't evaluate arithmetic)
export TEST_PORT=$((3000 + SHARD * 100))
```
3. Shared State Across Workers
When workers share databases, caches, or file systems, one worker’s setup can corrupt another worker’s assertions. Test A creates a user, Test B deletes all users, Test A’s assertion fails. The test isn’t flaky. The isolation is broken.
Fix: Worker-specific isolation. Give each worker its own sandbox:
```javascript
// Each worker gets its own database schema
const schema = `test_worker_${process.env.JEST_WORKER_ID}`;

beforeAll(() => db.createSchema(schema));
afterAll(() => db.dropSchema(schema));
```
Measuring Flaky Test Costs
What gets measured gets managed. These metrics make the cost impossible to ignore.
1. Key Metrics to Track
| Metric | How to Calculate | Target |
|---|---|---|
| Flaky rate | Tests with mixed pass/fail same commit / total tests | <2% |
| CI false positive rate | Failed builds without code issues / total builds | <1% |
| Mean time to investigate | Time from failure to resolution | <15 min |
| Quarantine percentage | Quarantined tests / total tests | <5% |
| Retry compute cost | Retry job minutes / total job minutes | <10% |
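The flaky-rate row can be computed straight from per-run results. A sketch, assuming each run is recorded with test name, commit, and outcome (the field names are made up):

```javascript
// A test is flaky if it both passed and failed on the same commit.
function flakyRate(runs) {
  const outcomes = new Map(); // "test@commit" -> Set of pass/fail outcomes
  const tests = new Set();
  for (const { test, commit, passed } of runs) {
    tests.add(test);
    const key = `${test}@${commit}`;
    if (!outcomes.has(key)) outcomes.set(key, new Set());
    outcomes.get(key).add(passed);
  }
  const flakyTests = new Set();
  for (const [key, results] of outcomes) {
    if (results.size === 2) flakyTests.add(key.split('@')[0]);
  }
  return flakyTests.size / tests.size;
}
```

Keying on the commit is what separates flakiness from regressions: a test that fails only after a code change is doing its job.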
2. Dashboard Components
Build dashboards with three essential views:
- Flaky test leaderboard — Top 10 flakiest tests, sortable by failure rate, days in repo, and cost (retries × build time)
- Trend graphs — Flaky rate over time. Are you getting better or worse?
- Developer impact — Hours lost to flaky test investigation per week. This number gets leadership attention.
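To turn the developer-impact view into dollars, a rough model (every input is an assumption to replace with your own numbers):

```javascript
// Weekly cost: false failures × investigation time × loaded hourly rate.
function weeklyFlakyCost({ falseFailures, minutesPerInvestigation, hourlyRate }) {
  const hoursLost = (falseFailures * minutesPerInvestigation) / 60;
  return { hoursLost, dollars: hoursLost * hourlyRate };
}

// e.g. 40 false failures/week, 30 min each, $100/hr loaded cost:
weeklyFlakyCost({ falseFailures: 40, minutesPerInvestigation: 30, hourlyRate: 100 });
// → { hoursLost: 20, dollars: 2000 }
```

A weekly dollar figure on the dashboard lands harder with leadership than an abstract flakiness percentage.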
3. Sample Datadog/Grafana Query
```text
avg:ci.test.flaky_rate{env:production, team:*} by {team}
```
Display prominently. Teams start paying attention when their name appears on a public dashboard.
Once you have this visibility, patterns emerge. Some teams discover their flaky rate is manageable with better retry policies. Others realize they’re spending more on test maintenance than feature development. That clarity is the point. You can’t fix what you can’t see, and you can’t prioritize what you can’t quantify.
Teams hitting the latter wall often find that autonomous testing changes the equation entirely. When tests adapt to timing and UI changes automatically, the flakiness vectors covered in this guide stop being daily fires.
Your Pipeline Should Be a Confidence Machine
Every false positive erodes the value of every real signal. That’s the math of flaky tests in CI/CD.
When your pipeline says a test failed, engineers should investigate the code, not the infrastructure. That’s the bar. Anything less means real bugs slip through while teams chase ghosts.
Zero flaky tests isn’t realistic with scripted automation. Some timing sensitivity is inevitable when tests encode implementation details. But teams using Pie are experiencing exactly that: pipelines where every failure is a real bug, not infrastructure noise.
If you’ve implemented everything here and test maintenance still dominates your sprint cycles, the problem isn’t configuration. It’s the scripted testing model itself.
Frequently Asked Questions
Should flaky tests block deployments?
Not unless they cover critical flows. Blocking on known-flaky tests trains engineers to ignore failures. Better to quarantine and track separately than to cry wolf on every build.

How many retries are too many?
Three is the practical maximum. More retries mask real failures. If a test needs more than three attempts to pass consistently, fix the root cause instead of hiding it with retries.

Can flaky tests run in a separate pipeline?
Yes, and you should. Stable tests gate deployments; flaky tests run in parallel for monitoring. This preserves signal quality without losing coverage intent.

How much do flaky tests actually cost?
Atlassian found flaky tests caused 21% of their master build failures and wasted 150,000 developer hours annually. Each false positive adds 20-60 minutes of investigation time.

What percentage of quarantined tests is acceptable?
Keep quarantine under 5% of your total test suite. Higher than 10% signals a systemic problem with test design, not individual test flakiness. If quarantine keeps growing, stop adding tests and fix the root causes.

How do I make the cost visible to leadership?
Track retry compute costs and developer investigation time. Convert hours to dollars using average engineering salary. A dashboard showing weekly cost in dollars gets attention faster than abstract flakiness percentages.

Should I fix flaky tests or delete them?
Depends on what they cover. If the test validates critical functionality, fix it. If it tests edge cases that rarely matter in production, delete it. A smaller, reliable suite beats a large, flaky one.

Why do tests pass locally but fail in CI?
Usually environmental differences. Local machines have more resources, consistent timing, and no parallelism pressure. CI environments share resources, run tests concurrently, and have variable network latency. Pin versions, isolate resources, and run tests in containers to match environments.
References
- Atlassian Engineering (2024). “Reducing Flaky Test Impact on Developer Productivity”
- Google Engineering (2023). “Flaky Tests at Google and How We Mitigate Them”
- Microsoft Engineering (2024). “Improving Developer Productivity via Flaky Test Management”
- CircleCI (2025). “How to Reduce Flaky Test Failures in CI/CD”
- Trunk.io (2025). “The Ultimate Guide to Flaky Tests in CI Pipelines”
13 years building mobile infrastructure at Square, Facebook, and Instacart. Payment systems, video platforms, the works. Now building the QA platform he wished existed the whole time.