Non-Deterministic Tests: What They Are and How to Fix Them
Non-deterministic tests produce different results on identical code. Six sources of non-determinism and how to eliminate each one from your test suite.
What you’ll learn
- Six specific causes of non-deterministic test behavior
- Why non-determinism makes continuous delivery impossible
- How to identify non-deterministic tests in your suite
- When to fix non-deterministic tests vs. quarantine them
A test passes. You run it again. It fails. Nothing changed.
This is non-determinism, also known as flaky test behavior. The outcome changes without the code changing. It’s why engineers stop trusting automated tests, why CI pipelines get ignored, and why teams revert to manual testing after investing months in automation.
The pattern shows up everywhere. What’s changed recently is the proliferation of sources. Modern applications integrate with more services, run in more environments, and increasingly incorporate AI features that produce variable outputs by design.
Understanding where non-determinism comes from is the first step toward eliminating it. Or, when elimination isn’t possible, containing it.
What Makes a Test Non-Deterministic
A non-deterministic test produces different results from identical inputs. A timestamp shifted between runs. Another test left data behind. The CI server was under heavier load than your laptop. These are the kinds of factors that sneak into test execution and change outcomes without changing code.
Compare that to a deterministic test, which produces the same result every time when given the same inputs. One input, one output, always. That’s the test you can trust.
With non-deterministic tests, a failure might indicate a bug. Or it might indicate that Mercury was in retrograde during the test run. You don’t know which.
The standard response is re-running. “Just run it again and see if it passes.” This is how trust erodes. Once engineers learn that failures don’t necessarily mean anything, they stop paying attention. Real bugs get dismissed as flakiness. The suite becomes noise.
Non-determinism isn’t always the test’s fault. Sometimes the application itself behaves non-deterministically, and the test correctly detects that behavior. But distinguishing between a non-deterministic test and a non-deterministic application requires investigation. Most teams don’t have time for that investigation on every failure.
6 Sources of Non-Determinism
Non-determinism doesn’t appear out of nowhere. It creeps in through specific, identifiable patterns. Once you know what to look for, you can eliminate each source systematically.
1. Time Dependencies
Any test that reads the system clock is potentially non-deterministic.
```javascript
// Non-deterministic: depends on current time
test('shows greeting based on time of day', () => {
  const greeting = getGreeting();
  expect(greeting).toBe('Good morning'); // Fails after noon
});

// Deterministic: controls time input
test('shows greeting based on time of day', () => {
  jest.useFakeTimers().setSystemTime(new Date('2026-03-15T09:00:00'));
  const greeting = getGreeting();
  expect(greeting).toBe('Good morning');
});
```
Time dependencies also appear in less obvious forms: cache expiration, session timeouts, rate limiting based on time windows.
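Those hidden forms yield to the same fix: make the clock an input. A minimal standalone sketch (outside any test framework; `createCache` and the injectable `now` function are illustrative, not a real library) showing cache expiration made deterministic:

```javascript
// A cache with an injectable clock: tests control time instead of reading it.
function createCache({ ttlMs, now = Date.now }) {
  const store = new Map();
  return {
    set(key, value) {
      store.set(key, { value, expiresAt: now() + ttlMs });
    },
    get(key) {
      const entry = store.get(key);
      if (!entry || now() >= entry.expiresAt) return undefined;
      return entry.value;
    },
  };
}

// Deterministic "fake timer": a clock the test advances explicitly.
let fakeNow = 0;
const cache = createCache({ ttlMs: 1000, now: () => fakeNow });

cache.set('user', 'Test');
console.log(cache.get('user')); // 'Test' — not expired yet
fakeNow += 1001;                // advance time past the TTL
console.log(cache.get('user')); // undefined — deterministically expired
```

The production code defaults to `Date.now`, so only tests ever notice the seam.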
2. Asynchronous Execution
Tests that don’t properly await async operations fail intermittently based on timing.
```javascript
// Non-deterministic: race condition
test('saves user data', () => {
  saveUser({ name: 'Test' });
  const user = getUser(); // Might execute before save completes
  expect(user.name).toBe('Test');
});

// Deterministic: properly awaited
test('saves user data', async () => {
  await saveUser({ name: 'Test' });
  const user = await getUser();
  expect(user.name).toBe('Test');
});
```
The most insidious cases are tests that usually await correctly but occasionally hit a race condition in setup or teardown.
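The same rule covers setup and teardown. A standalone sketch (the in-memory `db`, `insertUser`, and `truncate` are stand-ins, not a real database driver): every async step, including cleanup, is awaited so nothing leaks into the next test:

```javascript
// Standalone sketch: an in-memory "database" with async insert and cleanup.
const db = { rows: [] };

const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

async function insertUser(name) {
  await delay(5); // simulate I/O latency
  db.rows.push({ name });
}

async function truncate() {
  await delay(5); // simulate I/O latency
  db.rows.length = 0;
}

async function run() {
  await insertUser('Test');    // setup: awaited
  console.log(db.rows.length); // 1
  await truncate();            // teardown: awaited, nothing leaks forward
  console.log(db.rows.length); // 0
}

const done = run();
```

A fire-and-forget `truncate()` here would sometimes finish before the next test reads `db.rows` and sometimes after, which is exactly the intermittent setup/teardown race described above.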
3. Shared Mutable State
When tests share data, the order they run in affects their outcome.
Test A creates a user with email “test@example.com”. Test B expects no users to exist. If A runs first, B fails. If B runs first, both pass.
Test isolation solves this. Each test should set up its own state and clean up after itself.
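A standalone sketch of that isolation (the `freshDb` fixture and the two test functions are illustrative): each test builds its own state, so both orders pass:

```javascript
// Each test builds its own fixture, so execution order never matters.
function freshDb() {
  return { users: [] }; // brand-new state per test
}

function testCreatesUser() {
  const db = freshDb();
  db.users.push({ email: 'test@example.com' });
  return db.users.length === 1;
}

function testExpectsNoUsers() {
  const db = freshDb(); // unaffected by whatever other tests did
  return db.users.length === 0;
}

// Both orders pass, because neither test sees the other's data:
console.log(testCreatesUser(), testExpectsNoUsers()); // true true
console.log(testExpectsNoUsers(), testCreatesUser()); // true true
```

In a real suite the fixture usually lives in a `beforeEach` hook, but the principle is the same: state is created per test, never inherited.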
4. Test Order Dependencies
Related to shared state, but distinct: tests that explicitly depend on previous tests having run.
```javascript
// Non-deterministic: depends on previous test
let userId;

test('creates user', async () => {
  const user = await createUser({ name: 'Test' });
  userId = user.id; // Used by next test
});

test('updates user', async () => {
  await updateUser(userId, { name: 'Updated' }); // Fails if first test didn't run
});
```
Tests should be independent. Each test should be runnable in isolation.
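A sketch of the independent version (the in-memory `createUser`/`updateUser` stand-ins are illustrative): the update test creates its own user instead of reading a shared `userId`:

```javascript
// In-memory stand-ins for the createUser/updateUser calls in the example above.
let nextId = 1;
const users = new Map();

async function createUser({ name }) {
  const user = { id: nextId++, name };
  users.set(user.id, user);
  return user;
}

async function updateUser(id, fields) {
  const user = users.get(id);
  if (!user) throw new Error(`no user with id ${id}`);
  return Object.assign(user, fields);
}

// Independent version: "updates user" does its own setup, so it can run
// alone, first, last, or in parallel with anything else.
async function updatesUserTest() {
  const user = await createUser({ name: 'Test' }); // own setup
  const updated = await updateUser(user.id, { name: 'Updated' });
  return updated.name === 'Updated';
}

updatesUserTest().then(passed => console.log(passed)); // true
```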
5. Collection Ordering
Data structures with undefined iteration order can cause tests to fail when iteration happens differently.
```javascript
// Non-deterministic: relies on an ordering the data source never guaranteed
test('processes all items', async () => {
  const items = await fetchItems(); // e.g. a query with no ORDER BY
  const result = processInOrder(items);
  expect(result).toBe('a,b,c'); // rows may come back in any order
});

// Deterministic: explicit ordering
test('processes all items', async () => {
  const items = await fetchItems();
  const result = processInOrder([...items].sort());
  expect(result).toBe('a,b,c');
});
```
This is especially common in languages where hash map and set iteration order is unspecified (Java’s HashMap and HashSet, Go maps, Python sets) and in database queries without an ORDER BY clause. JavaScript’s own Set and Map preserve insertion order, so in JavaScript the unordered data usually arrives from outside the process: a query result, an API response, a directory listing.
6. External Service Dependencies
Tests calling real external services inherit that service’s variability: network latency, rate limits, occasional downtime, changing response formats.
```javascript
// Non-deterministic: real API call
test('fetches weather data', async () => {
  const res = await fetch('https://api.weather.com/current');
  const weather = await res.json();
  expect(weather.temperature).toBeDefined();
});

// Deterministic: mocked service
test('fetches weather data', async () => {
  mockFetch('https://api.weather.com/current', {
    temperature: 72,
    conditions: 'sunny'
  });
  const res = await fetch('https://api.weather.com/current');
  const weather = await res.json();
  expect(weather.temperature).toBe(72);
});
```
How Non-Determinism Destroys CI/CD
Continuous delivery requires confidence. You merge code, tests pass, it ships. That pipeline only works if test results mean something.
When tests are non-deterministic, the pipeline breaks down:
- False failures block legitimate changes — Engineers can’t merge because a test failed, but the failure has nothing to do with their code. They re-run CI. It passes. Time wasted.
- False passes hide real bugs — Worse than false failures: a non-deterministic test might pass by luck when it should have caught a bug. The bug ships.
- Signal degrades to noise — Once engineers experience enough false failures, they stop trusting failures. “Just re-run it” becomes standard procedure. Real bugs get dismissed as flakiness.
- Velocity drops — Engineers spend more time investigating false failures than writing features.
- Deployment confidence disappears — If you can’t trust tests, you can’t trust deploys. Teams revert to manual verification, eliminating the speed advantage automation was supposed to provide.
Identifying Non-Deterministic Tests
You can’t fix what you can’t find. The tricky part is that non-deterministic tests often pass 95% of the time: they fail rarely enough to escape suspicion on any single run, yet often enough to erode trust over months.
1. Repeated Execution
Run the same test multiple times. If it fails once out of 20 runs, it’s non-deterministic.
```shell
# Run the test 20 times; report which runs fail
for i in {1..20}; do
  npm test -- tests/checkout.spec.js || echo "Failed on run $i"
done
```
2. Random Test Ordering
Shuffle test execution order. If tests pass in alphabetical order but fail randomly, you have order dependencies.
```shell
# Jest (v29.5+)
npm test -- --randomize

# pytest (requires the pytest-random-order plugin)
pytest --random-order
```
3. Parallel Execution
Run tests concurrently across multiple workers. This exposes shared state issues that sequential execution masks. A test that passes when it owns the database but fails when another test writes to the same table has an isolation problem, not a code problem.
```shell
# Jest: run with up to 10 parallel workers
npm test -- --maxWorkers=10
```
4. Quarantine Tracking
Mark tests that fail without code changes. Over time, patterns emerge. The quarantine should shrink, not grow.
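One lightweight way to implement the quarantine (the test names and the `RUN_QUARANTINED` flag here are hypothetical; most frameworks also offer tagging or skip annotations): gate quarantined tests behind an environment flag so they still run in a nightly job without blocking main CI:

```javascript
// Hypothetical quarantine list: the names and env flag are illustrative.
const QUARANTINED = new Set([
  'checkout retries on gateway timeout',
  'search paginates past 100 results',
]);

// Main CI skips quarantined tests; a nightly job sets RUN_QUARANTINED=1
// so they keep running and the list can be tracked over time.
function shouldRun(testName) {
  if (process.env.RUN_QUARANTINED === '1') return true;
  return !QUARANTINED.has(testName);
}

console.log(shouldRun('login works'));                         // true
console.log(shouldRun('checkout retries on gateway timeout')); // false unless RUN_QUARANTINED=1
```

Keeping the list in code, under review, is what makes the quarantine visible: every addition shows up in a diff, and a shrinking list is easy to verify.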
Fixing vs. Quarantining
Not every non-deterministic test deserves immediate fixing. Prioritize based on coverage value.
Fix immediately:
- Tests covering critical paths (checkout, authentication, payment)
- Tests with high failure rates (>10% flakiness)
- Tests blocking deployments frequently
Quarantine and schedule:
- Tests covering lower-priority features
- Tests with low failure rates that rarely block
- Tests requiring significant refactoring to fix
Delete:
- Tests covering trivial functionality
- Tests that have been quarantined for months
- Tests where fixing cost exceeds rewriting cost
Quarantine isn’t forgiveness. It’s triage. A test in quarantine isn’t helping your regression coverage. Track quarantine size over time; it should decrease.
Testing Non-Deterministic Systems
Some systems are genuinely non-deterministic. AI features producing varied outputs. Recommendation engines showing different results. A/B tests with randomized variations.
Traditional deterministic testing doesn’t apply to these systems. Four approaches work instead:
- Property-based testing — Verify that outputs meet invariants without requiring exact matches. Check that recommendations exist, fall within expected counts, and contain valid items rather than matching specific products.
```javascript
// Don't check exact output
test('recommendation engine returns products', async () => {
  const recs = await getRecommendations(userId);
  // Verify properties, not exact values
  expect(recs.length).toBeGreaterThan(0);
  expect(recs.length).toBeLessThanOrEqual(10);
  expect(recs.every(r => r.inStock)).toBe(true);
});
```
- Snapshot ranges — Verify outputs fall within acceptable bounds rather than matching exactly. A response time between 100-500ms passes; exact match to 247ms would fail on natural variation.
- Service virtualization — Capture real service responses once and replay them deterministically. The non-determinism happens during recording; tests replay the same response every time.
- Vision-based testing — Verify what users actually see rather than internal state. For AI-generated content, check that output looks reasonable without requiring exact matches. Platforms like Pie use this approach to test applications with inherently variable outputs.
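Of the four, snapshot ranges reduce to a simple bounds check. A minimal sketch (`withinRange` is illustrative, not a framework API):

```javascript
// Assert bounds instead of exact values that vary from run to run.
function withinRange(value, min, max) {
  return value >= min && value <= max;
}

const responseTimeMs = 247; // varies naturally between runs
console.log(withinRange(responseTimeMs, 100, 500)); // true — passes despite variation
```

The bounds should come from known-good behavior, not from one observed run; bounds tuned to a single run are just an exact match with extra steps.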
For applications incorporating AI features, the testing paradigm needs to shift. You’re not verifying exact outputs anymore. You’re verifying that outputs fall within acceptable ranges and that the user experience makes sense.
Determinism Is a Design Choice
Non-deterministic tests aren’t inevitable. They’re symptoms of hidden dependencies: time, async execution, shared state, test ordering, collection iteration, and external services.
Eliminate the dependencies, and the tests become deterministic. Mock time. Await properly. Isolate state. Randomize test order to catch hidden dependencies. Mock external services.
For systems that are genuinely non-deterministic (AI features, recommendation engines, randomized content), shift from exact-match testing to property-based and vision-based verification. The goal is confidence that the system behaves correctly, not that it produces identical outputs.
Your test suite should reflect your code quality. When it reflects timing luck or execution order instead, something is broken. Fix it.
Frequently Asked Questions
Is a non-deterministic test the same thing as a flaky test?
They're the same thing. 'Flaky' is the colloquial term; 'non-deterministic' is the technical one. Both describe tests that produce different results without code changes.
What's an acceptable rate of flaky tests?
Zero is the goal. Any test that can fail without a code change erodes team trust in the test suite. Google's research found 16% of their tests exhibit flaky behavior, and they invest heavily in reducing it. This is a problem to solve, not a benchmark to accept.
Should I fix or delete a non-deterministic test?
Depends on coverage value. A non-deterministic test covering critical checkout flow is worth fixing. A non-deterministic test validating a loading spinner animation? Delete it. Your time is better spent elsewhere.
Can autonomous testing platforms help with non-determinism?
Yes, in two ways. First, autonomous platforms handle timing and state isolation by design, eliminating common sources of non-determinism. Second, for genuinely non-deterministic systems (AI features, randomized content), vision-based testing can verify that outputs are reasonable without requiring exact matches.
How do I detect non-deterministic tests in my suite?
Run tests in random order with Jest's --randomize or pytest's --random-order flag. Run with parallel workers to expose shared state issues. Track tests that fail without code changes over multiple builds. If a test passes individually but fails in the full suite, or passes on retry, it's non-deterministic.
What causes flakiness in asynchronous tests?
Missing await statements, race conditions between test setup and assertions, hardcoded timeouts instead of proper waits, and tests that don't wait for network requests to complete. The fix is explicit awaiting of all async operations and using test framework utilities like waitFor instead of setTimeout.
How does Pie avoid non-determinism?
Pie runs each test in a completely isolated browser context with no shared state, eliminating the most common sources of non-determinism. For applications with genuinely variable outputs (AI features, dynamic content), vision-based verification checks what users see rather than exact values.
Why do tests pass locally but fail in CI?
Local machines have more resources, different timing, and cached state from previous runs. CI starts fresh each time, exposes race conditions that local resources mask, and runs tests in parallel. The fix is ensuring tests don't depend on local state, specific timing, or sequential execution.
13 years building mobile infrastructure at Square, Facebook, and Instacart. Payment systems, video platforms, the works. Now building the QA platform he wished existed the whole time.