Anti-flake protection

What it is

Some CI jobs fail for reasons unrelated to a PR's code change, such as flaky tests or a CI runner disconnecting. These failures usually clear when the CI job is rerun. If a second PR that depends on the first passes, it is very likely that the first PR was good and simply hit a transient failure.

Trunk Merge Queue can use the combination of Optimistic Merging and Pending Failure Depth to merge pull requests that would otherwise be rejected from the queue.

The example below walks through this anti-flake protection:

Anti-flake protection with optimistic merging + pending failure depth
What's happening                                                  Queue
A, B, C begin predictive testing                                  main <- A <- B+a <- C+ba
B fails testing                                                   main <- A <- B+a <- C+ba
Pending Failure Depth keeps B from being evicted while C tests    main <- A <- B+a (hold) <- C+ba
C passes                                                          main <- A <- B+a <- C+ba
Optimistic Merging allows A, B, C to merge                        merge A B C

Optimistic Merging only takes effect when Pending Failure Depth is set to a value greater than zero. When it is zero or disabled, Merge Queue will not hold any PRs with failed tests in the queue.
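
The interaction between the two settings can be sketched as a small model. This is an illustrative simplification, not Trunk's implementation: the names `QueuedPR` and `resolve_queue` are hypothetical, each PR carries a single pre-computed test result, and the model does not re-test PRs behind an evicted one the way a real queue would.

```python
from dataclasses import dataclass


@dataclass
class QueuedPR:
    name: str
    passed: bool  # result of this PR's predictive test run


def resolve_queue(queue, pending_failure_depth, optimistic_merging=True):
    """Return (merged, evicted) PR names for a queue ordered front to back.

    Assumption: each PR is tested on top of every PR ahead of it, so a
    passing PR vouches for the PRs ahead of it when optimistic merging is on.
    """
    merged, evicted = [], []
    i = 0
    while i < len(queue):
        pr = queue[i]
        if pr.passed:
            merged.append(pr.name)
            i += 1
            continue

        # Failed PR: hold it while up to `pending_failure_depth` PRs
        # behind it finish testing, and look for one that passed.
        rescuer = None
        if optimistic_merging and pending_failure_depth > 0:
            window = queue[i + 1 : i + 1 + pending_failure_depth]
            for offset, later in enumerate(window, start=1):
                if later.passed:
                    rescuer = i + offset
                    break

        if rescuer is None:
            # Nothing behind it passed within the allowed depth: evict.
            evicted.append(pr.name)
            i += 1
        else:
            # A later PR passed with this PR's changes included,
            # so merge everything up to and including that PR.
            merged.extend(p.name for p in queue[i : rescuer + 1])
            i = rescuer + 1
    return merged, evicted


queue = [QueuedPR("A", True), QueuedPR("B", False), QueuedPR("C", True)]
print(resolve_queue(queue, pending_failure_depth=1))
print(resolve_queue(queue, pending_failure_depth=0))
```

With a depth of 1, B is held until C passes and all three PRs merge, matching the walkthrough above. With a depth of 0, B is evicted as soon as it fails.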

Why use it

  • Eliminate false negatives - Flaky tests cause 20-40% of PR failures in typical pipelines. Anti-flake protection helps get these under control, so developers don't waste time investigating non-issues.

  • Maintain developer confidence - When the queue rejects PRs only for real reasons (not flaky tests), developers trust the system and are less likely to dismiss real failures as "probably just flaky."

  • Reduce manual retries - Developers don't need to manually resubmit PRs or click "retry" when tests flake. Trunk handles it automatically, saving time and frustration.

  • Keep queue moving - Flaky tests don't stall the queue. PRs that would have been blocked by transient failures merge successfully, increasing overall throughput.

How to enable

Anti-flake protection is active when Optimistic Merge Queue is toggled On and Pending Failure Depth is set to a value greater than zero.

Enable Optimistic Merging in Settings > Repositories > your repository > Merge Queue > toggle Optimistic Merge Queue to On.

Configure Pending Failure Depth in Settings > Repositories > your repository > Merge Queue > select a value from the Pending Failure Depth dropdown.

Tradeoffs and considerations

What you gain

  • 80-90% reduction in flaky test blocks - Most flaky failures are caught and handled automatically

  • Developer time saved - No manual retries or investigation of flaky failures

  • Higher queue throughput - Flaky tests don't stall the queue

  • Better developer experience - Less frustration with non-deterministic failures

What you give up or risk

  • Increased CI cost - Retrying tests costs additional CI resources (typically 10-20% increase)

  • Slightly longer merge times - PRs that fail then retry take longer than PRs that pass first time

  • Potential false positives - Occasionally a legitimate failure might be retried (though Trunk is conservative)

  • Masks underlying problems - Flaky tests indicate test quality issues; retrying treats the symptom, not the cause
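
As a rough back-of-the-envelope check on the CI cost figure, the sketch below assumes each retry reruns the full suite at the same cost as the first run. The function name and its parameters are hypothetical, and real costs depend on how your CI provider bills and which jobs are rerun.

```python
def expected_ci_cost_increase(retried_fraction, retries_per_flaky_pr=1):
    """Estimate the fractional increase in CI spend (0.10 means 10% more).

    retried_fraction: share of PR test runs that hit a transient failure.
    retries_per_flaky_pr: extra full runs spent on each of those PRs.
    Assumption: a retry costs the same as the original run.
    """
    if not 0.0 <= retried_fraction <= 1.0:
        raise ValueError("retried_fraction must be between 0 and 1")
    if retries_per_flaky_pr < 0:
        raise ValueError("retries_per_flaky_pr must not be negative")
    return retried_fraction * retries_per_flaky_pr


for fraction in (0.05, 0.10, 0.20):
    increase = expected_ci_cost_increase(fraction)
    print(f"{fraction:.0%} of runs retried once: about {increase:.0%} more CI spend")
```

Under this model, a single retry doubles the cost of the affected PR but raises overall spend only by the share of PRs affected.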

When NOT to use anti-flake protection

Don't enable anti-flake protection if:

  • Your tests are not flaky (< 2% flake rate) - No benefit, only cost

  • CI resources are extremely limited - Retries double test costs for flaky PRs

  • You're actively fixing flaky tests - Better to fix than to mask

  • Flaky tests indicate real issues - Sometimes "flaky" failures reveal race conditions or timing issues in your code

When to use anti-flake protection

Do enable anti-flake protection when:

  • Flaky tests are blocking PRs (5-15% flake rate) - Clear benefit outweighs cost

  • Fixing flaky tests will take time - Use this as an interim solution while improving test quality

  • Infrastructure flakiness - Network timeouts, resource contention you can't control

  • Third-party dependencies are flaky - External APIs or services cause transient failures

The right long-term solution

The right approach:

  1. Enable anti-flake protection - Unblock your team immediately

  2. Identify flaky tests - Use CI analytics to find which tests flake most

  3. Fix the root causes - Make tests deterministic, add retries at test level, improve infrastructure

  4. Reduce flake rate over time - Goal should be < 2% flake rate

  5. Consider disabling - Once tests are stable, anti-flake protection becomes unnecessary

Red flags indicating systemic issues:

  • Flake rate > 20% (your tests are broken)

  • Same tests flake repeatedly (specific tests need fixing)

  • All flakes are in one area (infrastructure or test framework issue)
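
To identify which tests flake most and compare them against the thresholds above, a minimal sketch could look like the following. The record shape `(test_name, commit_sha, passed)` is a hypothetical export from your CI system, and the definition of a flake used here (a test that both passed and failed on the same commit) is an assumption, not Trunk's detection logic.

```python
from collections import defaultdict


def flake_rates(runs):
    """Compute a per-test flake rate from (test_name, commit_sha, passed) records.

    Assumption: a test "flaked" on a commit if it both passed and failed
    on that same commit. The rate is flaky commits / commits where it ran.
    """
    outcomes = defaultdict(set)
    for test, sha, passed in runs:
        outcomes[(test, sha)].add(passed)

    commits_seen = defaultdict(int)
    flaky_commits = defaultdict(int)
    for (test, _sha), seen in outcomes.items():
        commits_seen[test] += 1
        if seen == {True, False}:
            flaky_commits[test] += 1
    return {test: flaky_commits[test] / commits_seen[test] for test in commits_seen}


def classify(rate):
    """Map a flake rate onto the thresholds mentioned in this document."""
    if rate > 0.20:
        return "broken"        # red flag: above 20%
    if rate < 0.02:
        return "stable"        # below the 2% goal
    if 0.05 <= rate <= 0.15:
        return "benefit"       # range where anti-flake protection clearly helps
    return "no guidance"       # the document gives no explicit advice here


runs = [
    ("test_login", "a1", True),
    ("test_login", "a1", False),  # same commit, different result: a flake
    ("test_login", "b2", True),
    ("test_cart", "a1", True),
    ("test_cart", "b2", True),
]
for test, rate in sorted(flake_rates(runs).items(), key=lambda kv: -kv[1]):
    print(test, f"{rate:.0%}", classify(rate))
```

Sorting by rate puts the worst offenders first, which helps distinguish a few specific tests that need fixing from a broader infrastructure problem.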

Common misconceptions

  • Misconception: "Anti-flake protection lets me ignore flaky tests"

    • Reality: NO! This is a temporary solution. Flaky tests are a code/test quality problem that must be fixed. Anti-flake protection buys you time to fix them properly.

  • Misconception: "It retries all failures automatically"

    • Reality: Trunk is selective. Only failures that match flaky patterns are retried. Legitimate failures still block PRs immediately.

  • Misconception: "Anti-flake protection wastes tons of CI resources"

    • Reality: Typical cost increase is 10-20% for teams with moderate flake rates. This is far less than the developer time wasted investigating flaky failures.

  • Misconception: "I should set retry limit to 10 to catch all flakes"

    • Reality: If you need 10 retries, your tests are catastrophically broken. Fix the tests! Retry limit should be 1-3 max.

Next Steps

If you have a lot of flaky tests in your projects, you should track and fix them with Trunk Flaky Tests. Anti-flake protection helps reduce the impact of flaky tests but doesn't help you detect, track, and eliminate them.
