Anti-flake protection

What it is

Some CI jobs fail for reasons unrelated to a PR's code change, such as due to flaky tests or a CI runner disconnecting. These failures are usually cleared when the CI job is rerun. If a second PR that depends on the first does pass, it is very likely that the first PR was good and simply experienced a transient failure.

Trunk Merge Queue can use the combination of Optimistic Merging and Pending Failure Depth to merge pull requests that would otherwise be rejected from the queue.

If you have a lot of flaky tests in your projects, you should track and fix them with Trunk Flaky Tests. Anti-flake protection helps reduce the impact of flaky tests but doesn't help you detect, track, and eliminate them.

In the video below, you can see an example of this anti-flake protection:

what's happening?

queue

A, B, C begin predictive testing

main <- A <- B+a <- C+ba

B fails testing

main <- A <- B+a <- C+ba

predictive failure depth keeps B from being evicted while C tests

main <- A <- B+a (hold) <- C+ba

C passes

main <- A <- B+a <- C+ba

optimistic merging allows A, B, C to merge

merge A B C

Optimistic Merging only works when the Pending Failure Depth is set to a value greater than zero. When zero or disabled, Merge will not hold any failed tests in the queue.

Why use it

Eliminate false negatives - Flaky tests cause 20-40% of PR failures in typical pipelines. Anti-flake protection helps get these under control, so developers don't waste time investigating non-issues.
Maintain developer confidence - When the queue rejects PRs for real reasons (not flaky tests), developers trust the system. Reduces "it's probably just flaky" dismissiveness of real failures.
Reduce manual retries - Developers don't need to manually resubmit PRs or click "retry" when tests flake. Trunk handles it automatically, saving time and frustration.
Keep queue moving - Flaky tests don't stall the queue. PRs that would have been blocked by transient failures merge successfully, increasing overall throughput.

How to enable

Anti Flake Protection is active when Optimistic Merge Queue is On and Pending Failure Depth is set to a value greater than zero

Enable Optimistic merging in Settings > Repositories > your repository > Merge Queue > toggle On Optimistic Merge Queue.

Configure Pending Failure Depth in Settings > Repositories > your repository > Merge Queue > select a value from the Pending Failure Depth dropdown.

Tradeoffs and considerations

What you gain

80-90% reduction in flaky test blocks - Most flaky failures are caught and handled automatically
Developer time saved - No manual retries or investigation of flaky failures
Higher queue throughput - Flaky tests don't stall the queue
Better developer experience - Less frustration with non-deterministic failures

What you give up or risk

Increased CI cost - Retrying tests costs additional CI resources (typically 10-20% increase)
Slightly longer merge times - PRs that fail then retry take longer than PRs that pass first time
Potential false positives - Occasionally a legitimate failure might be retried (though Trunk is conservative)
Masks underlying problems - Flaky tests indicate test quality issues; retrying treats symptom, not cause

When NOT to use anti-flake protection

Don't enable anti-flake protection if:

Your tests are not flaky (< 2% flake rate) - No benefit, only cost
CI resources are extremely limited - Retries double test costs for flaky PRs
You're actively fixing flaky tests - Better to fix than to mask
Flaky tests indicate real issues - Sometimes "flaky" failures reveal race conditions or timing issues in your code

When to use anti-flake protection

Do enable anti-flake protection when:

Flaky tests are blocking PRs (5-15% flake rate) - Clear benefit outweighs cost
Fixing flaky tests will take time - Use this as interim solution while improving test quality
Infrastructure flakiness - Network timeouts, resource contention you can't control
Third-party dependencies are flaky - External APIs or services cause transient failures

The right long-term solution

️ Anti-flake protection is a band-aid, not a cure.

The right approach:

Enable anti-flake protection - Unblock your team immediately
Identify flaky tests - Use CI analytics to find which tests flake most
Fix the root causes - Make tests deterministic, add retries at test level, improve infrastructure
Reduce flake rate over time - Goal should be < 2% flake rate
Consider disabling - Once tests are stable, anti-flake protection becomes unnecessary

Red flags indicating systemic issues:

Flake rate > 20% (your tests are broken)
Same tests flake repeatedly (specific tests need fixing)
All flakes are in one area (infrastructure or test framework issue)

Common misconceptions

Misconception: "Anti-flake protection lets me ignore flaky tests"
- Reality: NO! This is a temporary solution. Flaky tests are a code/test quality problem that must be fixed. Anti-flake protection buys you time to fix them properly.
Misconception: "It retries all failures automatically"
- Reality: Trunk is selective. Only failures that match flaky patterns are retried. Legitimate failures still block PRs immediately.
Misconception: "Anti-flake protection wastes tons of CI resources"
- Reality: Typical cost increase is 10-20% for teams with moderate flake rates. This is far less than the developer time wasted investigating flaky failures.
Misconception: "I should set retry limit to 10 to catch all flakes"
- Reality: If you need 10 retries, your tests are catastrophically broken. Fix the tests! Retry limit should be 1-3 max.

Next Steps

Last updated 2 months ago

hashtagWhat it is

hashtagWhy use it

hashtagHow to enable

hashtagTradeoffs and considerations

hashtagWhat you gain

hashtagWhat you give up or risk

hashtagWhen NOT to use anti-flake protection

hashtagWhen to use anti-flake protection

hashtagThe right long-term solution

hashtagCommon misconceptions

hashtagNext Steps

What it is

Why use it

How to enable

Tradeoffs and considerations

What you gain

What you give up or risk

When NOT to use anti-flake protection

When to use anti-flake protection

The right long-term solution

Common misconceptions

Next Steps