Detection
Trunk Flaky Tests detects flaky tests by analyzing test results. The health of your tests is displayed in the Flaky Tests dashboard. This page covers how flaky tests are detected and how to analyze your test suite’s health using the dashboard.
Trunk typically requires 10+ runs per test on CI to start accurately detecting flaky tests. For example, a flaky test that fails 25% of the time needs 9 runs before there is at least a 90% chance of having seen it fail at least once. Depending on the repository’s velocity, collecting that many runs could take hours or days.
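The 9-run figure can be reproduced with a short calculation. The sketch below assumes each run fails independently with the same probability p, so the chance of seeing at least one failure in n runs is 1 - (1 - p)^n. The function name `runs_for_confidence` is made up for this illustration and is not part of Trunk.

```python
def runs_for_confidence(failure_rate: float, confidence: float) -> int:
    """Smallest number of runs n with 1 - (1 - failure_rate)**n >= confidence.

    Assumes every run is independent and fails with the same probability,
    which is a simplification of real CI behavior.
    """
    if not 0 < failure_rate < 1:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")

    runs = 1
    # Keep adding runs until the chance of at least one failure is high enough.
    while 1 - (1 - failure_rate) ** runs < confidence:
        runs += 1
    return runs


if __name__ == "__main__":
    print(runs_for_confidence(0.25, 0.90))
```

Under the same assumption, tests that fail less often need many more runs to reach the same confidence, which is why rarely failing tests take longer to surface.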
Trunk detects flaky tests by analyzing the test results uploaded from your CI jobs. Each new upload is processed and compared with historical test results to detect flaky tests. Trunk weights each result differently depending on which branch it was run on. Processing is asynchronous, and it may take up to an hour for an upload's results to appear in the dashboard.
If you have PR Comments enabled, you can follow the link in the PR comments to see a report for each upload.
Trunk classifies all tests into one of three categories based on the history of each test:
Flaky
This test is not deterministic. Given the same inputs, the test will occasionally produce different outputs. This means you cannot trust the results of these tests.
Broken
This test is reproducible but always fails. Tests that always fail are not useful and should be fixed.
Healthy
This test is reproducible. Given the same inputs, the test will produce the same outputs.
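As a rough illustration of the three categories, the sketch below classifies a test from a list of pass/fail outcomes. It is a deliberately simplified stand-in, not Trunk's detection logic, which also accounts for branch context and thresholds. The `Health` enum and `classify` function are hypothetical names.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    BROKEN = "broken"
    FLAKY = "flaky"


def classify(results: list[bool]) -> Health:
    """Classify a test from its recent outcomes (True means the run passed).

    Simplified rule, assumed for illustration only:
      - every run passed  -> healthy
      - every run failed  -> broken
      - a mix of both     -> flaky
    """
    if not results:
        raise ValueError("at least one result is required")
    if all(results):
        return Health.HEALTHY
    if not any(results):
        return Health.BROKEN
    return Health.FLAKY


if __name__ == "__main__":
    print(classify([True, False, True, True]).value)
```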
Trunk analyzes test failures based on the context in which they are run. A test failing on main has a different impact on flake detection than a test failing on a pull request.
Uploading all test results from your repository will result in the fastest and most accurate detection. Trunk relies on test results from main, pull requests, and (if you use one) merge queues.
Merge queues use temporary branches to test changes again before merging into main. Failures on merge queue branches are unexpected and are used as a signal when detecting flaky tests. Trunk currently auto-detects merge queue CI jobs from Trunk Merge Queues, GitHub Merge Queues, GitLab Merge Trains, and Graphite Merge Queues.
Tests that run on pull requests are expected to fail while changes are still in development, so a failure on a pull request is not, by itself, used as a signal when detecting flaky tests.
Flaky tests produce inconsistent results even when run on the same code with the same input. Pull requests are where we see this behavior most often: an engineer opens a pull request, sees a test fail, re-runs the job, and sees the test pass. We track this behavior (different results for a test on the same git commit) as a sign that a test is flaky.
Expect test results for individual PRs to be up to date for PR Test Summaries within 15 minutes and all other metrics to be up to date within an hour of a new upload.
A test’s health status transitions between broken, flaky, and healthy as new test runs with new results are uploaded to Trunk Flaky Tests. Trunk Flaky Tests determines whether a test is flaky by analyzing the results of recent runs. The process is deterministic and based on thresholds.
This means a healthy test can transition to a broken or flaky status after new results show failures. Likewise, a test previously labeled as broken or flaky can transition to a healthy status once it sees consistently passing runs.
Tests may transition between flaky, broken, and healthy states multiple times over their lifetime. You can see previous changes in the detected health status of a test under Status History, as well as an explanation for why it was detected to have a new state.
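To make the idea of a status history concrete, the sketch below reduces a chronological list of detected statuses to the points where the status changed. The function name and the plain-string statuses are assumptions for illustration; they are not how the Status History view is implemented.

```python
def status_transitions(statuses: list[str]) -> list[tuple[str, str]]:
    """Return (previous, current) pairs wherever consecutive statuses differ."""
    return [
        (previous, current)
        for previous, current in zip(statuses, statuses[1:])
        if previous != current
    ]


if __name__ == "__main__":
    history = ["healthy", "healthy", "flaky", "flaky", "healthy", "broken"]
    for previous, current in status_transitions(history):
        print(f"{previous} -> {current}")
```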