Flaky Test Detection
Learn how Trunk detects and labels flaky tests
Trunk Flaky Tests detects flaky tests by analyzing test results uploaded from your CI jobs, and displays the health of your test suite in the Flaky Tests dashboard. This page covers how flaky tests are detected and how they're labeled after Trunk receives uploaded test results.
You can learn more about how tests are uploaded to Trunk in the Get Started docs, and about how detection results are displayed in the Dashboard docs.
Trunk typically requires 10+ runs per test on CI to start accurately detecting flaky tests. For example, detecting a flaky test that fails 25% of the time takes 9 runs to achieve 90% confidence in having seen it flake. Depending on the repository’s velocity, this could take hours or days.
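The arithmetic behind that example treats runs as independent: the chance of observing at least one failure in n runs of a test that fails with probability p is 1 - (1 - p)^n. Below is a minimal sketch of that calculation in Python. It is illustrative only, not Trunk's actual detection model, and the function name is hypothetical.

```python
import math

def runs_for_confidence(fail_rate: float, confidence: float) -> int:
    """Minimum independent runs needed to observe at least one failure
    from a test that fails with probability `fail_rate`."""
    # Solve 1 - (1 - fail_rate) ** n >= confidence for n.
    return math.ceil(math.log(1 - confidence) / math.log(1 - fail_rate))

print(runs_for_confidence(0.25, 0.90))  # -> 9, matching the example above
```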
Each new upload is processed and compared with historical test results to detect flaky tests, and Trunk emphasizes each result differently depending on which branch it was run on. Detection is an asynchronous process; it may take up to an hour for an upload's results to be reflected in the dashboard.
Trunk analyzes test failures based on the context in which they are run. A test failing on main has a different impact on flake detection than a test failing on a pull request. After tests are uploaded to Trunk, they're analyzed based on different rules depending on which branch they were run on.
Uploading all test results from your repository will result in the fastest and most accurate detection. Trunk relies on test results from main, pull requests, and (if you use one) merge queues.
Trunk detects flaky tests with the assumption that automated tests should be passing before being merged into stable branches like main. This means failures on main are unexpected and indicate flakiness or a broken test.
Stable branches are sometimes referred to as protected or default branches.
Flaky Tests will look for main to use as a stable branch by default. You can override the default selection and set a custom stable branch, for example, master or develop.
It is important to set your stable branch correctly to ensure fast and accurate detection of flaky tests.
Flaky Tests users with the administrator role can update the current stable branch in the repository settings:
1. Click on your profile and open Settings.
2. Click on your repository in the left nav.
3. Update the Override Default Stable Branch setting with the name of your stable branch.
Changing the stable branch will not rebuild your test history; the change applies only to new test runs.
Flaky Tests will require additional CI runs on the updated stable branch to detect test flakes.
Tests run on pull requests are expected to fail while changes are still in progress, so individual PR failures are not directly used to detect flaky tests.
Flaky tests produce inconsistent results even when run on the same code with the same inputs. Pull requests are where this behavior shows up most often: an engineer opens a pull request, sees a test fail, re-runs the job, and sees the test pass. If a test produces different results on the same git commit, that is, on identical code, Trunk considers that test flaky.
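As a rough illustration of that same-commit rule, the sketch below groups uploaded results by test and commit and flags tests with mixed outcomes. All names here are hypothetical; this is a conceptual sketch, not Trunk's implementation.

```python
from collections import defaultdict

def find_same_commit_flakes(results):
    """results: iterable of (test_id, commit_sha, passed) tuples.
    Returns the tests that both passed and failed on the same commit."""
    outcomes = defaultdict(set)
    for test_id, commit_sha, passed in results:
        outcomes[(test_id, commit_sha)].add(passed)
    # Seeing both True and False on one commit means the test produced
    # different results on identical code -- the flaky signature above.
    return {test_id for (test_id, _), seen in outcomes.items() if len(seen) == 2}
```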
Merge queues use temporary branches to test changes again before merging into main. Failures on merge queue branches are unexpected and are used as a signal when detecting flaky tests. Trunk currently auto-detects merge queue CI jobs from Trunk Merge Queues, GitHub Merge Queues, GitLab Merge Trains, and Graphite Merge Queues.
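Combined with the stable-branch rule above, the branch-context logic can be sketched roughly as follows. This is a hypothetical simplification: the branch prefix shown is the GitHub Merge Queue naming convention, and real detection covers the other providers listed above.

```python
# Hypothetical sketch: which failures count as unexpected-failure signals.
MERGE_QUEUE_PREFIXES = ("gh-readonly-queue/",)  # GitHub Merge Queue convention

def failure_is_flake_signal(branch: str, passed: bool, stable_branch: str = "main") -> bool:
    """Failures on the stable branch or a merge queue branch are unexpected;
    failures on ordinary PR branches are not used directly."""
    if passed:
        return False
    return branch == stable_branch or branch.startswith(MERGE_QUEUE_PREFIXES)
```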
Expect test results for individual PRs to be reflected in PR Test Summaries within 15 minutes of upload, and all other metrics to be up to date within an hour.
Trunk classifies all tests into one of three categories based on the history of each test:
Flaky
This test is not deterministic. Given the same inputs, the test will occasionally produce different outputs. This means you cannot trust the results of these tests.
Broken
This test is reproducible but always fails. Tests that always fail provide no useful signal and should be fixed.
Healthy
This test is reproducible and passes consistently. Given the same inputs, the test will produce the same passing results.
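One simplified way to think about the three labels, expressed over a single test's pass/fail history on the same code, is sketched below. This is illustrative only; Trunk's actual detection also weighs branch context and recency, as described above.

```python
def classify(history: list[bool]) -> str:
    """history: non-empty pass/fail outcomes for one test on the same code,
    True for pass."""
    if all(history):
        return "healthy"  # reproducible: same inputs, same passing output
    if not any(history):
        return "broken"   # reproducible, but always failing
    return "flaky"        # mixed results on the same code: not deterministic
```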
If you have not set up your CI jobs to upload results to Trunk, follow the Get Started docs to start uploading test results to Trunk.
If you're curious about why certain tests are labeled as flaky, you can visit each test's status history. Learn more about Status History.