Rewrite CI documentation

Jakub Beránek 2024-05-06 13:59:56 +02:00 committed by Jakub Beránek
parent c2eb5560d2
commit 265c59a42c
1 changed files with 241 additions and 48 deletions


@ -1,54 +1,97 @@
# Testing with CI
## Testing infrastructure
The primary goal of our CI system is to ensure that the `master` branch of `rust-lang/rust` is always in a valid state and passes our test suite.
From a high-level point of view, when you open a pull request at `rust-lang/rust`,
the following will happen:
- A small [subset](#pull-request-builds) of tests and checks are run after each push to the PR. This should help catch common errors.
- When the PR is approved, the [bors] bot enqueues the PR into a [merge queue].
- Once the PR gets to the front of the queue, bors will create a merge commit
and run the [full test suite](#auto-builds) on it. The merge commit either contains only one specific PR or it can be a ["rollup"](#rollups) which combines multiple PRs together, to save CI costs.
- Once the whole test suite finishes, two things can happen. Either CI fails with an error that needs to be addressed by the developer, or CI succeeds and the merge commit is then pushed to the `master` branch.
If you want to modify what gets executed on CI, see [Modifying CI jobs](#modifying-ci-jobs).
## CI workflow
<!-- date-check: may 2024 -->
When a Pull Request is opened on GitHub, [GitHub Actions] will automatically
launch a build that will run all tests on some configurations
(x86_64-gnu-llvm-X linux, x86_64-gnu-tools linux, mingw-check linux and mingw-check-tidy linux).
In essence, each runs `./x test` with various different options.
Our CI is primarily executed on [GitHub Actions], with a single workflow defined in [`.github/workflows/ci.yml`], which contains a bunch of steps that are unified for all CI jobs that we execute. When a commit is pushed to a corresponding branch or a PR, the workflow executes the [`calculate-job-matrix.py`] script, which dynamically generates the specific CI jobs that should be executed. This script uses the [`jobs.yml`] file as an input, which contains a declarative configuration of all our CI jobs.
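The dynamic job selection described above can be pictured with a small sketch. This is not the real `calculate-job-matrix.py`; the `JOBS` dictionary is a hypothetical, heavily simplified stand-in for `jobs.yml`, and the `image` key, the function names and the event/ref mapping are assumptions made only for illustration.

```python
# Hypothetical, simplified stand-in for the declarative contents of jobs.yml.
# The real file has many more jobs and fields; the "image" key is an assumption.
JOBS = {
    "pr": [{"image": "mingw-check"}, {"image": "x86_64-gnu-tools"}],
    "try": [{"image": "dist-x86_64-linux"}],
    "auto": [{"image": "dist-x86_64-linux"}, {"image": "x86_64-msvc"}],
}


def select_job_section(event_name: str, ref: str) -> str:
    """Map the triggering event to one of the three kinds of builds.

    Assumption: PR pushes arrive as `pull_request` events, while bors pushes
    merge commits to the `try` and `auto` branches.
    """
    if event_name == "pull_request":
        return "pr"
    if event_name == "push" and ref == "refs/heads/try":
        return "try"
    if event_name == "push" and ref == "refs/heads/auto":
        return "auto"
    raise ValueError(f"no CI jobs are defined for {event_name} on {ref}")


def calculate_jobs(event_name: str, ref: str, jobs: dict = JOBS) -> list:
    """Return the list of job definitions that should run for this event."""
    return list(jobs[select_job_section(event_name, ref)])


if __name__ == "__main__":
    for job in calculate_jobs("push", "refs/heads/auto"):
        print(job["image"])
```

Keeping the job list in a declarative file and the selection logic in a script means that adding or moving a job only requires editing data, not the workflow itself.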
The integration bot [bors] is used for coordinating merges to the master branch.
When a PR is approved, it goes into a [queue] where merges are tested one at a
time on a wide set of platforms using GitHub Actions. Due to the limit on the
number of parallel jobs, we run the main CI jobs under the [rust-lang-ci] organization
(in contrast to PR CI jobs, which run under `rust-lang` directly).
Most platforms only run the build steps, some run a restricted set of tests,
only a subset run the full suite of tests (see Rust's [platform tiers]).
> Almost all build steps shell out to separate scripts. This keeps the CI fairly platform independent (i.e., we are not overly reliant on GitHub Actions). GitHub Actions is only relied on for bootstrapping the CI process and for orchestrating the scripts that drive the process.
If everything passes, then all the distribution artifacts that were
generated during the CI run are published.
In essence, all CI jobs run `./x test`, `./x dist` or some other command with different configurations, across various operating systems, targets and platforms. There are two broad categories of jobs that are executed, `dist` and non-`dist` jobs.
- Dist jobs build a full release of the compiler for a specific platform, including
all the tools we ship through rustup; Those builds are then uploaded to the
`rust-lang-ci2` S3 bucket and are available to be locally installed with the
[rustup-toolchain-install-master] tool; The same builds are also used for
actual releases: our release process basically consists of copying those
artifacts from `rust-lang-ci2` to the production endpoint and signing them.
- Non-dist jobs run our full test suite on the platform, and the test suite of
all the tools we ship through rustup; The amount of stuff we test depends on
the platform (for example some tests are run only on Tier 1 platforms), and
some quicker platforms are grouped together on the same builder to avoid
wasting CI resources.
Based on an input event (usually a push to a branch), we execute one of three
kinds of builds (sets of jobs).
[rustup-toolchain-install-master]: https://github.com/kennytm/rustup-toolchain-install-master
### Pull Request builds
After each push to a pull request, a set of `pr` jobs are executed. Currently, these execute the
`x86_64-gnu-llvm-X`, `x86_64-gnu-tools`, `mingw-check` and `mingw-check-tidy` jobs, all running on Linux. These execute a relatively short (~30 minutes) and lightweight test suite that should catch common issues. More specifically, they run a set of lints, they try to perform a cross-compile check build to Windows mingw (without producing any artifacts) and they test the compiler using a *system* version of LLVM. Unfortunately, it would take too many resources to run the full test suite for each commit on every PR.
PR jobs are defined in the `pr` section of [`jobs.yml`]. They run under the `rust-lang/rust` repository, and their results can be observed directly on the PR, in the "CI checks" section at the bottom of the PR page.
### Auto builds
Before a commit can be merged into the `master` branch, it needs to pass our complete test suite. We call this an `auto` build. This build runs tens of CI jobs that exercise various tests across operating systems and targets.
The full test suite is quite slow; it can take two hours or more until all the `auto` CI jobs finish.
Most platforms only run the build steps, some run a restricted set of tests, only a subset run the full suite of tests (see Rust's [platform tiers]).
Auto jobs are defined in the `auto` section of [`jobs.yml`]. They are executed on the `auto` branch under the `rust-lang-ci/rust` repository[^rust-lang-ci] and their results can be seen [here](https://github.com/rust-lang-ci/rust/actions), although usually you will be notified of the result by a comment made by bors on the corresponding PR.
At any given time, at most a single `auto` build is being executed. Find out more [here](#merging-prs-serially-with-bors).
[GitHub Actions]: https://github.com/rust-lang/rust/actions
[rust-lang-ci]: https://github.com/rust-lang-ci/rust/actions
[bors]: https://github.com/rust-lang/homu
[queue]: https://bors.rust-lang.org/queue/rust
[platform tiers]: https://forge.rust-lang.org/release/platform-support.html#rust-platform-support
## Using CI to test
[^rust-lang-ci]: The `auto` and `try` jobs run under the `rust-lang-ci` fork for historical reasons. This may change in the future.
In some cases, a PR may run into problems with running tests on a particular
platform or configuration.
If you can't run those tests locally, don't hesitate to use CI resources to
try out a fix.
### Try builds
As mentioned above, opening or updating a PR will only run on a small subset
of configurations.
Only when a PR is approved will it go through the full set of test configurations.
However, you can try one of those configurations in your PR before it is approved.
For example, if a Windows build fails, but you don't have access to a Windows
machine, you can try running the Windows job that failed on CI within your PR
after pushing a possible fix.
Sometimes we want to run a subset of the test suite on CI for a given PR, or build a set of compiler artifacts from that PR, without attempting to merge it. We call this a "try build". A try build is started after a user with the proper permissions posts a PR comment with the `@bors try` command.
To do this, you'll need to edit [`src/ci/github-actions/jobs.yml`]. It contains three
sections that affect which CI jobs will be executed:
- The `pr` section defines everything that will run after a push to a PR.
- The `try` section defines job(s) that are run when you ask for a try build using `@bors try`.
- The `auto` section defines the full set of tests that are run after a PR is approved and before
it is merged into the main branch.
There are several use-cases for try builds:
You can copy one of the definitions from the `auto` section to the `pr` or `try` sections.
For example, the `x86_64-msvc` job is responsible for running the 64-bit MSVC tests.
- Run a set of performance benchmarks using our [rustc-perf] benchmark suite. For this, a working compiler build is needed, which can be generated with a try build that runs the [dist-x86_64-linux] CI job, which builds an optimized version of the compiler on Linux (this job is currently executed by default when you start a try build). To create a try build and schedule it for a performance benchmark, you can use the `@bors try @rust-timer queue` command combination.
- Check the impact of the PR across the Rust ecosystem, using a [crater] run. Again, a working compiler build is needed for this, which can be produced by the [dist-x86_64-linux] CI job.
- Run a specific CI job (e.g. Windows tests) on a PR, to quickly test if it passes the test suite executed by that job. You can select which CI jobs will be executed in the try build by adding up to 10 lines containing `try-job: <name of job>` to the PR description. All such specified jobs will be executed in the try build once the `@bors try` command is used on the PR. If no try jobs are specified in this way, the jobs defined in the `try` section of [`jobs.yml`] will be executed by default.
Try jobs are defined in the `try` section of [`jobs.yml`]. They are executed on the `try` branch under the `rust-lang-ci/rust` repository[^rust-lang-ci] and their results can be seen [here](https://github.com/rust-lang-ci/rust/actions), although usually you will be notified of the result by a comment made by bors on the corresponding PR.
Multiple try builds can execute concurrently across different PRs, but inside a single PR, at most one try build can be executing at the same time.
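As a rough illustration of the `try-job: <name of job>` convention, the following sketch extracts job names from a PR description and falls back to the default try jobs when none are given. It is not the actual implementation; the function name, the regular expression and the choice to raise an error when more than 10 jobs are listed are all assumptions.

```python
import re

# Limit taken from the text above: up to 10 `try-job:` lines per PR description.
MAX_TRY_JOBS = 10

# Assumption: each directive sits on its own line, optionally indented.
TRY_JOB_PATTERN = re.compile(r"^[ \t]*try-job:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def parse_try_jobs(pr_description: str, default_jobs: list[str]) -> list[str]:
    """Return the CI jobs that a try build should run for this PR description."""
    names = TRY_JOB_PATTERN.findall(pr_description)
    if not names:
        # No explicit jobs requested: use the `try` section defaults.
        return list(default_jobs)
    if len(names) > MAX_TRY_JOBS:
        # Hypothetical behavior; the real tooling may handle this differently.
        raise ValueError(f"at most {MAX_TRY_JOBS} try jobs can be requested")
    return names


if __name__ == "__main__":
    description = "Fix a Windows-only bug.\n\ntry-job: x86_64-msvc\n"
    print(parse_try_jobs(description, ["dist-x86_64-linux"]))
```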
[rustc-perf]: https://github.com/rust-lang/rustc-perf
[crater]: https://github.com/rust-lang/crater
### Modifying CI jobs
If you want to modify what gets executed on our CI, you can simply modify the `pr`, `auto` or `try` sections of the [`jobs.yml`] file.
You can also modify what gets executed temporarily, for example to test a particular platform
or configuration that is challenging to test locally (for example, if a Windows build fails, but you don't have access to a Windows machine). Don't hesitate to use CI resources in such situations to try out a fix!
You can perform an arbitrary CI job in two ways:
- Use the [try build](#try-builds) functionality, and specify the CI jobs that you want to be
executed in try builds in your PR description.
- Modify the [`pr`](#pull-request-builds) section of `jobs.yml` to specify which CI jobs should be
executed after each push to your PR. This might be faster than repeatedly starting try builds.
To modify the jobs executed after each push to a PR, you can simply copy one of the job definitions from the `auto` section to the `pr` section. For example, the `x86_64-msvc` job is responsible for running the 64-bit MSVC tests.
You can copy it to the `pr` section to cause it to be executed after a commit is pushed to your
PR, like this:
@ -66,19 +109,169 @@ pr:
<<: *job-windows-8c
```
Then, you can commit the file and push to GitHub. GitHub Actions should launch the tests.
Then you can commit the file and push it to your PR branch on GitHub. GitHub Actions should then
execute this CI job after each push to your PR.
After you have finished your tests, don't forget to remove any changes you have made to `jobs.yml`.
If you need to make more complex modifications to CI, you will need to modify
[`.github/workflows/ci.yml`] and possibly also
[`src/ci/github-actions/calculate-job-matrix.py`].
**After you have finished your experiments, don't forget to remove any changes you have made to `jobs.yml`, if they were supposed to be temporary!**
Although you are welcome to use CI, just be conscious that this is a shared
resource with limited concurrency.
Try not to enable too many jobs at once (one or two should be sufficient in
resource with limited concurrency. Try not to enable too many jobs at once (one or two should be sufficient in
most cases).
[`src/ci/github-actions/jobs.yml`]: https://github.com/rust-lang/rust/blob/master/src/ci/github-actions/jobs.yml
## Merging PRs serially with bors
CI services usually test the last commit of a branch merged with the last
commit in `master`, and while that's great to check if the feature works in
isolation, it doesn't provide any guarantee the code is going to work once it's
merged. Breakages like these usually happen when another, incompatible PR is
merged after the build happened.
To ensure a `master` branch that works all the time, we forbid manual merges. Instead,
all PRs have to be approved through our bot, [bors] (the software behind it is
called [homu]). All the approved PRs are put [in a queue][merge queue] (sorted
by priority and creation date) and are automatically tested one at a time. If
all the builders are green, the PR is merged, otherwise the failure is recorded
and the PR will have to be approved again.
Bors doesn't interact with CI services directly, but it works by pushing the
merge commit it wants to test to specific branches (like `auto` or `try`), which
are configured to execute CI checks. Bors then detects the
outcome of the build by listening for either Commit Statuses or Check Runs.
Since the merge commit is based on the latest `master` and only one can be tested
at the same time, when the results are green, `master` is fast-forwarded to that
merge commit.
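The serial merge process described above can be sketched roughly as follows. This is not how homu is implemented; the `PullRequest` type, the `run_full_ci` callback and the exact sort direction (higher priority first, then older PRs first) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PullRequest:
    number: int
    priority: int    # assumption: a larger value means higher priority
    created_at: int  # hypothetical: creation time as a Unix timestamp


def process_queue(
    approved: list[PullRequest],
    run_full_ci: Callable[[PullRequest], bool],
) -> tuple[list[int], list[int]]:
    """Test approved PRs one at a time and report which ones were merged."""
    merged: list[int] = []
    needs_approval_again: list[int] = []
    # Sort by priority, then by creation date, as the text describes.
    queue = sorted(approved, key=lambda pr: (-pr.priority, pr.created_at))
    for pr in queue:
        # Only one merge commit is tested at any time, so a green result
        # can be fast-forwarded onto `master` without retesting.
        if run_full_ci(pr):
            merged.append(pr.number)
        else:
            needs_approval_again.append(pr.number)
    return merged, needs_approval_again


if __name__ == "__main__":
    prs = [PullRequest(1, 0, 100), PullRequest(2, 5, 200), PullRequest(3, 0, 50)]
    print(process_queue(prs, lambda pr: pr.number != 3))
```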
Unfortunately, testing a single PR at a time, combined with our long CI (~2
hours for a full run), means we can't merge too many PRs in a single day, and a
single failure greatly impacts our throughput for the day. The maximum number
of PRs we can merge in a day is around 10.
The long CI run times and the requirement for a large builder pool are largely due to the
fact that full release artifacts are built in the `dist-` builders. This is worth it
because these release artifacts:
- Allow perf testing even at a later date
- Allow bisection when bugs are discovered later
- Ensure release quality since if we're always releasing, we can catch problems early
### Rollups
Some PRs don't need the full test suite to be executed: trivial changes like
typo fixes or README improvements *shouldn't* break the build, and testing
every single one of them for 2+ hours is a big waste of time. To solve this,
we regularly create a "rollup", a PR where we merge several pending trivial
PRs so they can be tested together. Rollups are created manually by a team member
using the "create a rollup" button on the [merge queue]. The team member uses their
judgment to decide if a PR is risky or not. Rollups are the best tool we have at
the moment to keep the queue in a manageable state.
## Docker
All CI jobs, except those on macOS and Windows, are executed inside that
platform's custom [Docker container]. This has a lot of advantages for us:
- The build environment is consistent regardless of the changes of the
underlying image (switching from the trusty image to xenial was painless for
us).
- We can use ancient build environments to ensure maximum binary compatibility,
for example [using older CentOS releases][dist-x86_64-linux] on our Linux builders.
- We can avoid reinstalling tools (like QEMU or the Android emulator) every
time thanks to Docker image caching.
- Users can run the same tests in the same environment locally by just running
`src/ci/docker/run.sh image-name`, which is very helpful for debugging failures.
The docker images prefixed with `dist-` are used for building artifacts while those without that prefix run tests and checks.
We also run tests for less common architectures (mainly Tier 2 and Tier 3
platforms) in CI. Since those platforms are not x86, we either run
everything inside QEMU or just cross-compile if we don't want to run the tests
for that platform.
These builders are running on a special pool of builders set up and maintained for us by GitHub.
[Docker container]: https://github.com/rust-lang/rust/tree/master/src/ci/docker
## Caching
Our CI workflow uses various caching mechanisms, mainly for two things:
### Docker images caching
The Docker images we use to run most of the Linux-based builders take a *long*
time to fully build. To speed up the build, we cache it using [Docker registry caching],
with the intermediate artifacts being stored on [ghcr.io]. We also push the built
Docker images to ghcr, so that they can be reused by other tools (rustup) or
by developers running the Docker build locally (to speed up their build).
Since we test multiple, diverged branches (`master`, `beta` and `stable`), we
can't rely on a single cache for the images, otherwise builds on a branch would
override the cache for the others. Instead, we store the images under different
tags, identifying them with a custom hash made from the contents of all the
Dockerfiles and related scripts.
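A content-derived tag of this kind could be computed along the following lines. This is only a sketch of the idea; the choice of SHA-512, the function name and the way file names are mixed into the hash are assumptions, not a description of the actual CI scripts.

```python
import hashlib


def image_cache_tag(files: dict[str, bytes]) -> str:
    """Derive a cache tag from the names and contents of the image's inputs.

    `files` maps a relative path (e.g. a Dockerfile or a helper script) to
    its contents. Any change to any input produces a different tag.
    """
    hasher = hashlib.sha512()
    # Sort by path so that the tag does not depend on iteration order.
    for path in sorted(files):
        hasher.update(path.encode("utf-8"))
        hasher.update(b"\0")
        hasher.update(files[path])
        hasher.update(b"\0")
    return hasher.hexdigest()


if __name__ == "__main__":
    inputs = {
        "Dockerfile": b"FROM ubuntu:22.04\n",
        "scripts/setup.sh": b"#!/bin/sh\necho setup\n",
    }
    print(image_cache_tag(inputs))
```

Because the tag depends only on the inputs, diverged branches that share the same Dockerfiles reuse the same cached image, while a branch that changes them gets its own tag instead of overwriting the others.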
[ghcr.io]: https://ghcr.io/rust-lang-ci/rust-ci
[Docker registry caching]: https://docs.docker.com/build/cache/backends/registry/
### LLVM caching with sccache
We build some C/C++ code in various CI jobs, and we rely on [sccache] to cache
the intermediate LLVM artifacts. Sccache is a distributed ccache developed by
Mozilla, which can use an object storage bucket as the storage backend. In our case,
the artifacts are uploaded to an S3 bucket that we control (`rust-lang-ci-sccache2`).
[sccache]: https://github.com/mozilla/sccache
## Custom tooling around CI
Over the years, we have developed some custom tooling to improve our CI experience.
### Rust Log Analyzer to show the error message in PRs
The build logs for `rust-lang/rust` are huge, and it's not practical to find
what caused the build to fail by looking at the logs. To improve the
developer experience we developed a bot called [Rust Log Analyzer][rla] (RLA)
that receives the build logs on failure and extracts the error message
automatically, posting it on the PR.
The bot is not hardcoded to look for error strings, but was trained with a
bunch of build failures to recognize which lines are common between builds and
which are not. While the generated snippets can be weird sometimes, the bot is
pretty good at identifying the relevant lines even if it's an error we've never
seen before.
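To give a feel for the idea of separating common lines from unusual ones, here is a deliberately naive sketch. It is not RLA's actual algorithm; the function names, the exact-match comparison of lines and the `max_seen` threshold are all assumptions chosen to keep the example small.

```python
from collections import Counter


def build_line_index(previous_logs: list[str]) -> Counter:
    """Count in how many previous build logs each distinct line appeared."""
    index: Counter = Counter()
    for log in previous_logs:
        # Use a set so that a line repeated within one log is counted once.
        index.update(set(log.splitlines()))
    return index


def unusual_lines(failed_log: str, index: Counter, max_seen: int = 0) -> list[str]:
    """Return non-empty lines of the failed log seen in at most `max_seen` logs."""
    return [
        line
        for line in failed_log.splitlines()
        if line.strip() and index[line] <= max_seen
    ]


if __name__ == "__main__":
    history = ["Compiling foo\nFinished", "Compiling foo\nFinished"]
    failed = "Compiling foo\nerror[E0308]: mismatched types\nFinished"
    print(unusual_lines(failed, build_line_index(history)))
```

A real tool would need to normalize timestamps, paths and hashes before comparing lines, which this sketch does not attempt.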
[rla]: https://github.com/rust-lang/rust-log-analyzer
### Toolstate to support allowed failures
The `rust-lang/rust` repo doesn't only test the compiler on its CI, but also a
variety of tools and documentation. Some documentation is pulled in via git
submodules. If we blocked merging rustc PRs on the documentation being fixed,
we would be stuck in a chicken-and-egg problem, because the documentation's CI
would not pass since updating it would need the not-yet-merged version of
rustc to test against (and we usually require CI to be passing).
To avoid the problem, submodules are allowed to fail, and their status is
recorded in [rust-toolstate]. When a submodule breaks, a bot automatically
pings the maintainers so they know about the breakage, and it records the
failure on the toolstate repository. The release process will then ignore
broken tools on nightly, removing them from the shipped nightlies.
While tool failures are allowed most of the time, they're automatically
forbidden a week before a release: we don't care if tools are broken on nightly
but they must work on beta and stable, so they also need to work on nightly a
few days before we promote nightly to beta.
More information is available in the [toolstate documentation].
[rust-toolstate]: https://rust-lang-nursery.github.io/rust-toolstate
[toolstate documentation]: ../toolstate.md
[GitHub Actions]: https://github.com/rust-lang/rust/actions
[`jobs.yml`]: https://github.com/rust-lang/rust/blob/master/src/ci/github-actions/jobs.yml
[`.github/workflows/ci.yml`]: https://github.com/rust-lang/rust/blob/master/.github/workflows/ci.yml
[`src/ci/github-actions/calculate-job-matrix.py`]: https://github.com/rust-lang/rust/blob/master/src/ci/github-actions/calculate-job-matrix.py
[`calculate-job-matrix.py`]: https://github.com/rust-lang/rust/blob/master/src/ci/github-actions/calculate-job-matrix.py
[rust-lang-ci]: https://github.com/rust-lang-ci/rust/actions
[bors]: https://github.com/bors
[homu]: https://github.com/rust-lang/homu
[merge queue]: https://bors.rust-lang.org/queue/rust
[dist-x86_64-linux]: https://github.com/rust-lang/rust/blob/master/src/ci/docker/host-x86_64/dist-x86_64-linux/Dockerfile