Skip to content

[CI] The Big Beautiful Build #3186

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 4, 2025
Merged

Conversation

AlexandreSinger
Copy link
Contributor

@AlexandreSinger AlexandreSinger commented Jul 3, 2025

The CI for VTR was doing far more work than it needed to, which was
leading to long CI run times of around 1.5 hours on average. Overall,
the CI used 12.5 hours of compute and used more than 20 GitHub runners
at a time which hurt concurrency of runs.

The PR reduces the compute by only building VTR once for general builds
and using that build throughout when needed. This reduces the CI compute
down to 7.5 hours. In this process, I also added dependency chains to
try and schedule the runs efficiently to keep the number of active
GitHub runners below 10 (which is our current maximum).

I also fixed a bug with the way we have been using CCache. We were using
the same cache for all builds, which works fine for some projects; but
for ours that causes tons of cache misses when a gcc-11 build cache is
used for a Clang-18 cache for example. This PR makes each build's cache
unique, which enables better cache hit rates. I have found that when the
cache hit rate is perfect (i.e. the build is unchanged from the last
run), the CI uses less than 3 hours of compute and the test portion only
takes 20 minutes. When this happens, building the container actually
becomes the tall-pole since it often takes a little over 30 minutes.

Addendum: I also moved the coverity scan into the weekly CI run since it has never tripped in my time working on VTR and it adds a fixed 30 minutes to every CI run we do.

@github-actions github-actions bot added infra Project Infrastructure lang-shell Shell scripts (bash etc.) labels Jul 3, 2025
@AlexandreSinger AlexandreSinger force-pushed the feature-ci-build-cleanup branch 3 times, most recently from c87e68b to 420e318 Compare July 4, 2025 00:26
@github-actions github-actions bot added the scripts Utility & Infrastructure scripts label Jul 4, 2025
@AlexandreSinger AlexandreSinger force-pushed the feature-ci-build-cleanup branch 2 times, most recently from c9c4d75 to c931c29 Compare July 4, 2025 15:58
The CI for VTR was doing far more work than it needed to, which was
leading to long CI run times of around 1.5 hours on average. Overall,
the CI used 12.5 hours of compute and used more than 20 GitHub runners
at a time which hurt concurrency of runs.

The PR reduces the compute by only building VTR once for general builds
and using that build throughout when needed. This reduces the CI compute
down to 7.5 hours. In this process, I also added dependency chains to
try and schedule the runs efficiently to keep the number of active
GitHub runners below 10 (which is our current maximum).

I also fixed a bug with the way we have been using CCache. We were using
the same cache for all builds, which works fine for some projects; but
for ours that causes tons of cache misses when a gcc-11 build cache is
used for a Clang-18 cache for example. This PR makes each build's cache
unique, which enables better cache hit rates. I have found that when the
cache hit rate is perfect (i.e. the build is unchanged from the last
run), the CI uses less than 3 hours of compute and the test portion only
takes 20 minutes. When this happens, building the container actually
becomes the tall-pole since it often takes a little over 30 minutes.
@AlexandreSinger AlexandreSinger force-pushed the feature-ci-build-cleanup branch from 243a32b to 9620cce Compare July 4, 2025 17:41
@AlexandreSinger
Copy link
Contributor Author

AlexandreSinger commented Jul 4, 2025

Some data:

Prior to this change (with cache hits):

  • The average CI run time is 1.5 hours
  • The total compute for the test workflow was 12.5 hours (with containers taking 0.5 hours)
  • The slowest test was the BuildVariations test which took 47 minutes.

After this change (assuming the cache is not hit at all; which is exceedingly rare):

  • The CI run time is 1 hour
  • The total compute for the test workflow is 7.5 hours (with containers still taking 0.5 hours)
  • The slowest test was still BuildVariations which takes 54 minutes without a cache hit.

However, it is more likely that the cache will hit for a prior run (even from another CI run from a different branch). When the cache is perfectly hit (from a prior run from the same branch):

  • The CI run time is a little over 30 minutes
  • The total compute for the test workflow is 3 hours (with containers still taking 0.5 hours)
  • The slowest test is now the sanitized build that takes 20 minutes; however the containers test takes 30 minutes.

I will need to merge this into master to get an average (I need a field test), but I expect the average CI run time to be around 30-45 minutes (2x faster!).

Future work is to speed up the container build if possible (maybe we can using caching within, but I am not sure).

@AlexandreSinger AlexandreSinger changed the title [WIP][CI] Combined VTR Builds [CI] The Big Beautiful Build Jul 4, 2025
@AlexandreSinger
Copy link
Contributor Author

@amin1377 @AmirhosseinPoolad @vaughnbetz TL;DR, I think I was able to reduce the end to end run time of the CI to around 40 minutes (from 1.5 hours) and reduce the number of concurrent machines required immensely. Please review when you have a moment. I would like to merge this into master and see how it ends up working in the field.

@AlexandreSinger
Copy link
Contributor Author

Its dependency chain also looks cooler now:

image

VS. Just being flat.

@vaughnbetz
Copy link
Contributor

Looks great -- thanks @AlexandreSinger !
I wonder if coverity scan is working. Coverity is basically super-warnings (static code checker). We're using their free version, but if it never trips it may mean something is broken in the link to it and we aren't really running it.
Coverity isn't essential; with the warning levels turned way up on a lot of compilers we probably overlap most of what it checks already. If it isn't really running we should turn it off.

@vaughnbetz
Copy link
Contributor

Are the four pending tests ones that will never fire given the refactoring? If so, they should be removed from the test requirements.

@AlexandreSinger AlexandreSinger merged commit a2da057 into master Jul 4, 2025
30 checks passed
@AlexandreSinger AlexandreSinger deleted the feature-ci-build-cleanup branch July 4, 2025 21:49
@AlexandreSinger
Copy link
Contributor Author

Hi Vaughn, yes! They were renamed. I have removed them from the branch protection. Once things stabilize I will update the branch protection rules to include the new tests.

I merged this in to get the CI moved over sooner. We can decide what to do about the coverity scan later. I think keeping it around weekly is fine, but I agree, if its not doing anything it should be removed.

@vaughnbetz
Copy link
Contributor

Great, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
infra Project Infrastructure lang-shell Shell scripts (bash etc.) scripts Utility & Infrastructure scripts
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants