[CI] The Big Beautiful Build #3186
The CI for VTR was doing far more work than it needed to, which led to long CI run times of around 1.5 hours on average. Overall, the CI used 12.5 hours of compute and more than 20 GitHub runners at a time, which hurt the concurrency of runs.
This PR reduces the compute by building VTR only once for general builds and reusing that build wherever it is needed. This brings the CI compute down to 7.5 hours. In the process, I also added dependency chains to schedule the runs efficiently and keep the number of active GitHub runners below 10 (which is our current maximum).
I also fixed a bug in the way we have been using CCache. We were using the same cache for all builds, which works fine for some projects; for ours, it causes tons of cache misses when, for example, a gcc-11 build's cache is reused for a Clang-18 build. This PR makes each build's cache unique, which enables better cache hit rates. I have found that when the cache hit rate is perfect (i.e. the build is unchanged from the last run), the CI uses less than 3 hours of compute and the test portion takes only 20 minutes. When this happens, building the container actually becomes the tall pole, since it often takes a little over 30 minutes.
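To illustrate the CCache fix described above, here is a minimal sketch of the idea in Python. It is not the workflow code from this PR: the `BuildConfig` fields, the job names, and the key format are all hypothetical, chosen only to show why a single shared key lets a gcc-11 cache be handed to a Clang-18 build, while a per-build key keeps them apart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildConfig:
    """Hypothetical description of one CI build (not taken from the PR)."""
    name: str        # illustrative job name
    compiler: str    # e.g. "gcc-11" or "clang-18"
    build_type: str  # e.g. "release" or "debug"


def shared_cache_key(config: BuildConfig) -> str:
    # Behaviour described as the bug: every build maps to the same cache,
    # so a cache filled by one compiler is restored for another.
    return "ccache"


def unique_cache_key(config: BuildConfig) -> str:
    # Behaviour described as the fix: the key depends on the build itself,
    # so each build only ever restores a cache it produced.
    return f"ccache-{config.name}-{config.compiler}-{config.build_type}"


# Illustrative build matrix; the real CI matrix may differ.
builds = [
    BuildConfig("general", "gcc-11", "release"),
    BuildConfig("compat", "clang-18", "release"),
    BuildConfig("sanitized", "gcc-11", "debug"),
]

shared_keys = {shared_cache_key(b) for b in builds}
unique_keys = {unique_cache_key(b) for b in builds}

print(len(shared_keys))  # 1: all builds collide on one cache
print(len(unique_keys))  # 3: one cache per build
```

The trade-off is more cache entries to store, in exchange for each build restoring only object files compiled with its own compiler and flags.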
Some data: Prior to this change (with cache hits):
After this change (assuming the cache is not hit at all; which is exceedingly rare):
However, it is more likely that the cache will hit for a prior run (even from another CI run from a different branch). When the cache is perfectly hit (from a prior run from the same branch):
I will need to merge this into master to get an average (I need a field test), but I expect the average CI run time to be around 30-45 minutes (at least 2x faster!). Future work is to speed up the container build if possible (maybe we can use caching within it, but I am not sure).
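As a quick sanity check on the figures quoted in this thread, the arithmetic can be sketched as follows. The inputs are the numbers stated above (12.5 and 7.5 hours of compute, a 1.5 hour average run, and an expected 30-45 minute run); the variable names are only illustrative, and the 30-45 minute figure is an expectation, not a measurement.

```python
# Compute hours per CI run, as quoted in the PR description.
compute_before_h = 12.5
compute_after_h = 7.5  # assumes the cache is not hit at all

# Fraction of compute saved even in the no-cache-hit case.
compute_saving = 1 - compute_after_h / compute_before_h

# Wall-clock time: 1.5 hours before, 30-45 minutes expected after.
wall_before_min = 90
wall_after_min = (30, 45)
speedups = tuple(wall_before_min / m for m in wall_after_min)

print(f"compute saved: {compute_saving:.0%}")
print(f"speedup range: {min(speedups):.1f}x to {max(speedups):.1f}x")
```

With these inputs, the compute saving is 40% and the expected wall-clock speedup falls between 2x and 3x, which is consistent with the "2x faster" estimate at the conservative end.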
@amin1377 @AmirhosseinPoolad @vaughnbetz TL;DR: I think I was able to reduce the end-to-end run time of the CI to around 40 minutes (from 1.5 hours) and greatly reduce the number of concurrent machines required. Please review when you have a moment. I would like to merge this into master and see how it works in the field.
Looks great -- thanks @AlexandreSinger!
Are the four pending tests ones that will never fire given the refactoring? If so, they should be removed from the test requirements.
Hi Vaughn, yes! They were renamed. I have removed them from the branch protection. Once things stabilize, I will update the branch protection rules to include the new tests. I merged this in to get the CI moved over sooner. We can decide what to do about the coverity scan later. I think keeping it around weekly is fine, but I agree: if it's not doing anything, it should be removed.
Great, thanks.
Addendum: I also moved the coverity scan into the weekly CI run since it has never tripped in my time working on VTR and it adds a fixed 30 minutes to every CI run we do.