Codecov Report

❌ The patch check failed because the patch coverage (0.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

```
@@            Coverage Diff             @@
##             main    #4866      +/-   ##
==========================================
- Coverage   65.66%   65.57%   -0.09%
==========================================
  Files        1153     1154       +1
  Lines      168981   169569     +588
==========================================
+ Hits       110953   111202     +249
- Misses      58028    58367     +339
==========================================
```
```rust
let utilities = self
    .server
    .utilities()
    .downcast::<FusionUtilities>()
    .expect("Can downcast to `FusionUtilities`");
let id = CommunicationId::from(device_ids);
if !utilities.initialized_comms.read().unwrap().contains(&id) {
    self.flush_queue();
    let mut initialized_comms = utilities.initialized_comms.write().unwrap();
    initialized_comms.insert(id);
}
```
Would it be possible to call `ensure_collective_init` using the inner backend?
We need to initialize the communication on the first call to a collective operation. Initializing is blocking for the server, so we need to make sure to flush right away so that other devices don't end up stuck on an initialization call from another device.
This is already handled by cubecl, but since fusion adds another layer of streams and asynchronous submits, we also needed to add some logic here to flush the fusion server.
Maybe there is a way to avoid this by design when handling collective calls? We can also chat about this offline as it can be quite complex/confusing.
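The flush-before-init pattern discussed above can be sketched in isolation. This is a hypothetical, simplified sketch (the `CollectiveState`, `ensure_init`, and `u32` id are stand-ins, not the actual `FusionUtilities` API): a read lock serves the common already-initialized path cheaply, the pending queue is flushed before the blocking initialization so other devices are not left waiting, and the membership check is repeated under the write lock because another thread may have initialized the comms in between.

```rust
use std::collections::HashSet;
use std::sync::RwLock;

// Hypothetical stand-in for the fusion server's shared collective state.
struct CollectiveState {
    initialized_comms: RwLock<HashSet<u32>>,
}

impl CollectiveState {
    fn new() -> Self {
        Self {
            initialized_comms: RwLock::new(HashSet::new()),
        }
    }

    /// Returns `true` when this call performed the (blocking) initialization.
    fn ensure_init(&self, id: u32, flush: impl FnOnce()) -> bool {
        // Fast path: a read lock keeps already-initialized calls cheap.
        if self.initialized_comms.read().unwrap().contains(&id) {
            return false;
        }
        // Flush queued work first, so other devices don't sit stuck behind
        // this device while it blocks inside the initialization call.
        flush();
        // Re-check under the write lock: another thread may have initialized
        // between dropping the read lock and acquiring the write lock.
        // `HashSet::insert` returns `true` only if the id was newly inserted.
        self.initialized_comms.write().unwrap().insert(id)
    }
}

fn main() {
    let state = CollectiveState::new();
    let first = state.ensure_init(0, || println!("flushing queue"));
    let second = state.ensure_init(0, || println!("flushing queue"));
    println!("first={first} second={second}");
}
```

Running this prints `flushing queue` once, then `first=true second=false`: the second call takes the read-lock fast path and never flushes. The window between the read and write locks is exactly why the re-check (or an up-front write lock) is needed.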
Pull Request Template

Checklist
- The `cargo run-checks` command has been executed.

Related Issues/PRs
- Depends on tracel-ai/cubecl#1304

Changes
- `to_client` api

Testing
- Unit tests + text-cla + benchmarks
