Evaluation, Reproducibility, Benchmarks Meeting 42
Nicholas Heller edited this page Apr 1, 2026
·
1 revision
Date: 1st April, 2026
- Olivier
- Nick
- Rucha
- Carole
- Anne
- Michela
- Idea from Rucha
- Can we look at evaluation from the deployment side? For example, is this model suitable for a given site/population of images?
- Could do a sort of user study, where we talk to people doing deployments and try to understand what the needs are
- Important to use tools that are already there -- metrics, etc.
- Next steps -- Rucha will reach out to a member of the deployment working group to evaluate needs and make a plan
- From Olivier
- Working to identify datasets to show the utility of the CI project across different modalities -- path, radiology, surgical videos, etc.
- There are common pitfalls associated with each (failing to evaluate metrics at the patient level, for example)
- Ideally we would find benchmarks with lots of trained models available -- hopefully models will have respected official splits, but this is sometimes dubious
- Given the abundance of foundation models for 2D data, we are seeing more and more papers that sample/aggregate over 3D and expose users to these pitfalls
- Focus is on hierarchical data
- From Carole
- Got good feedback on implementation
- Still working on software paper, will plan to send draft shortly
- Targeting IEEE MLMI
- From Michela
- Becoming more available now -- will put something more concrete together for next meeting
- Rough ideas
- Looking at different datasets and how interobserver/intraobserver variance plays a role
- Questions of generalizability/idiosyncrasy of datasets
- DKFZ is building a data curation unit (per Annika) -- could be an interesting resource to connect with once they become more established
- Interesting questions around metrics used for quality
- Questions about semi-automatic ground truth labeling and biases that it introduces
- Questions about prevalence differences and representativeness of the set used for QC
- Carole going to put papers in GDrive
- Nick stepping down as secretary, Michela elected to fill the role
- Nick will share instructions with Michela offline
- Anne will also be very time-limited moving forward -- offering to step down if someone else filling her spot would be able to contribute more
- Perhaps could recruit someone from her team to ensure that we keep representation from platform side
- Next meeting to be moved to the 29th of April
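
A minimal sketch of the patient-level pitfall noted under Olivier's items, assuming toy data and hypothetical function names (`pooled_accuracy`, `patient_level_accuracy`). It is not taken from the CI project; it only illustrates how pooling samples from hierarchical data can give a different number than scoring each patient once.

```python
from collections import defaultdict
from statistics import mean


def pooled_accuracy(records):
    """Accuracy over all samples, ignoring which patient each one came from."""
    return mean(1.0 if pred == label else 0.0 for _, pred, label in records)


def patient_level_accuracy(records):
    """Mean of per-patient accuracies, so each patient counts exactly once."""
    per_patient = defaultdict(list)
    for patient_id, pred, label in records:
        per_patient[patient_id].append(1.0 if pred == label else 0.0)
    return mean(mean(hits) for hits in per_patient.values())


# Hypothetical toy data: (patient_id, prediction, label).
# Patient A contributes 8 samples (all correct), patient B only 2 (all wrong).
records = [("A", 1, 1)] * 8 + [("B", 0, 1)] * 2

pooled = pooled_accuracy(records)          # dominated by patient A's samples
patient = patient_level_accuracy(records)  # A and B weighted equally

print(f"pooled accuracy:        {pooled:.2f}")
print(f"patient-level accuracy: {patient:.2f}")
```

With these toy records the pooled score looks much better than the patient-level one, because the patient with more samples dominates the pooled average.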