
Evaluation, Reproducibility, Benchmarks Meeting 42

Nicholas Heller edited this page Apr 1, 2026 · 1 revision

Minutes of Meeting 42

Date: 1st April, 2026

Present

  • Olivier
  • Nick
  • Rucha
  • Carole
  • Anne
  • Michela

Updates

  • Idea from Rucha
    • Can we look at evaluation from the deployment side? For example, is this model suitable for a given site or population of images?
    • Could do a sort of user study, where we talk to people doing deployments and try to understand what the needs are
    • Important to use tools that are already there -- metrics, etc.
    • Next steps -- Rucha will reach out to a member of the deployment working group to assess needs and make a plan
  • From Olivier
    • Working to identify datasets to show the utility of the CI project across different modalities -- path, radiology, surgical videos, etc.
      • There are common pitfalls associated with each (failing to evaluate metrics at the patient level, for example)
    • Ideally we would find benchmarks with lots of trained models available -- hopefully models will have respected official splits, but this is sometimes dubious
    • Given the abundance of foundation models for 2D data, we are seeing more and more papers that sample/aggregate over 3D and expose users to these pitfalls
    • Focus is on hierarchical data
  • From Carole
    • Got good feedback on implementation
    • Still working on the software paper, plans to send a draft shortly
    • Targeting IEEE MLMI
  • From Michela
    • Becoming more available now -- will put something more concrete together for next meeting
    • Rough ideas
      • Looking at different datasets and how interobserver/intraobserver variance plays a role
      • Questions of generalizability/idiosyncrasy of datasets
    • DKFZ is building a data curation unit (per Annika) -- could be an interesting resource to connect with once they become more established
    • Interesting questions around metrics used for quality
    • Questions about semi-automatic ground truth labeling and biases that it introduces
    • Questions about prevalence differences and representativeness of the set used for QC
    • Carole going to put papers in GDrive

Administrative Items

  • Nick stepping down as secretary, Michela elected to fill the role
    • Nick will share instructions with Michela offline
  • Anne also very time limited moving forward -- offering to step down if someone else in her spot might be able to contribute more
    • Perhaps could recruit someone from her team to ensure that we keep representation from platform side
  • Next meeting to be moved to the 29th of April
