DiceScore uses average='micro' by default, while other methods use average='macro' #3031

@ZachParent

Description

🐛 Bug

I noticed that the DiceScore metric in segmentation uses an averaging strategy of 'micro' by default:

average: Optional[Literal["micro", "macro", "weighted", "none"]] = "micro",

which differs from the typical handling of multiclass averaging, as seen in the MulticlassStatScores class:

average: Optional[Literal["micro", "macro", "weighted", "none"]] = "macro",

Combined with the default of include_background=True, this makes the DiceScore quite optimistic (>80% dice with a pretrained segmentation model after one epoch of fine-tuning), because segmentation models tend to be biased towards predicting background.
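To make the difference concrete, here is a minimal hand-rolled sketch of the two strategies (a toy example, not torchmetrics internals): macro averaging computes a dice score per class and then takes the unweighted mean, while micro averaging pools TP/FP/FN across all classes first, so a dominant background class swamps the result.

import torch

# Toy 1-D "segmentation": class 0 (background) dominates both tensors.
target = torch.tensor([0, 0, 0, 0, 0, 0, 1, 2])
pred = torch.tensor([0, 0, 0, 0, 0, 0, 2, 1])  # every foreground pixel is wrong

def dice_per_class(pred, target, num_classes):
    scores = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        tp = (p & t).sum().item()
        denom = p.sum().item() + t.sum().item()
        scores.append(2 * tp / denom if denom else float("nan"))
    return scores

per_class = dice_per_class(pred, target, num_classes=3)
print(per_class)  # [1.0, 0.0, 0.0] -- the foreground failures are visible

# macro: unweighted mean of per-class scores
print(sum(per_class) / len(per_class))  # ~0.33

# micro: pool counts over all classes first; with one label per pixel,
# total FP and total FN both equal the number of mismatched pixels
tp = (pred == target).sum().item()
errors = (pred != target).sum().item()
print(2 * tp / (2 * tp + errors + errors))  # 0.75 -- background swamps the score

With a perfect background and a completely wrong foreground, micro reports 0.75 while macro reports 0.33.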

To Reproduce

Run the code sample below.

Code sample

This demo shows various DiceScore initializations, applied to output and target tensors that are randomly initialized but biased towards background (class=0).

import torch
import torchmetrics
import pandas as pd

num_classes = 20
batch_size = 16
dice_score_default = torchmetrics.segmentation.DiceScore(
    input_format="index",
    num_classes=num_classes,
)
dice_score_no_bg = torchmetrics.segmentation.DiceScore(
    input_format="index",
    num_classes=num_classes,
    include_background=False,
)
dice_score_macro = torchmetrics.segmentation.DiceScore(
    input_format="index",
    num_classes=num_classes,
    average="macro",
)
dice_score_realistic = torchmetrics.segmentation.DiceScore(
    input_format="index",
    num_classes=num_classes,
    include_background=False,
    average="macro",
)

# Create an example output tensor, where ~75% of pixels are background
example_output = torch.randint(0, num_classes, (batch_size, 10, 10))
output_background_mask = torch.rand(example_output.shape) < 0.75
example_output[output_background_mask] = 0

# Create an example target tensor, where ~75% of pixels are background
example_target = torch.randint(0, num_classes, (batch_size, 10, 10))
target_background_mask = torch.rand(example_target.shape) < 0.75
example_target[target_background_mask] = 0

dice_score_default.update(example_output, example_target)
dice_score_no_bg.update(example_output, example_target)
dice_score_macro.update(example_output, example_target)
dice_score_realistic.update(example_output, example_target)

scores = {
    "include_background": ["True", "False"],
    "average='micro'": [
        dice_score_default.compute().item(),
        dice_score_no_bg.compute().item(),
    ],
    "average='macro'": [
        dice_score_macro.compute().item(),
        dice_score_realistic.compute().item(),
    ],
}
scores_df = pd.DataFrame(scores)
print(scores_df)
#   include_background  average='micro'  average='macro'
# 0               True         0.575000         0.042818
# 1              False         0.007878         0.005482
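The per-class scores make the background dominance explicit. Continuing the snippet above (assuming average="none" returns one score per class, as the Literal in the signature suggests):

dice_score_per_class = torchmetrics.segmentation.DiceScore(
    input_format="index",
    num_classes=num_classes,
    average="none",
)
dice_score_per_class.update(example_output, example_target)
print(dice_score_per_class.compute())
# class 0 (background) should score high (~0.76 here); every other class lands near zero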

Expected behavior

The default initialization of DiceScore should be a sensible choice that gives realistic results. With the current defaults, entirely random outputs and targets that merely share a typical ~75% background proportion produce a DiceScore above 50%. This is not representative; since these are nearly random guesses, the expected DiceScore should be below 1%.
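A back-of-the-envelope check supports this: with single-label index inputs and every class included, pooling TP/FP/FN micro-style collapses the dice score to plain pixel accuracy (total TP is the number of matching pixels, and total FP and total FN both equal the number of mismatches), so the default score above is just the probability that two background-biased random labelings agree:

num_classes = 20

# Probability a pixel is background: 75% forced to 0, plus the 1-in-20
# chance that an unmasked uniform draw is already 0.
p_bg = 0.75 + 0.25 / num_classes  # 0.7625
p_fg = 0.25 / num_classes         # 0.0125 per foreground class

# Chance that independent output and target pixels agree on a label.
agreement = p_bg**2 + (num_classes - 1) * p_fg**2
print(agreement)  # ~0.584, in line with the observed 0.575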

Environment

  • Python & PyTorch Version (e.g., 1.0):
    • Python 3.12.9
    • PyTorch 2.6.0
    • torchmetrics 1.7.0
  • Any other relevant information such as OS (e.g., Linux):
    • Mac

Additional context

Maybe these defaults were chosen for a particular reason that I'm not familiar with, but it seems to me that torchmetrics should use a consistent averaging method across metrics, and that segmentation metrics should ignore the background by default.

I understand one reason not to make this change: updating defaults may lead to unexpected behavior for users who have not specified an averaging strategy.

I would be happy to make this change and add/update any relevant tests, if the community agrees.
