DiceScore uses average='micro' by default, while other methods use average='macro' #3031

@ZachParent

Description

🐛 Bug

I noticed that the DiceScore metric in segmentation uses an averaging strategy of 'micro' by default:

average: Optional[Literal["micro", "macro", "weighted", "none"]] = "micro",

which differs from the typical handling of multiclass averaging, as seen in the MulticlassStatScores class:

average: Optional[Literal["micro", "macro", "weighted", "none"]] = "macro",

Combined with the default of include_background=True, this makes the DiceScore quite optimistic (>80% dice with a pretrained segmentation model after one epoch of fine-tuning), because segmentation models tend to be biased towards predicting background.
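To make the difference concrete, here is a minimal hand-rolled sketch of the two strategies (a toy example, not torchmetrics internals): macro averaging computes a dice score per class and then takes the unweighted mean, while micro averaging pools TP/FP/FN across all classes first, so a dominant background class swamps the result.

import torch

# Toy 1-D "segmentation": class 0 (background) dominates both tensors.
target = torch.tensor([0, 0, 0, 0, 0, 0, 1, 2])
pred = torch.tensor([0, 0, 0, 0, 0, 0, 2, 1])  # every foreground pixel is wrong

def dice_per_class(pred, target, num_classes):
    scores = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        tp = (p & t).sum().item()
        denom = p.sum().item() + t.sum().item()
        scores.append(2 * tp / denom if denom else float("nan"))
    return scores

per_class = dice_per_class(pred, target, num_classes=3)
print(per_class)  # [1.0, 0.0, 0.0] -- the foreground failures are visible

# macro: unweighted mean of per-class scores
print(sum(per_class) / len(per_class))  # ~0.33

# micro: pool counts over all classes first; with one label per pixel,
# total FP and total FN both equal the number of mismatched pixels
tp = (pred == target).sum().item()
errors = (pred != target).sum().item()
print(2 * tp / (2 * tp + errors + errors))  # 0.75 -- background swamps the score

With a perfect background and a completely wrong foreground, micro reports 0.75 while macro reports 0.33.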

To Reproduce

Run the code sample below.

Code sample

This demo shows various DiceScore initializations, applied to output and target tensors that are randomly initialized but biased towards background (class=0).

import torch
import torchmetrics
import pandas as pd

num_classes = 20
batch_size = 16
dice_score_default = torchmetrics.segmentation.DiceScore(
    input_format="index",
    num_classes=num_classes,
)
dice_score_no_bg = torchmetrics.segmentation.DiceScore(
    input_format="index",
    num_classes=num_classes,
    include_background=False,
)
dice_score_macro = torchmetrics.segmentation.DiceScore(
    input_format="index",
    num_classes=num_classes,
    average="macro",
)
dice_score_realistic = torchmetrics.segmentation.DiceScore(
    input_format="index",
    num_classes=num_classes,
    include_background=False,
    average="macro",
)

# Create an example output tensor, where ~75% of pixels are background
example_output = torch.randint(0, num_classes, (batch_size, 10, 10))
output_background_mask = torch.rand(example_output.shape) < 0.75
example_output[output_background_mask] = 0

# Create an example target tensor, where ~75% of pixels are background
example_target = torch.randint(0, num_classes, (batch_size, 10, 10))
target_background_mask = torch.rand(example_target.shape) < 0.75
example_target[target_background_mask] = 0

dice_score_default.update(example_output, example_target)
dice_score_no_bg.update(example_output, example_target)
dice_score_macro.update(example_output, example_target)
dice_score_realistic.update(example_output, example_target)

scores = {
    "include_background": ["True", "False"],
    "average='micro'": [
        dice_score_default.compute().item(),
        dice_score_no_bg.compute().item(),
    ],
    "average='macro'": [
        dice_score_macro.compute().item(),
        dice_score_realistic.compute().item(),
    ],
}
scores_df = pd.DataFrame(scores)
print(scores_df)
#   include_background  average='micro'  average='macro'
# 0               True         0.575000         0.042818
# 1              False         0.007878         0.005482
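The per-class scores make the background dominance explicit. Continuing the snippet above (assuming average="none" returns one score per class, as the Literal in the signature suggests):

dice_score_per_class = torchmetrics.segmentation.DiceScore(
    input_format="index",
    num_classes=num_classes,
    average="none",
)
dice_score_per_class.update(example_output, example_target)
print(dice_score_per_class.compute())
# class 0 (background) should score high (~0.76 here); every other class lands near zero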

Expected behavior

The default initialization of DiceScore should be a sensible choice that gives realistic results. With the current defaults, entirely random outputs and targets that merely share a typical ~75% background proportion produce a DiceScore above 50%. This is not representative; since these are nearly random guesses, the expected DiceScore should be below 1%.
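A back-of-the-envelope check supports this: with single-label index inputs and every class included, pooling TP/FP/FN micro-style collapses the dice score to plain pixel accuracy (total TP is the number of matching pixels, and total FP and total FN both equal the number of mismatches), so the default score above is just the probability that two background-biased random labelings agree:

num_classes = 20

# Probability a pixel is background: 75% forced to 0, plus the 1-in-20
# chance that an unmasked uniform draw is already 0.
p_bg = 0.75 + 0.25 / num_classes  # 0.7625
p_fg = 0.25 / num_classes         # 0.0125 per foreground class

# Chance that independent output and target pixels agree on a label.
agreement = p_bg**2 + (num_classes - 1) * p_fg**2
print(agreement)  # ~0.584, in line with the observed 0.575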

Environment

  • Python & PyTorch Version (e.g., 1.0):
    • Python 3.12.9
    • PyTorch 2.6.0
    • torchmetrics 1.7.0
  • Any other relevant information such as OS (e.g., Linux):
    • Mac

Additional context

Maybe these defaults were chosen for a particular reason that I'm not familiar with, but it seems to me that torchmetrics should use a consistent averaging method across metrics, and that segmentation metrics should ignore the background by default.

I understand one reason not to make this change: updating defaults may lead to unexpected behavior for users who have not specified an averaging strategy.

I would be happy to make this change and add/update any relevant tests, if the community agrees.
