Why is speed perturbation treated as different speakers? #436

Hi, thanks for releasing this great project!

I noticed that when `speed_perturb` is enabled, the number of classes is multiplied by 3 (line 130 of train.py):


```python
if configs['data_type'] != 'feat' and configs['dataset_args'][
        'speed_perturb']:
    # diff speed is regarded as diff spk
    configs['projection_args']['num_class'] *= 3
    if configs.get('do_lm', False):
        logger.info(
            'No speed perturb while doing large margin fine-tuning')
        configs['dataset_args']['speed_perturb'] = False
```

This seems to treat each speed-perturbed version of an utterance as if it were from a different speaker.
I would expect speed perturbation to keep the same speaker label (since the identity doesn’t change, only the speaking rate).
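
To make the question concrete, here is a minimal sketch of how I assume the labels end up being assigned. The helper `perturbed_label`, the speed factors, and the offset scheme are my own illustration of what "diff speed is regarded as diff spk" could mean; I have not confirmed that the repo does it this way.

```python
# Hypothetical illustration only: names and speed factors are assumptions,
# not code taken from the project.
SPEEDS = [1.0, 0.9, 1.1]  # index 0 = original speed, 1 and 2 = perturbed


def perturbed_label(spk_id: int, num_spks: int, speed_idx: int) -> int:
    """Map (speaker, speed index) to a class index in [0, 3 * num_spks)."""
    return spk_id + num_spks * speed_idx


num_spks = 4  # toy number of real speakers
spk_id = 2    # one real speaker
labels = [perturbed_label(spk_id, num_spks, i) for i in range(len(SPEEDS))]
print(labels)  # [2, 6, 10]
```

Under this assumption, a single real speaker occupies three separate output classes, which is what would make `num_class *= 3` necessary.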

Could you clarify the motivation or provide references for this design choice?

Is there a specific paper or benchmark showing that treating speed-perturbed audio as different speakers improves performance?

Wouldn’t this risk confusing the model by artificially inflating the number of classes?

Thanks a lot!
