Why is speed perturbation treated as different speakers? #436

@TranThanh96

Hi, thanks for releasing this great project!

I noticed in the code that when speed_perturb is enabled, the number of classes is multiplied by 3 (line 130, train.py):

    if configs['data_type'] != 'feat' and configs['dataset_args'][
            'speed_perturb']:
        # diff speed is regarded as diff spk
        configs['projection_args']['num_class'] *= 3
        if configs.get('do_lm', False):
            logger.info(
                'No speed perturb while doing large margin fine-tuning')
            configs['dataset_args']['speed_perturb'] = False

This seems to treat each speed-perturbed version of an utterance as if it were from a different speaker.
I would expect speed perturbation to keep the same speaker label (since the identity doesn’t change, only the speaking rate).
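To make the two options concrete, here is a minimal sketch of what I understand the labeling to be. The function names, the speed factors (1.0, 0.9, 1.1), and the offset scheme `spk_id + num_spk * speed_idx` are my assumptions for illustration, not code taken from the repository.

```python
# Hypothetical illustration; names and speed factors are assumptions.
SPEEDS = (1.0, 0.9, 1.1)  # index 0 is the unperturbed copy


def perturbed_label(spk_id: int, speed_idx: int, num_spk: int,
                    separate_classes: bool = True) -> int:
    """Return the training label for one speed-perturbed utterance."""
    if not 0 <= spk_id < num_spk:
        raise ValueError("spk_id out of range")
    if not 0 <= speed_idx < len(SPEEDS):
        raise ValueError("speed_idx out of range")
    if separate_classes:
        # Each speed gets its own block of num_spk labels,
        # so the classifier needs num_spk * len(SPEEDS) outputs.
        return spk_id + num_spk * speed_idx
    # What I expected: the label is unchanged by the perturbation.
    return spk_id


def num_classes(num_spk: int, separate_classes: bool = True) -> int:
    """Size of the classification layer under each labeling scheme."""
    return num_spk * len(SPEEDS) if separate_classes else num_spk
```

Under the first scheme, speaker 5 out of 100 at the third speed would be trained as class 205 out of 300; under the second it would stay class 5 out of 100.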

Could you clarify the motivation or provide references for this design choice?

Is there a specific paper or benchmark showing that treating speed-perturbed audio as different speakers improves performance?

Wouldn’t this risk confusing the model by artificially inflating the number of classes?

Thanks a lot!
