Skip to content

ContrastiveDataset iterator and ContrastiveDistillationDataset bug. #578

Open
@DemirTonchev

Description

@DemirTonchev

(updated)

ContrastiveDataset

ContrastiveDataset by default generates and builds the whole positive and negative examples internally in lists. Then in trainer.py

dataset = Dataset.from_list(list(data_sampler))

from_list method is used to create the dataset for training. This is fine with smaller datasets which I recognize is the whole point of the setfit method, yet if you want to build bigger dataset for whatever reason this can blow your RAM memory (as I have experienced). It would be best if you can build the dataset from a generator Dataset.from_generator While the current __iter__ method is correct the internal data that it uses is eagerly generated on __init__ . It would be best if the generator is truly "lazy".

Other question I am not sure why is what is the reason that replacement is set to True by default? Although there are some cases where this makes sense the matrix picture in the sampling guide shows the diagonal as empty (ie I wont expect the library to do the opposite by default - https://huggingface.co/docs/setfit/v1.1.0/en/conceptual_guides/sampling_strategies) Also there is no way to control that.

def shuffle_combinations(iterable: Iterable, replacement: bool = True) -> Generator:

Using this setting we get pairs such as:
("The movie was awesome", "The movie was awesome") as a positive pair in the training process.

ContrastiveDistillationDataset

ContrastiveDistillationDataset inherits from ContrastiveDataset but set its own

self.sentence_labels = list(enumerate(self.sentences))

by looking at the code you would expect that the pairs are generated using these self.sentence_labels but what happens is the super().__init__ goes first and there it uses the sentence_labels from ContrastiveDataset

self.sentence_labels = list(zip(self.sentences, self.labels))

and this results in creating the pairs using the function defined in ContrastiveDistillationDataset
def generate_pairs(self) -> None:

but with the sentence_labels defined by ContrastiveDataset.
This results in creating pairs of sentences that are always selecting the first row first column element from the cos_matrix as label.
this is what you expect by looking at the code

{"sentence_1": text_one, "sentence_2": text_two, "label": self.cos_sim_matrix[id_one][id_two]}

But because the init method runs with the sentence_labels generated from the parent you always get:

{"sentence_1": text_one, "sentence_2": text_two, "label": self.cos_sim_matrix[0][0]}

because the labels in the parent are set as 0 for each sentence. Then after the init you set self.sentence_labels = list(enumerate(self.sentences)) but this is never used as the pairs are already generated, very nasty and easily overlooked bug. This unfortunately makes ContrastiveDistillationDataset unusable.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions