ContrastiveDataset iterator and ContrastiveDistillationDataset bug.

(updated)
## ContrastiveDataset

By default, ContrastiveDataset eagerly builds all positive and negative pairs and stores them in internal lists. Then in trainer.py
https://github.com/huggingface/setfit/blob/146c7c9dacdc7dca678b2fffff8ddeb79dd762c2/src/setfit/trainer.py#L605
the ```from_list``` method is used to create the training dataset. This is fine for smaller datasets, **which I recognize is the whole point of the SetFit method**, but if you want to build a bigger dataset for whatever reason, this can exhaust your RAM (as I have experienced). It would be better to build the dataset from a generator with ```Dataset.from_generator```. While the current ```__iter__``` method is correct, the internal data it iterates over is eagerly generated in ```__init__```. It would be best if the generator were truly "lazy".
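To illustrate what "lazy" could look like, here is a minimal sketch of a generator that yields one pair at a time instead of storing them all. The name `iter_pairs` is hypothetical and not part of setfit, the 1.0/0.0 labels for matching and non-matching classes are an assumption, and the sketch ignores shuffling and the sampling strategies.

```python
from itertools import combinations
from typing import Dict, Iterator, Sequence, Union


def iter_pairs(
    sentences: Sequence[str], labels: Sequence[int]
) -> Iterator[Dict[str, Union[str, float]]]:
    """Hypothetical helper: lazily yield sentence pairs one at a time.

    Only the n (sentence, label) tuples are held in memory; the O(n^2)
    pairs are produced on demand instead of being stored in a list.
    """
    for (text_one, label_one), (text_two, label_two) in combinations(
        zip(sentences, labels), 2
    ):
        yield {
            "sentence_1": text_one,
            "sentence_2": text_two,
            # Assumption: 1.0 marks a same-class pair, 0.0 a different-class pair.
            "label": 1.0 if label_one == label_two else 0.0,
        }


# A generator function like this could be handed to
# datasets.Dataset.from_generator(iter_pairs, gen_kwargs={...})
# instead of materializing a list first.
first_pair = next(iter_pairs(["a", "b", "c"], [0, 0, 1]))
```

Nothing is computed until the generator is consumed, so memory use stays proportional to the number of sentences rather than the number of pairs.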

Another question: why is replacement set to True by default? There are some cases where this makes sense, but the matrix picture in the sampling guide shows the diagonal as empty, so I would not expect the library to do the opposite by default (https://huggingface.co/docs/setfit/v1.1.0/en/conceptual_guides/sampling_strategies). There is also no way to control this setting.

https://github.com/huggingface/setfit/blob/146c7c9dacdc7dca678b2fffff8ddeb79dd762c2/src/setfit/sampler.py#L15

With this setting we get pairs such as
("The movie was awesome", "The movie was awesome") as a positive pair during training.
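The difference can be illustrated with the standard library. This is only an analogy for combining with and without replacement, not the library's own sampling code, and the second sentence is made up for the example.

```python
from itertools import combinations, combinations_with_replacement

sentences = ["The movie was awesome", "I loved it"]

# With replacement: the diagonal is included, so each sentence is paired with itself.
with_replacement = list(combinations_with_replacement(sentences, 2))

# Without replacement: only pairs of distinct sentences remain.
without_replacement = list(combinations(sentences, 2))
```

Here `with_replacement` contains the self-pair from the example above, while `without_replacement` holds only the single pair of distinct sentences.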

## ContrastiveDistillationDataset 

ContrastiveDistillationDataset inherits from ContrastiveDataset but sets its own ```self.sentence_labels```:

https://github.com/huggingface/setfit/blob/146c7c9dacdc7dca678b2fffff8ddeb79dd762c2/src/setfit/sampler.py#L170

Looking at the code, you would expect the pairs to be generated from this ```self.sentence_labels```, but ```super().__init__``` runs first, and there it uses the sentence_labels from ContrastiveDataset
https://github.com/huggingface/setfit/blob/146c7c9dacdc7dca678b2fffff8ddeb79dd762c2/src/setfit/sampler.py#L63
As a result, the pairs are created by the function defined in ContrastiveDistillationDataset
https://github.com/huggingface/setfit/blob/146c7c9dacdc7dca678b2fffff8ddeb79dd762c2/src/setfit/sampler.py#L178
but with the sentence_labels defined by ContrastiveDataset.
Every generated pair therefore takes the first-row, first-column element of the cos_matrix as its label.
This is what you would expect from looking at the code:
```
{"sentence_1": text_one, "sentence_2": text_two, "label": self.cos_sim_matrix[id_one][id_two]}
```
But because the pairs are generated during the parent's init, using the sentence_labels built by the parent, you always get:
```
{"sentence_1": text_one, "sentence_2": text_two, "label": self.cos_sim_matrix[0][0]}
```
because the labels in the parent are set to 0 for every sentence. After the parent init returns, ```self.sentence_labels = list(enumerate(self.sentences))``` is set, but it is never used because the pairs have already been generated. This is a very nasty and easily overlooked bug, and unfortunately it makes ContrastiveDistillationDataset unusable.
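The ordering problem can be reproduced in isolation. The classes below are simplified, hypothetical stand-ins (`Parent`, `BuggyChild`, `FixedChild`), not the setfit code, and the fix shown (passing the sentence indices to the parent as labels) is just one possible approach.

```python
from itertools import combinations


class Parent:
    def __init__(self, sentences, labels):
        self.sentences = sentences
        self.sentence_labels = list(zip(sentences, labels))
        self.pairs = []
        # Dispatches to the subclass override, using the labels set just above.
        self.generate_pairs()

    def generate_pairs(self):
        for (t1, l1), (t2, l2) in combinations(self.sentence_labels, 2):
            label = 1.0 if l1 == l2 else 0.0
            self.pairs.append({"sentence_1": t1, "sentence_2": t2, "label": label})


class BuggyChild(Parent):
    def __init__(self, sentences, matrix):
        self.matrix = matrix
        # The parent generates the pairs while every "label" is still 0.
        super().__init__(sentences, [0] * len(sentences))
        # Too late: the pairs already exist, so this assignment has no effect.
        self.sentence_labels = list(zip(sentences, range(len(sentences))))

    def generate_pairs(self):
        for (t1, i1), (t2, i2) in combinations(self.sentence_labels, 2):
            self.pairs.append(
                {"sentence_1": t1, "sentence_2": t2, "label": self.matrix[i1][i2]}
            )


class FixedChild(BuggyChild):
    def __init__(self, sentences, matrix):
        self.matrix = matrix
        # One possible fix: give the parent the indices before it builds the pairs.
        Parent.__init__(self, sentences, list(range(len(sentences))))


matrix = [[1.0, 0.2, 0.3], [0.2, 1.0, 0.6], [0.3, 0.6, 1.0]]
texts = ["a", "b", "c"]
buggy_labels = [p["label"] for p in BuggyChild(texts, matrix).pairs]
fixed_labels = [p["label"] for p in FixedChild(texts, matrix).pairs]
```

In this sketch `buggy_labels` contains only `matrix[0][0]`, while `fixed_labels` picks up the off-diagonal similarities for each pair.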
