* Update documentation links to point to the website
* Fix encoding
* Add rough time estimator based on historical stats
* Fix train_test split naming logic; add quiet mode for running inside scripts
* Add a step-by-step fine-tuning example for a classification use case
* Add classification params when both train and valid sets are created; add length_validator
immediate_msg=f"\n- There are {len(long_indexes)} examples that are very long. These are rows: {long_indexes}\nFor conditional generation, and for classification the examples shouldn't be longer than 2048 tokens."
173
+
optional_msg=f"Remove {len(long_indexes)} long examples"
174
+
175
+
defoptional_fn(x):
176
+
returnx.drop(long_indexes)
177
+
178
+
returnRemediation(
179
+
name="long_examples",
180
+
immediate_msg=immediate_msg,
181
+
optional_msg=optional_msg,
182
+
optional_fn=optional_fn,
183
+
)
184
+
185
+
156
186
defcommon_prompt_suffix_validator(df):
157
187
"""
158
188
This validator will suggest to add a common suffix to the prompt if one doesn't already exist in case of classification or conditional generation.
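For a sense of how a Remediation like the `long_examples` one added above is consumed, here is a minimal sketch (hypothetical driver code, not part of this diff) that applies its `optional_fn` to a toy pandas DataFrame and drops the flagged rows:

import pandas as pd

# Toy dataset in prompt/completion format; row 1 stands in for an overly long example.
df = pd.DataFrame(
    {
        "prompt": ["short prompt ->", ("x" * 20000) + " ->"],
        "completion": [" label_a", " label_b"],
    }
)

# Suppose the length validator flagged row 1.
long_indexes = [1]

def optional_fn(x):
    return x.drop(long_indexes)

print(len(df), "->", len(optional_fn(df)))  # 2 -> 1

In the tool itself, `optional_msg` ("Remove N long examples") is what the user is shown before such a function is applied; the sketch just shows the effect of accepting the suggestion.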
@@ -210,7 +240,7 @@ def add_suffix(x, suffix):
             immediate_msg += f"\n WARNING: Some of your prompts contain the suffix `{common_suffix}` more than once. We strongly suggest that you review your prompts and add a unique suffix"
 
     else:
-        immediate_msg = "\n- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See `Fine Tuning How to Guide` for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty"
+        immediate_msg = "\n- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty"
 
     if common_suffix == "":
         optional_msg = (
@@ -361,7 +391,7 @@ def add_suffix(x, suffix):
             immediate_msg += f"\n WARNING: Some of your completions contain the suffix `{common_suffix}` more than once. We suggest that you review your completions and add a unique ending"
 
     else:
-        immediate_msg = "\n- Your data does not contain a common ending at the end of your completions. Having a common ending string appended to the end of the completion makes it clearer to the fine-tuned model where the completion should end. See `Fine Tuning How to Guide` for more detail and examples."
+        immediate_msg = "\n- Your data does not contain a common ending at the end of your completions. Having a common ending string appended to the end of the completion makes it clearer to the fine-tuned model where the completion should end. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples."
@@ ... @@
-        immediate_msg = "\n- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See `Fine Tuning How to Guide` for more details"
+        immediate_msg = "\n- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details"
         optional_msg = "Add a whitespace character to the beginning of the completion"
         optional_fn = add_space_start
     return Remediation(
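Taken together, these suggestions (a common separator at the end of each prompt, a leading space on the completion, and a common ending on the completions) produce records like the following. The separator "\n\n###\n\n" and the " END" stop string are illustrative choices only, not values mandated by the tool:

import json

# Illustrative only: any consistent separator / ending works; these are example choices.
records = [
    {"prompt": "Great pizza, friendly staff.\n\n###\n\n", "completion": " positive END"},
    {"prompt": "Cold food and slow service.\n\n###\n\n", "completion": " negative END"},
]
with open("reviews_prepared.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")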
@@ -430,7 +460,7 @@ def lower_case(x):
     if count_upper * 2 > count_lower:
         return Remediation(
             name="lower_case",
-            immediate_msg=f"\n- More than a third of your `{column}` column/key is uppercase. Uppercase {column}s tends to perform worse than a mixture of case encountered in normal language. We recommend to lower case the data if that makes sense in your domain. See `Fine Tuning How to Guide` for more details",
+            immediate_msg=f"\n- More than a third of your `{column}` column/key is uppercase. Uppercase {column}s tends to perform worse than a mixture of case encountered in normal language. We recommend to lower case the data if that makes sense in your domain. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details",
             optional_msg=f"Lowercase all your data in column/key `{column}`",
@@ ... @@
+def estimate_fine_tuning_time(df):
+    """
+    Estimate the time it'll take to fine-tune the dataset
+    """
+    ft_format = infer_task_type(df)
+    expected_time = 1.0
+    if ft_format == "classification":
+        num_examples = len(df)
+        expected_time = num_examples * 1.44
+    else:
+        size = df.memory_usage(index=True).sum()
+        expected_time = size * 0.0515
+
+    def format_time(time):
+        if time < 60:
+            return f"{round(time, 2)} seconds"
+        elif time < 3600:
+            return f"{round(time / 60, 2)} minutes"
+        elif time < 86400:
+            return f"{round(time / 3600, 2)} hours"
+        else:
+            return f"{round(time / 86400, 2)} days"
+
+    time_string = format_time(expected_time + 140)
+    sys.stdout.write(
+        f"Once your model starts training, it'll approximately take {time_string} to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.\n"
+    )
@@ ... @@
     This function will write out a dataframe to a file, if the user would like to proceed, and also offer a fine-tuning command with the newly created file.
     For classification it will optionally ask the user if they would like to split the data into train/valid files, and modify the suggested command to include the valid set.
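The split logic itself is outside this excerpt; as a minimal sketch of the idea (using scikit-learn's train_test_split and hypothetical file names as assumptions about the approach, not the tool's exact implementation):

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical prepared classification dataset with two classes.
df = pd.DataFrame(
    {
        "prompt": ["good ->", "bad ->", "great ->", "awful ->"],
        "completion": [" positive", " negative", " positive", " negative"],
    }
)

# Stratify on the completion so each class lands in both the train and valid files.
train_df, valid_df = train_test_split(
    df, test_size=0.5, stratify=df["completion"], random_state=42
)
train_df.to_json("example_prepared_train.jsonl", orient="records", lines=True)
valid_df.to_json("example_prepared_valid.jsonl", orient="records", lines=True)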
input_text="\n\nYour data will be written to a new JSONL file. Proceed [Y/n]: "
672
+
579
673
ifnotany_remediations:
580
674
sys.stdout.write(
581
-
f'\nYou can use your file for fine-tuning:\n> openai api fine_tunes.create -t "{fname}"{packing_param}\n\nAfter you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `{common_prompt_suffix_new_line_handled}` for the model to start generating completions, rather than continuing with the prompt.{optional_ending_string}\n'
675
+
f'\nYou can use your file for fine-tuning:\n> openai api fine_tunes.create -t "{fname}"{classification_params}\n\nAfter you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `{common_prompt_suffix_new_line_handled}` for the model to start generating completions, rather than continuing with the prompt.{optional_ending_string}\n'
# Add -v VALID_FILE if we split the file into train / valid
630
-
files_string= ("s"ifsplitelse"") +" to `"+ ("` and `".join(outfnames))
631
-
valid_string=f' -v "{outfnames[1]}"'ifsplitelse""
709
+
files_string= ("s"ifsplitelse"") +" to `"+ ("` and `".join(fnames))
710
+
valid_string=f' -v "{fnames[1]}"'ifsplitelse""
632
711
separator_reminder= (
633
712
""
634
713
iflen(common_prompt_suffix_new_line_handled) ==0
635
714
elsef"After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `{common_prompt_suffix_new_line_handled}` for the model to start generating completions, rather than continuing with the prompt."
636
715
)
637
716
sys.stdout.write(
638
-
f'\nWrote modified file{files_string}`\nFeel free to take a look!\n\nNow use that file when fine-tuning:\n> openai api fine_tunes.create -t "{outfnames[0]}"{valid_string}{packing_param}\n\n{separator_reminder}{optional_ending_string}\n'
717
+
f'\nWrote modified file{files_string}`\nFeel free to take a look!\n\nNow use that file when fine-tuning:\n> openai api fine_tunes.create -t "{fnames[0]}"{valid_string}{classification_params}\n\n{separator_reminder}{optional_ending_string}\n'
639
718
)
719
+
estimate_fine_tuning_time(df)
640
720
else:
641
721
sys.stdout.write("Aborting... did not write the file\n")