I had fine-tuning working with Llama3.1 8B on an earlier version of torchtune, but after upgrading torchtune to 0.5 I can't get it to run again. I tried grabbing the new recipe (this is single GPU, LoRA) and updating my config with the new parameters, but I'm now getting the error below. This is an instruct dataset:
```
INFO:torchtune.utils._logging:Running LoRAFinetuneRecipeSingleDevice with resolved config:

batch_size: 1
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /data/HF-llama3.1-8b-instruct/
  checkpoint_files:
  model_type: LLAMA3
  output_dir: /data/tuned_model
  recipe_checkpoint: recipe_state.pt
compile: false
dataset:
  column_map:
    input: prompt
    output: response
  data_files: /data/torchtune/dataset/instruct/parquet/train-algol-manual-vol1-instruct-0003.parquet
  new_system_prompt: You are an AI assistant who provides helpful and accurate answers
    to questions.
  source: parquet
  split: train
  train_on_input: true
device: cuda
dtype: bf16
enable_activation_checkpointing: true
enable_activation_offloading: true
epochs: 4
gradient_accumulation_steps: 4
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 100
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /data/tuned-model/logs
model:
  _component_: torchtune.models.llama3_1.lora_llama3_1_8b
  apply_lora_to_mlp: true
  apply_lora_to_output: false
  lora_alpha: 16
  lora_attn_modules:
  lora_dropout: 0.0
  lora_rank: 8
optimizer:
  _component_: torch.optim.AdamW
  lr: 0.0003
  weight_decay: 0.01
optimizer_in_bwd: false
output_dir: /data/tuned-model
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: /data/tuned-model/logs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 3
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
save_adapter_weights_only: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /data/HF-llama3.1-8b-instruct/original/tokenizer.model
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 3130938658. Local seed is seed + rank = 3130938658 + 0
Writing logs to /data/tuned-model/logs/log_1742309993.txt
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Memory stats after model init:
    GPU peak memory allocation: 15.06 GiB
    GPU peak memory reserved: 15.18 GiB
    GPU peak memory active: 15.06 GiB
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:Optimizer and loss are initialized.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:Learning rate scheduler is initialized.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
1|25|Loss: 2.439957618713379:   2%|████      | 25/1049 [00:27<18:16, 1.07s/it]Traceback (most recent call last):
  File "/data/pe/bin/tune", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/data/pe/lib/python3.12/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/data/pe/lib/python3.12/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/data/pe/lib/python3.12/site-packages/torchtune/_cli/run.py", line 214, in _run_cmd
    self._run_single_device(args, is_builtin=is_builtin)
  File "/data/pe/lib/python3.12/site-packages/torchtune/_cli/run.py", line 108, in _run_single_device
    runpy.run_path(str(args.recipe), run_name="__main__")
  File "<frozen runpy>", line 286, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/data/pe/lib/python3.12/site-packages/recipes/lora_finetune_single_device.py", line 803, in <module>
    sys.exit(recipe_main())
             ^^^^^^^^^^^^^
  File "/data/pe/lib/python3.12/site-packages/torchtune/config/_parse.py", line 99, in wrapper
    sys.exit(recipe_main(conf))
             ^^^^^^^^^^^^^^^^^
  File "/data/pe/lib/python3.12/site-packages/recipes/lora_finetune_single_device.py", line 798, in recipe_main
    recipe.train()
  File "/data/pe/lib/python3.12/site-packages/recipes/lora_finetune_single_device.py", line 678, in train
    for idx, batch in enumerate(self._dataloader):
  File "/data/pe/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 708, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/data/pe/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 764, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/pe/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/data/pe/lib/python3.12/site-packages/torchtune/datasets/_concat.py", line 90, in __getitem__
    return dataset[index - start]
           ~~~~~~~^^^^^^^^^^^^^^^
  File "/data/pe/lib/python3.12/site-packages/torchtune/datasets/_sft.py", line 118, in __getitem__
    return self._prepare_sample(sample)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/pe/lib/python3.12/site-packages/torchtune/datasets/_sft.py", line 125, in _prepare_sample
    tokenized_dict = self._model_transform(transformed_sample)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/pe/lib/python3.12/site-packages/torchtune/models/llama3/_tokenizer.py", line 345, in __call__
    tokens, mask = self.tokenize_messages(messages, add_end_tokens=not inference)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/pe/lib/python3.12/site-packages/torchtune/models/llama3/_tokenizer.py", line 308, in tokenize_messages
    tokenized_message = self.tokenize_message(
                        ^^^^^^^^^^^^^^^^^^^^^^
  File "/data/pe/lib/python3.12/site-packages/torchtune/models/llama3/_tokenizer.py", line 255, in tokenize_message
    tokenized_body = self._tokenize_body(message)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/pe/lib/python3.12/site-packages/torchtune/models/llama3/_tokenizer.py", line 224, in _tokenize_body
    item["content"].strip(), add_bos=False, add_eos=False
    ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'strip'
1|25|Loss: 2.439957618713379:   2%|████
```
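For what it's worth, the bottom of the trace says `item["content"].strip()` failed because the content was `None`, so my best guess is that at least one row of the parquet has a null in one of the mapped columns. Here is a sketch of the sanity check I'd run (the `prompt`/`response` column names come from my `column_map`; I haven't confirmed this is actually the cause):

```python
import pandas as pd

# Look for rows where either mapped column is null or empty. A null cell
# would reach the tokenizer as message content of None, which matches the
# AttributeError: 'NoneType' object has no attribute 'strip' above.
df = pd.read_parquet(
    "/data/torchtune/dataset/instruct/parquet/"
    "train-algol-manual-vol1-instruct-0003.parquet"
)
for col in ("prompt", "response"):
    bad = df[df[col].isna() | (df[col].astype(str).str.strip() == "")]
    print(f"{col}: {len(bad)} null/empty rows; first indices: {bad.index[:10].tolist()}")
```

If that flags anything, dropping those rows with `df.dropna(subset=["prompt", "response"])` and writing the cleaned frame back out with `to_parquet` would at least show whether bad data, rather than the torchtune upgrade itself, is triggering the crash.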
-

Just wanted to add that I upgraded to torchtune 0.6 and am getting the same issue, and I can't figure out what's wrong. Any help or suggestions would be greatly appreciated!