Description
Describe the issue
I'm struggling to get fluxgym working to train a LoRA. I'm using the default flux-dev model and I keep getting avr_loss=nan during training.
This is the last script I tried; I attempted to fix the issue by testing several combinations of settings, always with the same result:
accelerate launch ^
--num_cpu_threads_per_process 1 ^
sd-scripts/flux_train_network.py ^
--pretrained_model_name_or_path "c:\llm\fluxgym\models\unet\flux1-dev.sft" ^
--clip_l "c:\llm\fluxgym\models\clip\clip_l.safetensors" ^
--t5xxl "c:\llm\fluxgym\models\clip\t5xxl_fp16.safetensors" ^
--ae "c:\llm\fluxgym\models\vae\ae.sft" ^
--cache_latents_to_disk ^
--save_model_as safetensors ^
--sdpa --persistent_data_loader_workers ^
--max_data_loader_n_workers 2 ^
--seed 42 ^
--gradient_checkpointing ^
--mixed_precision no ^
--save_precision float ^
--network_module networks.lora_flux ^
--network_dim 2 ^
--optimizer_type adamw ^
--optimizer_args "betas=(0.9,0.999)" "eps=1e-8" "weight_decay=0.01" ^
--lr_scheduler cosine_with_restarts ^
--learning_rate 5e-5 ^
--cache_text_encoder_outputs ^
--cache_text_encoder_outputs_to_disk ^
--highvram ^
--max_train_epochs 16 ^
--save_every_n_epochs 4 ^
--dataset_config "c:\llm\fluxgym\outputs\test\dataset.toml" ^
--output_dir "c:\llm\fluxgym\outputs\test" ^
--output_name test ^
--timestep_sampling shift ^
--discrete_flow_shift 3.1582 ^
--model_prediction_type raw ^
--guidance_scale 1 ^
--loss_type l2 ^
--max_grad_norm 1.0
Output Example:
[2025-05-27 20:19:19] [INFO] current_epoch: 0, epoch: 1
[2025-05-27 20:20:34] [INFO] steps: 0%| | 1/320 [00:37<3:21:00, 37.81s/it]
steps: 0%| | 1/320 [00:37<3:21:00, 37.81s/it, avr_loss=nan]
steps: 1%| | 2/320 [00:43<1:54:08, 21.54s/it, avr_loss=nan]
steps: 1%| | 3/320 [00:48<1:24:51, 16.06s/it, avr_loss=nan]