Closed
Labels: bug (Something isn't working)
Description
Bug description
Thanks for this excellent tutorial; I learned a lot from this repo.
I followed chapter 5's 03_bonus_pretraining_on_gutenberg using the full Gutenberg dataset:
ncdu 1.15.1 ~ Use the arrow keys to navigate, press ? for help
--- /opt/repository/LLMs-from-scratch/ch05/03_bonus_pretraining_on_gutenberg ----------------------------------------------------------------------------------------------------------------------------------------------------------------
89.7 GiB [##########] /gutenberg
87.8 GiB [######### ] /gutenberg_preprocessed
15.0 GiB [# ] /model_checkpoints
24.0 KiB [ ] /__pycache__
12.0 KiB [ ] previous_chapters.py
12.0 KiB [ ] pretraining_simple.py
12.0 KiB [ ] README.md
4.0 KiB [ ] prepare_dataset.py
4.0 KiB [ ] tests.py
The model performed well for the first 70 thousand steps: the text it generated after the prompt "Every effort moves you" was reasonable and readable.
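The checkpoints in this run are named `model_pg_<step>.pth`, and a plain directory listing orders them lexicographically (so `110263` comes before `11338`). To compare early and late checkpoints, it can help to order them by step number. Here is a minimal sketch, assuming only that naming pattern; the helper name `sort_by_step` is hypothetical and not part of the repository.

```python
import re

# Matches checkpoint names such as "model_pg_11338.pth" and captures the step.
STEP_PATTERN = re.compile(r"model_pg_(\d+)\.pth$")


def sort_by_step(filenames):
    """Return checkpoint filenames ordered by their numeric training step.

    Names that do not follow the model_pg_<step>.pth pattern are ignored.
    """
    pairs = []
    for name in filenames:
        match = STEP_PATTERN.search(name)
        if match:
            pairs.append((int(match.group(1)), name))
    return [name for _, name in sorted(pairs)]


if __name__ == "__main__":
    names = ["model_pg_110263.pth", "model_pg_11338.pth", "model_pg_97054.pth"]
    for name in sort_by_step(names):
        print(name)
```

In practice the input could come from `os.listdir("model_checkpoints")`, which would make it easier to find the last checkpoint saved before the generated samples degraded.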
(base) @l40s:/opt/repository/LLMs-from-scratch/ch05/03_bonus_pretraining_on_gutenberg/model_checkpoints$ ll
total 15777456
drwxrwxr-x 2 ubuntu ubuntu 4096 Aug 4 02:53 ./
drwxrwxr-x 6 ubuntu ubuntu 4096 Aug 1 12:43 ../
-rw-rw-r-- 1 ubuntu ubuntu 702434912 Aug 2 11:08 model_pg_110263.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434735 Aug 1 15:01 model_pg_11338.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434912 Aug 2 13:27 model_pg_121652.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434912 Aug 2 16:11 model_pg_135104.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434912 Aug 2 18:26 model_pg_146207.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434912 Aug 2 20:45 model_pg_157532.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434912 Aug 3 00:47 model_pg_177488.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434912 Aug 3 03:07 model_pg_188954.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434912 Aug 3 05:24 model_pg_200250.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434912 Aug 3 06:02 model_pg_203300.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434912 Aug 3 08:20 model_pg_214703.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434912 Aug 3 10:40 model_pg_226160.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434912 Aug 3 12:58 model_pg_237489.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434912 Aug 3 15:23 model_pg_249333.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434912 Aug 3 19:23 model_pg_269109.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434735 Aug 1 18:28 model_pg_28314.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434912 Aug 3 22:52 model_pg_286227.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434912 Aug 4 02:53 model_pg_306036.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434735 Aug 1 20:47 model_pg_39673.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434735 Aug 2 00:13 model_pg_56588.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434735 Aug 2 03:40 model_pg_73617.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434735 Aug 2 05:58 model_pg_84892.pth
-rw-rw-r-- 1 ubuntu ubuntu 702434735 Aug 2 08:26 model_pg_97054.pth
But when I logged in just now to this server, which has a single L40S GPU, the model's loss was much lower while the generated text had become weird:
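One way to put numbers on "lower loss but weirder text" is to parse the training log into loss records and generated samples, then score each sample for repetition. The sketch below assumes the log format `Ep E (Step N): Train loss X, Val loss Y` and that samples start with the prompt "Every effort moves you". The function names and the distinct-bigram score are illustrative choices, not something the training script provides.

```python
import re

# Pattern for lines such as "Ep 1 (Step 141000): Train loss 2.814, Val loss 3.307".
LOSS_LINE = re.compile(
    r"Ep (\d+) \(Step (\d+)\): Train loss ([\d.]+), Val loss ([\d.]+)"
)


def parse_log(lines, prompt="Every effort moves you"):
    """Split log lines into loss records and generated samples.

    Returns (losses, samples), where losses holds (epoch, step, train, val)
    tuples and samples holds (step, text) tuples. Each sample is tagged with
    the step of the most recent loss line, or None if none came before it.
    """
    losses, samples = [], []
    last_step = None
    for line in lines:
        line = line.strip()
        match = LOSS_LINE.match(line)
        if match:
            epoch, step, train, val = match.groups()
            last_step = int(step)
            losses.append((int(epoch), last_step, float(train), float(val)))
        elif line.startswith(prompt):
            samples.append((last_step, line))
    return losses, samples


def distinct_bigram_ratio(text):
    """Return unique bigrams divided by total bigrams over whitespace tokens.

    Values near 1.0 indicate little repetition; low values indicate looping.
    """
    tokens = text.split()
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 1.0
    return len(set(bigrams)) / len(bigrams)


if __name__ == "__main__":
    log = [
        "Ep 1 (Step 305000): Train loss 2.020, Val loss 1.537",
        "Every effort moves you 1 susceptible 1 susceptible 1 susceptible",
        "Ep 1 (Step 305100): Train loss 1.906, Val loss 1.537",
    ]
    losses, samples = parse_log(log)
    for step, text in samples:
        print(step, round(distinct_bigram_ratio(text), 3))
    for epoch, step, train, val in losses:
        print(step, "val - train =", round(val - train, 3))
```

Printing the gap between validation and training loss next to the repetition score makes it easier to see at which step the two start to diverge from the quality of the samples.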
Ep 1 (Step 140900): Train loss 3.190, Val loss 3.301
Ep 1 (Step 141000): Train loss 2.814, Val loss 3.307
Every effort moves you, and I will not be able to help you. You are not going to be troubled with the idea of a new life. You are not going to be troubled with the idea of a new life. You are
Ep 1 (Step 141100): Train loss 2.836, Val loss 3.298
Ep 1 (Step 141200): Train loss 3.174, Val loss 3.303
Ep 1 (Step 141300): Train loss 2.953, Val loss 3.305
Ep 1 (Step 141400): Train loss 3.290, Val loss 3.294
Ep 1 (Step 141500): Train loss 2.784, Val loss 3.306
Ep 1 (Step 141600): Train loss 2.707, Val loss 3.316
Ep 1 (Step 141700): Train loss 3.126, Val loss 3.293
Ep 1 (Step 141800): Train loss 2.819, Val loss 3.317
Ep 1 (Step 141900): Train loss 2.922, Val loss 3.302
Ep 1 (Step 142000): Train loss 2.770, Val loss 3.311
....
Ep 1 (Step 303600): Train loss 1.942, Val loss 1.533
Ep 1 (Step 303700): Train loss 1.991, Val loss 1.545
Ep 1 (Step 303800): Train loss 2.034, Val loss 1.540
Ep 1 (Step 303900): Train loss 1.960, Val loss 1.539
Ep 1 (Step 304000): Train loss 1.966, Val loss 1.539
Every effort moves you 髫 1 髫 1 髫 1 髫 1 髫 1 髫 1 髫 1 髫 1 �
Ep 1 (Step 304100): Train loss 1.872, Val loss 1.533
Ep 1 (Step 304200): Train loss 2.053, Val loss 1.535
Ep 1 (Step 304300): Train loss 1.974, Val loss 1.536
Ep 1 (Step 304400): Train loss 1.944, Val loss 1.544
Ep 1 (Step 304500): Train loss 1.923, Val loss 1.539
Ep 1 (Step 304600): Train loss 1.891, Val loss 1.551
Ep 1 (Step 304700): Train loss 1.998, Val loss 1.545
Ep 1 (Step 304800): Train loss 1.892, Val loss 1.544
Ep 1 (Step 304900): Train loss 1.888, Val loss 1.543
Ep 1 (Step 305000): Train loss 2.020, Val loss 1.537
Every effort moves you 1 susceptible 1 susceptible 1 susceptible 1 susceptible 1 susceptible 1 susceptible 1 susceptible
Ep 1 (Step 305100): Train loss 1.906, Val loss 1.537
Ep 1 (Step 305200): Train loss 1.842, Val loss 1.542
Ep 1 (Step 305300): Train loss 2.080, Val loss 1.539
Ep 1 (Step 305400): Train loss 1.993, Val loss 1.536
Ep 1 (Step 305500): Train loss 2.016, Val loss 1.537
Ep 1 (Step 305600): Train loss 2.001, Val loss 1.533
Ep 1 (Step 305700): Train loss 1.844, Val loss 1.536
Ep 1 (Step 305800): Train loss 1.988, Val loss 1.533
Ep 1 (Step 305900): Train loss 1.590, Val loss 1.536
Ep 1 (Step 306000): Train loss 1.879, Val loss 1.536
Every effort moves you héré 1 dépouillé 1 dépouillé 1 dépouillé 1 dépouillé 1 dépouillé
What operating system are you using?
Linux
Where do you run your code?
Other cloud environment (AWS, Azure, GCP)
Environment
[OK] Your Python version is 3.11.5
2024-08-04 03:26:08.627305: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-04 03:26:09.063873: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-04 03:26:09.194486: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-04 03:26:10.025488: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-04 03:26:13.318814: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/ubuntu/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
[OK] torch 2.4.0+cu121
[OK] jupyterlab 4.2.4
[OK] tiktoken 0.7.0
[OK] matplotlib 3.7.2
[OK] tensorflow 2.17.0
[OK] tqdm 4.66.4
[OK] numpy 1.26.4
[OK] pandas 2.2.2
[OK] psutil 6.0.0