A complete implementation of a 125M-parameter LLaMA-style transformer model with training, continued pretraining, and instruction fine-tuning capabilities.
This project implements a scaled-down version of the LLaMA architecture (125M parameters) with the following features:
- Custom LLaMA Implementation: Complete transformer architecture with RMSNorm, SwiGLU, and Rotary Position Embeddings (RoPE)
- Initial Training: Pretraining on TinyStories dataset
- Continued Pretraining: Resume training from checkpoints with improved data handling
- Instruction Fine-tuning: Fine-tune on Alpaca dataset for instruction-following capabilities
- Interactive Chat: Chat interface for testing the fine-tuned model
- Comprehensive Testing: Multiple scripts for model evaluation and sampling
├── train_llama125m.py # Main training script - initial pretraining
├── continue_pretrain.py # Continued pretraining functionality
├── finetune_instructions.py # Instruction fine-tuning on Alpaca dataset
├── test_finetuned_model.py # Test fine-tuned model with predefined prompts
├── chat_with_model.py # Interactive chat interface
├── sample_after_training.py # Generate samples from trained models
├── requirements.txt # Python dependencies (empty - see installation)
├── note.md # Training notes and sample outputs
├── .gitignore # Git ignore file
├── llama125m_tinystories/ # Directory for base model checkpoints
└── llama125m_alpaca/ # Directory for fine-tuned model checkpoints
The LLaMA 125M model implements the following components:
- RMSNorm: Root Mean Square Layer Normalization for better training stability
- SwiGLU: Swish-Gated Linear Unit activation function in feed-forward networks
- Rotary Position Embeddings (RoPE): Relative position encoding for better sequence understanding
- Multi-head Attention: Standard transformer attention with causal masking
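To make the components above concrete, here is a minimal plain-Python sketch of RMSNorm, the SwiGLU gate, and RoPE applied to a single vector. It is not the code from `train_llama125m.py`, which presumably operates on PyTorch tensors; the function names, the `eps` value, the RoPE `base` of 10000, and the interleaved pairing of dimensions are all assumptions chosen for illustration.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a learned per-feature gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU (Swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu(gate, up):
    """SwiGLU gate: SiLU(gate) times 'up', elementwise.

    In a full feed-forward layer, gate and up are two linear projections of the
    same input, and a third projection maps the product back to the hidden size.
    """
    return [silu(g) * u for g, u in zip(gate, up)]

def rope(x, position, base=10000.0):
    """Rotate each pair (x[i], x[i+1]) by the angle position * base**(-i/d), i even."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out
```

Because RoPE rotates queries and keys by position-dependent angles, the dot product between a rotated query and a rotated key depends only on the distance between their positions, which is what makes the encoding relative.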
- Parameters: ~125M
- Hidden Dimension: 768
- Layers: 12
- Attention Heads: 12
- Feed-forward Multiplier: 4x
- Maximum Sequence Length: 512 (training), 5000 (inference)
- Vocabulary Size: 50,257 tokens (GPT-2 tokenizer)
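As a rough check on the ~125M figure, the parameter count can be estimated from the configuration above. This sketch assumes no bias terms, an output head that shares weights with the token embedding, and a SwiGLU hidden size of 2048 (two thirds of 4 x 768, the convention from the LLaMA paper). None of these are confirmed by the list above, and `count_parameters` is a hypothetical helper, not part of the project.

```python
def count_parameters(vocab_size=50257, hidden=768, layers=12,
                     ffn_hidden=2048, tied_embeddings=True):
    """Estimate the parameter count of a bias-free LLaMA-style decoder."""
    attention = 4 * hidden * hidden          # Q, K, V and output projections
    feed_forward = 3 * hidden * ffn_hidden   # SwiGLU: gate, up and down projections
    norms = 2 * hidden                       # two RMSNorm gains per layer
    per_layer = attention + feed_forward + norms

    embeddings = vocab_size * hidden
    # Assumption: the output head reuses the embedding matrix when tied.
    output_head = 0 if tied_embeddings else vocab_size * hidden
    final_norm = hidden
    return layers * per_layer + embeddings + output_head + final_norm


if __name__ == "__main__":
    print(f"ffn_hidden=2048: {count_parameters() / 1e6:.1f}M")
    print(f"ffn_hidden=3072: {count_parameters(ffn_hidden=3072) / 1e6:.1f}M")
```

Under these assumptions the default call gives roughly 124M parameters, while a full 3072-wide hidden size in all three SwiGLU projections gives roughly 152M. The number of attention heads does not change the total, and RoPE adds no learned parameters.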
- Python 3.8+
- CUDA-capable GPU (recommended)
- 8GB+ GPU memory for training
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers datasets
- torch: PyTorch framework
- transformers: Hugging Face transformers library
- datasets: Hugging Face datasets library
- math, time, os: Standard library modules
Train the model from scratch on TinyStories dataset:
python train_llama125m.py
Training Details:
- Dataset: TinyStories (10% subset, ~21k samples)
- Batch Size: 32
- Learning Rate: 1e-4 with cosine annealing
- Training Steps: 500
- Sequence Length: 128 tokens
- Optimizer: AdamW with weight decay (0.01)
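The cosine annealing schedule listed above can be sketched in plain Python as follows. The minimum learning rate of 0 and the absence of warmup are assumptions; in PyTorch the same curve is available as `torch.optim.lr_scheduler.CosineAnnealingLR`, though whether the training script uses it is not stated here.

```python
import math

def cosine_lr(step, total_steps=500, base_lr=1e-4, min_lr=0.0):
    """Cosine annealing: decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 125, 250, 375, 500):
        print(step, f"{cosine_lr(step):.2e}")
```

The rate starts at the full 1e-4, passes through half that value at the midpoint, and flattens out as it approaches the final step.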
Continue training from the base checkpoint:
python continue_pretrain.py
Features:
- Resumes from existing checkpoints
- Lower learning rate (5e-5) for stability
- Automatic checkpoint saving every 500 steps
- Gradient clipping for training stability
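Two of the features above, gradient clipping and periodic checkpointing, can be illustrated in a few lines of plain Python. The clipping threshold `max_norm=1.0` is an assumption (the value used by `continue_pretrain.py` is not given here), and both helper names are hypothetical. The scaling rule follows the same idea as PyTorch's `torch.nn.utils.clip_grad_norm_`.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

def should_save_checkpoint(step, every=500):
    """True on steps 500, 1000, 1500, ... (never on step 0)."""
    return step > 0 and step % every == 0
```

Clipping by the global norm keeps the direction of the update unchanged and only shortens it, which is why it tends to tame occasional loss spikes without otherwise altering training.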
Fine-tune the model for instruction-following:
python finetune_instructions.py
Fine-tuning Details:
- Dataset: Alpaca instruction dataset (100% by default)
- Format: Instruction → Input → Response structure
- Training Steps: 4000
- Learning Rate: 5e-3
- Batch Size: 8
- Sequence Length: 256 tokens
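The Instruction → Input → Response format can be sketched as a small formatting function. The wording below is the prompt template published with the Stanford Alpaca dataset; `finetune_instructions.py` may use different text, so treat the exact strings and the function name `format_alpaca_example` as assumptions.

```python
def format_alpaca_example(instruction, input_text="", response=""):
    """Build one Instruction -> Input -> Response string for fine-tuning or prompting."""
    if input_text:
        prompt = (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    else:
        # Examples without an input field skip the Input section entirely.
        prompt = (
            "Below is an instruction that describes a task. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            "### Response:\n"
        )
    return prompt + response


if __name__ == "__main__":
    print(format_alpaca_example("Translate to French.", "Good morning", "Bonjour"))
```

At inference time the same function can be called with an empty `response`, so the model continues the text after `### Response:`.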
Run predefined tests on the instruction-tuned model:
python test_finetuned_model.py
Start an interactive conversation with the fine-tuned model:
python chat_with_model.py
Chat Features:
- Interactive command-line interface
- Adjustable generation parameters (temperature, top-k, top-p)
- Built-in help and example prompts
- Conversation history tracking
Raw Text → Tokenization → Language Modeling → Base Model
- Input: TinyStories dataset
- Objective: Next token prediction
- Output: llama125m_tinystories/pytorch_model.bin
Base Model → Extended Training → Improved Base Model
- Input: Base model checkpoint
- Objective: Continued language modeling
- Output: llama125m_tinystories/continued_final.pt
Base Model → Instruction Data → Instruction-Following Model
- Input: Base model + Alpaca dataset
- Objective: Instruction following
- Output: llama125m_alpaca/sft_test_final.pt
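The next-token prediction objective used in the pretraining stages amounts to shifting each token sequence by one position and scoring the model's probability for the true next token. The following plain-Python sketch shows the idea; the helper names are hypothetical, and the loss is assumed to be the usual mean cross-entropy in nats.

```python
import math

def next_token_pairs(token_ids):
    """Inputs are all tokens but the last; targets are the same sequence shifted left by one."""
    return token_ids[:-1], token_ids[1:]

def mean_cross_entropy(target_probs):
    """Average negative log-likelihood (nats) of the probabilities given to the true next tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy."""
    return math.exp(loss)


if __name__ == "__main__":
    inputs, targets = next_token_pairs([464, 3290, 318, 845, 3772])
    print(inputs, targets)
    print(perplexity(mean_cross_entropy([0.5, 0.25, 0.1, 0.05])))
```

Read this way, a training loss of about 2.5 nats corresponds to a perplexity of about 12, meaning the model is on average about as uncertain as a uniform choice among 12 tokens.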
- Generates coherent short stories
- Understands basic narrative structure
- Simple language patterns
Example Output:
Prompt: "Once upon a time"
Output: "Once upon a time, there was a little boy named Timmy. Timmy loved to play outside in the park with his mommy. One day, Timmy's mommy asked him to look up..."
- Follows instructions and answers questions
- Provides explanations and how-to guides
- Handles various task types (coding, explanations, creative writing)
Example Capabilities:
- Code generation and explanation
- Question answering
- Creative writing
- Educational content
- Problem-solving assistance
Training parameters:
- batch_size: Training batch size (default: 32 for pretraining, 8 for fine-tuning)
- learning_rate: Learning rate (1e-4 for pretraining, 5e-5 for continued pretraining, 5e-3 for fine-tuning)
- num_steps: Number of training steps
- max_length: Maximum sequence length
- temperature: Sampling temperature for generation
- dropout: Dropout rate for regularization
Generation parameters:
- max_new_tokens: Maximum tokens to generate
- temperature: Controls randomness (0.1-2.0)
- top_k: Top-k sampling parameter
- top_p: Nucleus sampling parameter
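How temperature, top-k, and top-p interact is easiest to see in code. The sketch below samples one token index from a list of logits using only the standard library; it is an illustration of the general technique, not the sampling code from `chat_with_model.py`, and `sample_next_token` is a hypothetical name. It assumes `temperature > 0` and `top_k >= 1`.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample a token index from logits using temperature, top-k and top-p filtering."""
    # Temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # subtract the max for stability
    total = sum(weights)
    # Sort (probability, index) pairs from most to least likely.
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # Top-k: keep only the k most probable tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for prob, index in ranked:
            kept.append((prob, index))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept

    # choices() treats the weights as relative, so no explicit renormalization is needed.
    probs = [p for p, _ in ranked]
    indices = [i for _, i in ranked]
    return rng.choices(indices, weights=probs, k=1)[0]
```

Setting `top_k=1` reduces this to greedy decoding, while a high temperature with no filtering gives the most varied output. Passing a seeded `random.Random` instance as `rng` makes the samples reproducible.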
Checkpoint files:
- Base Model: pytorch_model.bin - Initial trained model
- Continued Training: continued_final.pt - Extended pretraining
- Fine-tuned Model: sft_test_final.pt - Instruction-tuned model
- Intermediate Checkpoints: Saved every 500 steps during training
{
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "step": current_step,
    "vocab_size": vocabulary_size
}
python test_finetuned_model.py
Tests the model on predefined instruction prompts covering:
- General knowledge questions
- Code generation tasks
- Creative writing
- Problem-solving
python sample_after_training.py
Generates samples with different temperature settings and prompts.
python chat_with_model.py
Full interactive interface with customizable parameters.
Modify the dataset loading in train_llama125m.py to use your own dataset:
dataset = load_dataset("your-dataset-name", split="train")
Key parameters to experiment with:
- Learning rate schedules
- Batch sizes
- Model dimensions
- Training steps
- Dropout rates
The model architecture can be customized in the LLaMA125M class:
- Change model dimensions
- Adjust number of layers/heads
- Modify feed-forward multiplier
- Update maximum sequence length
- CUDA Out of Memory
  - Reduce batch size
  - Use gradient checkpointing
  - Enable mixed precision training
- Checkpoint Not Found
  - Ensure previous training steps completed successfully
  - Check file paths in scripts
  - Verify checkpoint file integrity
- Poor Generation Quality
  - Increase training steps
  - Adjust learning rate
  - Try different sampling parameters
- Training Instability
  - Enable gradient clipping
  - Reduce learning rate
  - Add more regularization
- Use mixed precision training (torch.cuda.amp)
- Enable gradient checkpointing for memory efficiency
- Use DataLoader with multiple workers
- Optimize batch sizes for your hardware
- Base Training: Converges to ~2.5 loss after 500 steps
- Fine-tuning: Achieves instruction-following capability
- Generation Quality: Coherent text generation with proper formatting
- Minimum: 8GB GPU memory
- Recommended: 16GB+ GPU memory
- Training Time: ~30 minutes for base training on RTX 3080
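For a sense of where the GPU memory goes, the fixed cost of training state can be estimated with simple arithmetic. This sketch assumes full fp32 training with AdamW, which keeps two moment buffers per parameter; `training_state_bytes` is a hypothetical helper, and activation memory is deliberately left out because it depends on batch size and sequence length.

```python
def training_state_bytes(num_params=125_000_000, bytes_per_value=4):
    """Bytes for weights, gradients and AdamW's two moment buffers, all in fp32."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    adam_moments = 2 * num_params * bytes_per_value  # first and second moment estimates
    return weights + gradients + adam_moments


if __name__ == "__main__":
    print(f"{training_state_bytes() / 1e9:.1f} GB before activations")
```

That comes to about 2 GB for a 125M-parameter model, so most of the remaining budget on an 8GB card is taken by activations, which is why reducing the batch size or sequence length is usually the first fix for out-of-memory errors.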
Feel free to contribute by:
- Adding new datasets
- Implementing model improvements
- Optimizing training procedures
- Adding evaluation metrics
- Improving documentation
This project is open source. Please ensure compliance with dataset licenses:
- TinyStories: Check original dataset license
- Alpaca: Stanford Alpaca license terms
- LLaMA: Open and Efficient Foundation Language Models
- RoFormer: Enhanced Transformer with Rotary Position Embedding
- GLU Variants Improve Transformer
- Root Mean Square Layer Normalization
For issues and questions:
- Check the troubleshooting section
- Review the code comments
- Test with provided example scripts
- Verify your environment setup
Note: This is an educational implementation. For production use, consider using official LLaMA implementations or other established frameworks.