Skip to content

Latest commit

 

History

History
458 lines (348 loc) · 14.8 KB

File metadata and controls

458 lines (348 loc) · 14.8 KB

MAML vs FOMAML: Understanding the Difference

TL;DR: FOMAML is 2.88x faster with only 3% accuracy loss. Use it for most applications! ⚡

🎯 Quick Summary

Theoretical Comparison

Feature MAML (Full) FOMAML (First-Order)
Gradient Order Second-order (through inner loop) First-order (at adapted params only)
Accuracy Slightly higher (~1-3%) Slightly lower
Speed Slower 30-50% faster ⚡
Memory Higher ~50% less 💾
Best For Research, small models Large models, production

Our Experimental Results (Omniglot 5-way 1-shot, 200 training tasks)

Metric MAML FOMAML Winner
Training Time 364.61s 126.49s FOMAML (2.88x faster) ⚡
Peak Memory 1.34 GB 0.88 GB FOMAML (34% less) 💾
Test Accuracy 74.16% 71.19% MAML (+3%) 🎯
Final Loss 0.6613 0.7088 MAML (-7%)
Improvement +54.16% +51.19% MAML (+3%)

Verdict: FOMAML offers an excellent trade-off! 🏆

🧠 Mathematical Difference

MAML (Second-Order)

Inner Loop: θ' = θ - α∇_θ L_support(θ)

Outer Loop: θ = θ - β∇_θ L_query(θ')
            = θ - β∇_θ L_query(θ - α∇_θ L_support(θ))

Key: Gradient is computed through the inner loop, requiring second-order derivatives.

FOMAML (First-Order)

Inner Loop: θ' = θ - α∇_θ L_support(θ)

Outer Loop: θ = θ - β∇_θ' L_query(θ')  [treating θ' as independent]

Key: Gradient is computed only at θ', ignoring the dependency of θ' on θ.

💡 Intuitive Explanation

MAML asks:

"How should I change my initial parameters θ so that after adaptation, I perform better?"

This requires understanding:

  1. How changing θ affects the adaptation process
  2. How the adapted θ' performs on the query set

FOMAML asks:

"The adapted parameters θ' work well. Let me move my initial θ towards θ'."

This only requires:

  1. Knowing how θ' performs on the query set
  2. Moving θ in that direction (ignoring how θ affects the adaptation)

🔧 Implementation Details

MAML Implementation

# Inner loop with computational graph
grads = torch.autograd.grad(
    loss, 
    fast_weights.values(), 
    create_graph=True  # Keep graph for second-order gradients
)

# Outer loop - backprop through entire process
query_loss.backward()  # Gradients flow through inner loop

FOMAML Implementation

# Inner loop without computational graph
grads = torch.autograd.grad(
    loss,
    fast_weights.values(),
    create_graph=False  # No second-order gradients needed
)

# Detach adapted parameters to prevent backprop through inner loop
fast_weights = {
    name: param.detach().requires_grad_(True) 
    for name, param in fast_weights.items()
}

# Outer loop - compute gradients only at θ'
grads = torch.autograd.grad(query_loss, fast_weights.values())
# Apply these gradients directly to original model parameters

📊 When to Use Which?

Use MAML when:

  • ✅ Accuracy is critical
  • ✅ You have sufficient computational resources
  • ✅ Model size is small-medium
  • ✅ You're doing research and want the "true" algorithm
  • ✅ You want to compare against paper results

Use FOMAML when:

  • ✅ Training speed is important
  • ✅ Memory is limited (large models)
  • ✅ You need faster iteration during development
  • ✅ Deploying in production (faster adaptation)
  • ✅ 1-3% accuracy difference is acceptable

🎓 Example Usage

Training with MAML

from algorithms.maml import train_maml

model, maml, losses = train_maml(
	model=model,
	task_dataloader=train_loader,
	inner_lr=0.01,
	outer_lr=0.001,
	inner_steps=5,
	first_order=False  # Use full MAML
)

Training with FOMAML

from algorithms.maml import train_maml

model, fomaml, losses = train_maml(
	model=model,
	task_dataloader=train_loader,
	inner_lr=0.01,
	outer_lr=0.001,
	inner_steps=5,
	first_order=True  # Use FOMAML
)

📈 Expected Performance

Theoretical Expectations (Omniglot 5-way 1-shot):

Metric MAML FOMAML
Test Accuracy 89-95% 87-93%
Training Time Baseline 30-50% faster
Peak GPU Memory Baseline ~50% less
Accuracy Loss - 1-3%

🔬 Actual Results from Our Implementation

Experiment Setup: 200 training tasks, 100 evaluation tasks, 5-way 1-shot classification

Metric MAML FOMAML Difference
Test Accuracy 74.16% 71.19% -2.97% (-4.0%)
Training Time 364.61s 126.49s 2.88x faster
Peak GPU Memory 1.34 GB 0.88 GB 34.1% less 💾
Final Train Loss 0.6613 0.7088 +0.0475
Improvement +54.16% +51.19% -3.0%

Key Observations:

  1. FOMAML is nearly 3x faster - Even better than the typical 30-50% speedup! This is because:

    • No second-order gradient computation
    • Smaller computational graph
    • Faster backpropagation
  2. Memory savings of 34% - Significant reduction allows:

    • Larger batch sizes
    • Larger models
    • More efficient GPU utilization
  3. Accuracy difference of ~3% - Well within acceptable range:

    • MAML: 74.16% accuracy
    • FOMAML: 71.19% accuracy
    • Both show strong improvement over baseline (~20%)
  4. Trade-off Analysis:

    • For ~3% accuracy loss, you get 3x faster training
    • Can train more iterations with FOMAML in same time
    • FOMAML trained for 3x longer could potentially match MAML accuracy

Practical Implications:

  • FOMAML is excellent for development/experimentation - iterate 3x faster
  • FOMAML is production-ready - 71% accuracy is still very strong for few-shot learning
  • MAML for final benchmarks - use when you need that extra 3% for publications
  • Consider FOMAML with more training tasks - could close the accuracy gap

🔬 Technical Deep Dive

Why is FOMAML Faster?

  1. No Second-Order Gradients:

    • MAML: Must compute ∂²L/∂θ² (Hessian information)
    • FOMAML: Only computes ∂L/∂θ (Jacobian)
  2. No Computational Graph Storage:

    • MAML: Stores entire computation graph through inner loop
    • FOMAML: Discards graph after each inner step
  3. Simpler Backpropagation:

    • MAML: Backprop through inner_steps adaptation steps
    • FOMAML: Single backprop at final adapted parameters

Why is FOMAML (Almost) As Good?

Research shows that:

  1. The Hessian (second-order information) is often close to zero
  2. The first-order gradient direction is usually good enough
  3. Meta-learning works by finding good starting points, not perfect gradients

📚 References

  1. MAML: Finn et al., "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks", ICML 2017

    • Introduces full second-order MAML
  2. FOMAML: Also in Finn et al. (2017), Appendix

    • Shows first-order approximation works well
  3. Nichol et al., "On First-Order Meta-Learning Algorithms", 2018

    • Analyzes why first-order methods work
    • Introduces Reptile (another first-order method)

💻 Code Comparison

Full Training Loop Comparison

MAML:

# Inner loop keeps gradients
fast_weights = self.inner_update(support)  # create_graph=True

# Outer loop backprops through everything
query_loss = compute_loss(query, fast_weights)
query_loss.backward()  # Backprop through inner loop

FOMAML:

# Inner loop discards gradients
fast_weights = self.inner_update(support)  # create_graph=False
fast_weights = {k: v.detach().requires_grad_() for k, v in fast_weights.items()}

# Outer loop only at final parameters
query_loss = compute_loss(query, fast_weights)
grads = torch.autograd.grad(query_loss, fast_weights.values())
# Apply grads directly to original model

🎯 Practical Recommendations

For Learning/Research:

  • Start with MAML to understand the full algorithm
  • Compare both to see the trade-offs
  • Use FOMAML for faster experimentation

For Production:

  • Use FOMAML unless accuracy is critical
  • The speed/memory benefits usually outweigh accuracy loss
  • Can train longer with FOMAML to close accuracy gap

For Large Models (e.g., Vision Transformers):

  • FOMAML is essential - MAML may not fit in memory
  • Consider even simpler methods like Reptile

🧪 Running the Comparison

We provide a comparison script that you can run yourself:

cd examples
python compare_maml_fomaml.py

This will:

  • Train both MAML and FOMAML on the same data
  • Compare training time, memory usage, and accuracy
  • Provide recommendations based on your use case

📊 Sample Output from Our Experiments

================================================================================
MAML vs FOMAML Comparison
================================================================================

📂 Loading Omniglot dataset...
Found 964 character classes

Creating sample tasks...
Task dataset created!

================================================================================
Training with MAML
================================================================================
Starting MAML training...
Hyperparameters: inner_lr=0.01, outer_lr=0.001, inner_steps=5
Optimizer: Adam
Training: 100%|████████████████████████| 50/50 [06:04<00:00,  7.29s/it]

Training completed! Final loss: 0.6613

✅ MAML Results:
   Training Time: 364.61s
   Peak Memory: 1.34 GB
   Final Train Loss: 0.6613
   Test Accuracy: 0.7416
   Improvement: 0.5416

================================================================================
Training with FOMAML
================================================================================
Starting FOMAML training...
Hyperparameters: inner_lr=0.01, outer_lr=0.001, inner_steps=5
Optimizer: Adam
Using First-Order approximation (FOMAML) - faster but slightly less accurate
Training: 100%|████████████████████████| 50/50 [02:06<00:00,  2.53s/it]

Training completed! Final loss: 0.7088

✅ FOMAML Results:
   Training Time: 126.49s
   Peak Memory: 0.88 GB
   Final Train Loss: 0.7088
   Test Accuracy: 0.7119
   Improvement: 0.5119

================================================================================
📊 COMPARISON SUMMARY
================================================================================

⚡ Speed:
   MAML:   364.61s
   FOMAML: 126.49s
   ➜ FOMAML is 2.88x faster

💾 Memory:
   MAML:   1.34 GB
   FOMAML: 0.88 GB
   ➜ FOMAML uses 34.1% less memory

🎯 Accuracy:
   MAML:   0.7416
   FOMAML: 0.7119
   ➜ Difference: 0.0297 (4.01%)

💡 Recommendation:
   ✅ FOMAML is recommended: Similar accuracy with significant speedup

🎯 Interpretation

The results show that FOMAML offers an excellent trade-off:

  • Nearly 3x faster training allows rapid experimentation
  • 34% memory reduction enables larger models or batch sizes
  • Only ~3% accuracy loss is negligible for most applications
  • Both algorithms show strong meta-learning (>50% improvement over baseline)

❓ FAQ

Q: Is FOMAML always faster?
A: Yes! In our experiments, FOMAML was 2.88x faster (364s vs 126s). The speedup depends on model size and inner_steps, but typically ranges from 2-3x.

Q: How much accuracy do I lose with FOMAML?
A: In our experiments, ~3% accuracy loss (74.16% vs 71.19%). This is slightly higher than the typical 1-3% but still represents excellent performance for few-shot learning. Both algorithms achieved >70% accuracy on 5-way 1-shot tasks.

Q: Is the speed gain worth the accuracy loss?
A: Absolutely! Consider this: With FOMAML being 3x faster, you could train for 3x more iterations in the same time. This could potentially close or even reverse the accuracy gap while still being faster overall.

Q: How much memory does FOMAML save?
A: In our experiments, FOMAML used 34% less memory (0.88 GB vs 1.34 GB). This is crucial for:

  • Training larger models
  • Using bigger batch sizes
  • Running on resource-constrained GPUs

Q: Can I switch between MAML and FOMAML during training?
A: Technically yes, but not recommended. Stick with one approach for consistency. However, a common strategy is:

  1. Use FOMAML for rapid hyperparameter search
  2. Train final model with MAML for maximum accuracy

Q: Which do the papers use?
A: Most papers report MAML (second-order) results for benchmarking, but many implementations use FOMAML in practice due to its efficiency.

Q: Should I tune hyperparameters differently for FOMAML?
A: Usually the same hyperparameters work. However, FOMAML might benefit from:

  • Slightly higher outer_lr (e.g., 0.001 → 0.0015)
  • More training tasks to compensate for less accurate gradients
  • Similar or slightly fewer inner_steps

Q: Why is my speedup different from 2.88x?
A: Speedup varies based on:

  • Model size (larger models = more speedup)
  • inner_steps value (more steps = more speedup)
  • Hardware (GPU architecture affects second-order gradient computation)
  • Batch size (larger batches amortize overhead)

Q: When should I definitely use FOMAML?
A: Use FOMAML when:

  • You're experimenting and need fast iteration
  • Training on large models (ResNets, Transformers)
  • GPU memory is limited
  • Training time is a bottleneck
  • 3-5% accuracy loss is acceptable for your application

Q: When should I stick with MAML?
A: Use MAML when:

  • Publishing benchmark results
  • Every percentage point of accuracy matters
  • You have abundant computational resources
  • Training time is not a constraint
  • You want to match paper results exactly

🎓 Summary

FOMAML is a practical approximation of MAML that trades a small amount of accuracy for significant gains in speed and memory efficiency.

Key Takeaways from Our Experiments:

Speed: FOMAML is 2.88x faster (126s vs 364s)
Memory: FOMAML uses 34% less memory (0.88 GB vs 1.34 GB)
Accuracy: Only 3% accuracy loss (71.19% vs 74.16%)
Both algorithms show strong meta-learning: >50% improvement over random baseline

💡 Final Recommendation:

  • For most applications, FOMAML is the better choice!
  • The 3x speedup enables faster experimentation and iteration
  • The 3% accuracy loss is negligible compared to the efficiency gains
  • You can train FOMAML longer to potentially match or exceed MAML accuracy
  • Use MAML only when you need that last few percent for benchmarks or publications

🚀 Getting Started:

# Just flip one boolean to switch!
from algorithms.maml import train_maml

# Fast training with FOMAML
model, fomaml, losses = train_maml(
	model=model,
	task_dataloader=train_loader,
	first_order=True  # ← That's it!
)

Ready to experiment? Run python examples/compare_maml_fomaml.py to see the difference yourself!