MAML vs FOMAML: Understanding the Difference

TL;DR: FOMAML is 2.88x faster with only 3% accuracy loss. Use it for most applications! ⚡

🎯 Quick Summary

Theoretical Comparison

Feature	MAML (Full)	FOMAML (First-Order)
Gradient Order	Second-order (through inner loop)	First-order (at adapted params only)
Accuracy	Slightly higher (~1-3%)	Slightly lower
Speed	Slower	30-50% faster ⚡
Memory	Higher	~50% less 💾
Best For	Research, small models	Large models, production

Our Experimental Results (Omniglot 5-way 1-shot, 200 training tasks)

Metric	MAML	FOMAML	Winner
Training Time	364.61s	126.49s	FOMAML (2.88x faster) ⚡
Peak Memory	1.34 GB	0.88 GB	FOMAML (34% less) 💾
Test Accuracy	74.16%	71.19%	MAML (+3%) 🎯
Final Loss	0.6613	0.7088	MAML (-7%)
Improvement	+54.16%	+51.19%	MAML (+3%)

Verdict: FOMAML offers an excellent trade-off! 🏆

🧠 Mathematical Difference

MAML (Second-Order)

Inner Loop: θ' = θ - α∇_θ L_support(θ)

Outer Loop: θ = θ - β∇_θ L_query(θ')
            = θ - β∇_θ L_query(θ - α∇_θ L_support(θ))

Key: Gradient is computed through the inner loop, requiring second-order derivatives.

FOMAML (First-Order)

Inner Loop: θ' = θ - α∇_θ L_support(θ)

Outer Loop: θ = θ - β∇_θ' L_query(θ')  [treating θ' as independent]

Key: Gradient is computed only at θ', ignoring the dependency of θ' on θ.

💡 Intuitive Explanation

MAML asks:

"How should I change my initial parameters θ so that after adaptation, I perform better?"

This requires understanding:

How changing θ affects the adaptation process
How the adapted θ' performs on the query set

FOMAML asks:

"The adapted parameters θ' work well. Let me move my initial θ towards θ'."

This only requires:

Knowing how θ' performs on the query set
Moving θ in that direction (ignoring how θ affects the adaptation)

🔧 Implementation Details

MAML Implementation

# Inner loop with computational graph
grads = torch.autograd.grad(
    loss, 
    fast_weights.values(), 
    create_graph=True  # Keep graph for second-order gradients
)

# Outer loop - backprop through entire process
query_loss.backward()  # Gradients flow through inner loop

FOMAML Implementation

# Inner loop without computational graph
grads = torch.autograd.grad(
    loss,
    fast_weights.values(),
    create_graph=False  # No second-order gradients needed
)

# Detach adapted parameters to prevent backprop through inner loop
fast_weights = {
    name: param.detach().requires_grad_(True) 
    for name, param in fast_weights.items()
}

# Outer loop - compute gradients only at θ'
grads = torch.autograd.grad(query_loss, fast_weights.values())
# Apply these gradients directly to original model parameters

📊 When to Use Which?

Use MAML when:

✅ Accuracy is critical
✅ You have sufficient computational resources
✅ Model size is small-medium
✅ You're doing research and want the "true" algorithm
✅ You want to compare against paper results

Use FOMAML when:

✅ Training speed is important
✅ Memory is limited (large models)
✅ You need faster iteration during development
✅ Deploying in production (faster adaptation)
✅ 1-3% accuracy difference is acceptable

🎓 Example Usage

Training with MAML

from algorithms.maml import train_maml

model, maml, losses = train_maml(
	model=model,
	task_dataloader=train_loader,
	inner_lr=0.01,
	outer_lr=0.001,
	inner_steps=5,
	first_order=False  # Use full MAML
)

Training with FOMAML

from algorithms.maml import train_maml

model, fomaml, losses = train_maml(
	model=model,
	task_dataloader=train_loader,
	inner_lr=0.01,
	outer_lr=0.001,
	inner_steps=5,
	first_order=True  # Use FOMAML
)

📈 Expected Performance

Theoretical Expectations (Omniglot 5-way 1-shot):

Metric	MAML	FOMAML
Test Accuracy	89-95%	87-93%
Training Time	Baseline	30-50% faster
Peak GPU Memory	Baseline	~50% less
Accuracy Loss	-	1-3%

🔬 Actual Results from Our Implementation

Experiment Setup: 200 training tasks, 100 evaluation tasks, 5-way 1-shot classification

Metric	MAML	FOMAML	Difference
Test Accuracy	74.16%	71.19%	-2.97% (-4.0%)
Training Time	364.61s	126.49s	2.88x faster ⚡
Peak GPU Memory	1.34 GB	0.88 GB	34.1% less 💾
Final Train Loss	0.6613	0.7088	+0.0475
Improvement	+54.16%	+51.19%	-3.0%

Key Observations:

FOMAML is nearly 3x faster - Even better than the typical 30-50% speedup! This is because:
- No second-order gradient computation
- Smaller computational graph
- Faster backpropagation
Memory savings of 34% - Significant reduction allows:
- Larger batch sizes
- Larger models
- More efficient GPU utilization
Accuracy difference of ~3% - Well within acceptable range:
- MAML: 74.16% accuracy
- FOMAML: 71.19% accuracy
- Both show strong improvement over baseline (~20%)
Trade-off Analysis:
- For ~3% accuracy loss, you get 3x faster training
- Can train more iterations with FOMAML in same time
- FOMAML trained for 3x longer could potentially match MAML accuracy

Practical Implications:

✅ FOMAML is excellent for development/experimentation - iterate 3x faster
✅ FOMAML is production-ready - 71% accuracy is still very strong for few-shot learning
✅ MAML for final benchmarks - use when you need that extra 3% for publications
✅ Consider FOMAML with more training tasks - could close the accuracy gap

🔬 Technical Deep Dive

Why is FOMAML Faster?

No Second-Order Gradients:
- MAML: Must compute ∂²L/∂θ² (Hessian information)
- FOMAML: Only computes ∂L/∂θ (Jacobian)
No Computational Graph Storage:
- MAML: Stores entire computation graph through inner loop
- FOMAML: Discards graph after each inner step
Simpler Backpropagation:
- MAML: Backprop through inner_steps adaptation steps
- FOMAML: Single backprop at final adapted parameters

Why is FOMAML (Almost) As Good?

Research shows that:

The Hessian (second-order information) is often close to zero
The first-order gradient direction is usually good enough
Meta-learning works by finding good starting points, not perfect gradients

📚 References

MAML: Finn et al., "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks", ICML 2017
- Introduces full second-order MAML
FOMAML: Also in Finn et al. (2017), Appendix
- Shows first-order approximation works well
Nichol et al., "On First-Order Meta-Learning Algorithms", 2018
- Analyzes why first-order methods work
- Introduces Reptile (another first-order method)

💻 Code Comparison

Full Training Loop Comparison

MAML:

# Inner loop keeps gradients
fast_weights = self.inner_update(support)  # create_graph=True

# Outer loop backprops through everything
query_loss = compute_loss(query, fast_weights)
query_loss.backward()  # Backprop through inner loop

FOMAML:

# Inner loop discards gradients
fast_weights = self.inner_update(support)  # create_graph=False
fast_weights = {k: v.detach().requires_grad_() for k, v in fast_weights.items()}

# Outer loop only at final parameters
query_loss = compute_loss(query, fast_weights)
grads = torch.autograd.grad(query_loss, fast_weights.values())
# Apply grads directly to original model

🎯 Practical Recommendations

For Learning/Research:

Start with MAML to understand the full algorithm
Compare both to see the trade-offs
Use FOMAML for faster experimentation

For Production:

Use FOMAML unless accuracy is critical
The speed/memory benefits usually outweigh accuracy loss
Can train longer with FOMAML to close accuracy gap

For Large Models (e.g., Vision Transformers):

FOMAML is essential - MAML may not fit in memory
Consider even simpler methods like Reptile

🧪 Running the Comparison

We provide a comparison script that you can run yourself:

cd examples
python compare_maml_fomaml.py

This will:

Train both MAML and FOMAML on the same data
Compare training time, memory usage, and accuracy
Provide recommendations based on your use case

📊 Sample Output from Our Experiments

================================================================================
MAML vs FOMAML Comparison
================================================================================

📂 Loading Omniglot dataset...
Found 964 character classes

Creating sample tasks...
Task dataset created!

================================================================================
Training with MAML
================================================================================
Starting MAML training...
Hyperparameters: inner_lr=0.01, outer_lr=0.001, inner_steps=5
Optimizer: Adam
Training: 100%|████████████████████████| 50/50 [06:04<00:00,  7.29s/it]

Training completed! Final loss: 0.6613

✅ MAML Results:
   Training Time: 364.61s
   Peak Memory: 1.34 GB
   Final Train Loss: 0.6613
   Test Accuracy: 0.7416
   Improvement: 0.5416

================================================================================
Training with FOMAML
================================================================================
Starting FOMAML training...
Hyperparameters: inner_lr=0.01, outer_lr=0.001, inner_steps=5
Optimizer: Adam
Using First-Order approximation (FOMAML) - faster but slightly less accurate
Training: 100%|████████████████████████| 50/50 [02:06<00:00,  2.53s/it]

Training completed! Final loss: 0.7088

✅ FOMAML Results:
   Training Time: 126.49s
   Peak Memory: 0.88 GB
   Final Train Loss: 0.7088
   Test Accuracy: 0.7119
   Improvement: 0.5119

================================================================================
📊 COMPARISON SUMMARY
================================================================================

⚡ Speed:
   MAML:   364.61s
   FOMAML: 126.49s
   ➜ FOMAML is 2.88x faster

💾 Memory:
   MAML:   1.34 GB
   FOMAML: 0.88 GB
   ➜ FOMAML uses 34.1% less memory

🎯 Accuracy:
   MAML:   0.7416
   FOMAML: 0.7119
   ➜ Difference: 0.0297 (4.01%)

💡 Recommendation:
   ✅ FOMAML is recommended: Similar accuracy with significant speedup

🎯 Interpretation

The results show that FOMAML offers an excellent trade-off:

Nearly 3x faster training allows rapid experimentation
34% memory reduction enables larger models or batch sizes
Only ~3% accuracy loss is negligible for most applications
Both algorithms show strong meta-learning (>50% improvement over baseline)

❓ FAQ

Q: Is FOMAML always faster?
A: Yes! In our experiments, FOMAML was 2.88x faster (364s vs 126s). The speedup depends on model size and inner_steps, but typically ranges from 2-3x.

Q: How much accuracy do I lose with FOMAML?
A: In our experiments, ~3% accuracy loss (74.16% vs 71.19%). This is slightly higher than the typical 1-3% but still represents excellent performance for few-shot learning. Both algorithms achieved >70% accuracy on 5-way 1-shot tasks.

Q: Is the speed gain worth the accuracy loss?
A: Absolutely! Consider this: With FOMAML being 3x faster, you could train for 3x more iterations in the same time. This could potentially close or even reverse the accuracy gap while still being faster overall.

Q: How much memory does FOMAML save?
A: In our experiments, FOMAML used 34% less memory (0.88 GB vs 1.34 GB). This is crucial for:

Training larger models
Using bigger batch sizes
Running on resource-constrained GPUs

Q: Can I switch between MAML and FOMAML during training?
A: Technically yes, but not recommended. Stick with one approach for consistency. However, a common strategy is:

Use FOMAML for rapid hyperparameter search
Train final model with MAML for maximum accuracy

Q: Which do the papers use?
A: Most papers report MAML (second-order) results for benchmarking, but many implementations use FOMAML in practice due to its efficiency.

Q: Should I tune hyperparameters differently for FOMAML?
A: Usually the same hyperparameters work. However, FOMAML might benefit from:

Slightly higher outer_lr (e.g., 0.001 → 0.0015)
More training tasks to compensate for less accurate gradients
Similar or slightly fewer inner_steps

Q: Why is my speedup different from 2.88x?
A: Speedup varies based on:

Model size (larger models = more speedup)
inner_steps value (more steps = more speedup)
Hardware (GPU architecture affects second-order gradient computation)
Batch size (larger batches amortize overhead)

Q: When should I definitely use FOMAML?
A: Use FOMAML when:

You're experimenting and need fast iteration
Training on large models (ResNets, Transformers)
GPU memory is limited
Training time is a bottleneck
3-5% accuracy loss is acceptable for your application

Q: When should I stick with MAML?
A: Use MAML when:

Publishing benchmark results
Every percentage point of accuracy matters
You have abundant computational resources
Training time is not a constraint
You want to match paper results exactly

🎓 Summary

FOMAML is a practical approximation of MAML that trades a small amount of accuracy for significant gains in speed and memory efficiency.

Key Takeaways from Our Experiments:

✅ Speed: FOMAML is 2.88x faster (126s vs 364s)
✅ Memory: FOMAML uses 34% less memory (0.88 GB vs 1.34 GB)
✅ Accuracy: Only 3% accuracy loss (71.19% vs 74.16%)
✅ Both algorithms show strong meta-learning: >50% improvement over random baseline

💡 Final Recommendation:

For most applications, FOMAML is the better choice! ⚡
The 3x speedup enables faster experimentation and iteration
The 3% accuracy loss is negligible compared to the efficiency gains
You can train FOMAML longer to potentially match or exceed MAML accuracy
Use MAML only when you need that last few percent for benchmarks or publications

🚀 Getting Started:

# Just flip one boolean to switch!
from algorithms.maml import train_maml

# Fast training with FOMAML
model, fomaml, losses = train_maml(
	model=model,
	task_dataloader=train_loader,
	first_order=True  # ← That's it!
)

Ready to experiment? Run python examples/compare_maml_fomaml.py to see the difference yourself!

FilesExpand file tree

MAML_vs_FOMAML.md

Latest commit

History