TL;DR: FOMAML is 2.88x faster with only 3% accuracy loss. Use it for most applications! ⚡
| Feature | MAML (Full) | FOMAML (First-Order) |
|---|---|---|
| Gradient Order | Second-order (through inner loop) | First-order (at adapted params only) |
| Accuracy | Slightly higher (~1-3%) | Slightly lower |
| Speed | Slower | 30-50% faster ⚡ |
| Memory | Higher | ~50% less 💾 |
| Best For | Research, small models | Large models, production |
| Metric | MAML | FOMAML | Winner |
|---|---|---|---|
| Training Time | 364.61s | 126.49s | FOMAML (2.88x faster) ⚡ |
| Peak Memory | 1.34 GB | 0.88 GB | FOMAML (34% less) 💾 |
| Test Accuracy | 74.16% | 71.19% | MAML (+3%) 🎯 |
| Final Loss | 0.6613 | 0.7088 | MAML (-7%) |
| Improvement | +54.16% | +51.19% | MAML (+3%) |
Verdict: FOMAML offers an excellent trade-off! 🏆
Inner Loop: θ' = θ - α∇_θ L_support(θ)
Outer Loop: θ = θ - β∇_θ L_query(θ')
= θ - β∇_θ L_query(θ - α∇_θ L_support(θ))
Key: Gradient is computed through the inner loop, requiring second-order derivatives.
Inner Loop: θ' = θ - α∇_θ L_support(θ)
Outer Loop: θ = θ - β∇_θ' L_query(θ') [treating θ' as independent]
Key: Gradient is computed only at θ', ignoring the dependency of θ' on θ.
"How should I change my initial parameters θ so that after adaptation, I perform better?"
This requires understanding:
- How changing θ affects the adaptation process
- How the adapted θ' performs on the query set
"The adapted parameters θ' work well. Let me move my initial θ towards θ'."
This only requires:
- Knowing how θ' performs on the query set
- Moving θ in that direction (ignoring how θ affects the adaptation)
# Inner loop with computational graph
grads = torch.autograd.grad(
loss,
fast_weights.values(),
create_graph=True # Keep graph for second-order gradients
)
# Outer loop - backprop through entire process
query_loss.backward() # Gradients flow through inner loop# Inner loop without computational graph
grads = torch.autograd.grad(
loss,
fast_weights.values(),
create_graph=False # No second-order gradients needed
)
# Detach adapted parameters to prevent backprop through inner loop
fast_weights = {
name: param.detach().requires_grad_(True)
for name, param in fast_weights.items()
}
# Outer loop - compute gradients only at θ'
grads = torch.autograd.grad(query_loss, fast_weights.values())
# Apply these gradients directly to original model parameters- ✅ Accuracy is critical
- ✅ You have sufficient computational resources
- ✅ Model size is small-medium
- ✅ You're doing research and want the "true" algorithm
- ✅ You want to compare against paper results
- ✅ Training speed is important
- ✅ Memory is limited (large models)
- ✅ You need faster iteration during development
- ✅ Deploying in production (faster adaptation)
- ✅ 1-3% accuracy difference is acceptable
from algorithms.maml import train_maml
model, maml, losses = train_maml(
model=model,
task_dataloader=train_loader,
inner_lr=0.01,
outer_lr=0.001,
inner_steps=5,
first_order=False # Use full MAML
)from algorithms.maml import train_maml
model, fomaml, losses = train_maml(
model=model,
task_dataloader=train_loader,
inner_lr=0.01,
outer_lr=0.001,
inner_steps=5,
first_order=True # Use FOMAML
)| Metric | MAML | FOMAML |
|---|---|---|
| Test Accuracy | 89-95% | 87-93% |
| Training Time | Baseline | 30-50% faster |
| Peak GPU Memory | Baseline | ~50% less |
| Accuracy Loss | - | 1-3% |
Experiment Setup: 200 training tasks, 100 evaluation tasks, 5-way 1-shot classification
| Metric | MAML | FOMAML | Difference |
|---|---|---|---|
| Test Accuracy | 74.16% | 71.19% | -2.97% (-4.0%) |
| Training Time | 364.61s | 126.49s | 2.88x faster ⚡ |
| Peak GPU Memory | 1.34 GB | 0.88 GB | 34.1% less 💾 |
| Final Train Loss | 0.6613 | 0.7088 | +0.0475 |
| Improvement | +54.16% | +51.19% | -3.0% |
-
FOMAML is nearly 3x faster - Even better than the typical 30-50% speedup! This is because:
- No second-order gradient computation
- Smaller computational graph
- Faster backpropagation
-
Memory savings of 34% - Significant reduction allows:
- Larger batch sizes
- Larger models
- More efficient GPU utilization
-
Accuracy difference of ~3% - Well within acceptable range:
- MAML: 74.16% accuracy
- FOMAML: 71.19% accuracy
- Both show strong improvement over baseline (~20%)
-
Trade-off Analysis:
- For ~3% accuracy loss, you get 3x faster training
- Can train more iterations with FOMAML in same time
- FOMAML trained for 3x longer could potentially match MAML accuracy
- ✅ FOMAML is excellent for development/experimentation - iterate 3x faster
- ✅ FOMAML is production-ready - 71% accuracy is still very strong for few-shot learning
- ✅ MAML for final benchmarks - use when you need that extra 3% for publications
- ✅ Consider FOMAML with more training tasks - could close the accuracy gap
-
No Second-Order Gradients:
- MAML: Must compute ∂²L/∂θ² (Hessian information)
- FOMAML: Only computes ∂L/∂θ (Jacobian)
-
No Computational Graph Storage:
- MAML: Stores entire computation graph through inner loop
- FOMAML: Discards graph after each inner step
-
Simpler Backpropagation:
- MAML: Backprop through
inner_stepsadaptation steps - FOMAML: Single backprop at final adapted parameters
- MAML: Backprop through
Research shows that:
- The Hessian (second-order information) is often close to zero
- The first-order gradient direction is usually good enough
- Meta-learning works by finding good starting points, not perfect gradients
-
MAML: Finn et al., "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks", ICML 2017
- Introduces full second-order MAML
-
FOMAML: Also in Finn et al. (2017), Appendix
- Shows first-order approximation works well
-
Nichol et al., "On First-Order Meta-Learning Algorithms", 2018
- Analyzes why first-order methods work
- Introduces Reptile (another first-order method)
MAML:
# Inner loop keeps gradients
fast_weights = self.inner_update(support) # create_graph=True
# Outer loop backprops through everything
query_loss = compute_loss(query, fast_weights)
query_loss.backward() # Backprop through inner loopFOMAML:
# Inner loop discards gradients
fast_weights = self.inner_update(support) # create_graph=False
fast_weights = {k: v.detach().requires_grad_() for k, v in fast_weights.items()}
# Outer loop only at final parameters
query_loss = compute_loss(query, fast_weights)
grads = torch.autograd.grad(query_loss, fast_weights.values())
# Apply grads directly to original model- Start with MAML to understand the full algorithm
- Compare both to see the trade-offs
- Use FOMAML for faster experimentation
- Use FOMAML unless accuracy is critical
- The speed/memory benefits usually outweigh accuracy loss
- Can train longer with FOMAML to close accuracy gap
- FOMAML is essential - MAML may not fit in memory
- Consider even simpler methods like Reptile
We provide a comparison script that you can run yourself:
cd examples
python compare_maml_fomaml.pyThis will:
- Train both MAML and FOMAML on the same data
- Compare training time, memory usage, and accuracy
- Provide recommendations based on your use case
================================================================================
MAML vs FOMAML Comparison
================================================================================
📂 Loading Omniglot dataset...
Found 964 character classes
Creating sample tasks...
Task dataset created!
================================================================================
Training with MAML
================================================================================
Starting MAML training...
Hyperparameters: inner_lr=0.01, outer_lr=0.001, inner_steps=5
Optimizer: Adam
Training: 100%|████████████████████████| 50/50 [06:04<00:00, 7.29s/it]
Training completed! Final loss: 0.6613
✅ MAML Results:
Training Time: 364.61s
Peak Memory: 1.34 GB
Final Train Loss: 0.6613
Test Accuracy: 0.7416
Improvement: 0.5416
================================================================================
Training with FOMAML
================================================================================
Starting FOMAML training...
Hyperparameters: inner_lr=0.01, outer_lr=0.001, inner_steps=5
Optimizer: Adam
Using First-Order approximation (FOMAML) - faster but slightly less accurate
Training: 100%|████████████████████████| 50/50 [02:06<00:00, 2.53s/it]
Training completed! Final loss: 0.7088
✅ FOMAML Results:
Training Time: 126.49s
Peak Memory: 0.88 GB
Final Train Loss: 0.7088
Test Accuracy: 0.7119
Improvement: 0.5119
================================================================================
📊 COMPARISON SUMMARY
================================================================================
⚡ Speed:
MAML: 364.61s
FOMAML: 126.49s
➜ FOMAML is 2.88x faster
💾 Memory:
MAML: 1.34 GB
FOMAML: 0.88 GB
➜ FOMAML uses 34.1% less memory
🎯 Accuracy:
MAML: 0.7416
FOMAML: 0.7119
➜ Difference: 0.0297 (4.01%)
💡 Recommendation:
✅ FOMAML is recommended: Similar accuracy with significant speedup
The results show that FOMAML offers an excellent trade-off:
- Nearly 3x faster training allows rapid experimentation
- 34% memory reduction enables larger models or batch sizes
- Only ~3% accuracy loss is negligible for most applications
- Both algorithms show strong meta-learning (>50% improvement over baseline)
Q: Is FOMAML always faster?
A: Yes! In our experiments, FOMAML was 2.88x faster (364s vs 126s). The speedup depends on model size and inner_steps, but typically ranges from 2-3x.
Q: How much accuracy do I lose with FOMAML?
A: In our experiments, ~3% accuracy loss (74.16% vs 71.19%). This is slightly higher than the typical 1-3% but still represents excellent performance for few-shot learning. Both algorithms achieved >70% accuracy on 5-way 1-shot tasks.
Q: Is the speed gain worth the accuracy loss?
A: Absolutely! Consider this: With FOMAML being 3x faster, you could train for 3x more iterations in the same time. This could potentially close or even reverse the accuracy gap while still being faster overall.
Q: How much memory does FOMAML save?
A: In our experiments, FOMAML used 34% less memory (0.88 GB vs 1.34 GB). This is crucial for:
- Training larger models
- Using bigger batch sizes
- Running on resource-constrained GPUs
Q: Can I switch between MAML and FOMAML during training?
A: Technically yes, but not recommended. Stick with one approach for consistency. However, a common strategy is:
- Use FOMAML for rapid hyperparameter search
- Train final model with MAML for maximum accuracy
Q: Which do the papers use?
A: Most papers report MAML (second-order) results for benchmarking, but many implementations use FOMAML in practice due to its efficiency.
Q: Should I tune hyperparameters differently for FOMAML?
A: Usually the same hyperparameters work. However, FOMAML might benefit from:
- Slightly higher
outer_lr(e.g., 0.001 → 0.0015) - More training tasks to compensate for less accurate gradients
- Similar or slightly fewer
inner_steps
Q: Why is my speedup different from 2.88x?
A: Speedup varies based on:
- Model size (larger models = more speedup)
inner_stepsvalue (more steps = more speedup)- Hardware (GPU architecture affects second-order gradient computation)
- Batch size (larger batches amortize overhead)
Q: When should I definitely use FOMAML?
A: Use FOMAML when:
- You're experimenting and need fast iteration
- Training on large models (ResNets, Transformers)
- GPU memory is limited
- Training time is a bottleneck
- 3-5% accuracy loss is acceptable for your application
Q: When should I stick with MAML?
A: Use MAML when:
- Publishing benchmark results
- Every percentage point of accuracy matters
- You have abundant computational resources
- Training time is not a constraint
- You want to match paper results exactly
FOMAML is a practical approximation of MAML that trades a small amount of accuracy for significant gains in speed and memory efficiency.
✅ Speed: FOMAML is 2.88x faster (126s vs 364s)
✅ Memory: FOMAML uses 34% less memory (0.88 GB vs 1.34 GB)
✅ Accuracy: Only 3% accuracy loss (71.19% vs 74.16%)
✅ Both algorithms show strong meta-learning: >50% improvement over random baseline
- For most applications, FOMAML is the better choice! ⚡
- The 3x speedup enables faster experimentation and iteration
- The 3% accuracy loss is negligible compared to the efficiency gains
- You can train FOMAML longer to potentially match or exceed MAML accuracy
- Use MAML only when you need that last few percent for benchmarks or publications
# Just flip one boolean to switch!
from algorithms.maml import train_maml
# Fast training with FOMAML
model, fomaml, losses = train_maml(
model=model,
task_dataloader=train_loader,
first_order=True # ← That's it!
)Ready to experiment? Run python examples/compare_maml_fomaml.py to see the difference yourself!