
Commit c104b9d

More benchmarking data and analysis
1 parent 81a41a5 commit c104b9d

9 files changed

+3942
-0
lines changed

benchmarking/metashape/COMPREHENSIVE_ANALYSIS_REPORT.md

Lines changed: 1253 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 327 additions & 0 deletions
@@ -0,0 +1,327 @@
1+
# Photogrammetry Workflow Scaling Analysis - Key Findings
2+
3+
## Understanding Efficiency Metrics
4+
5+
**Efficiency = (Speedup / CoreRatio) × 100%**
6+
7+
- **100% = Perfect Scaling**: Doubling cores doubles speed
8+
- **>100% = Super-linear**: Better than expected (cache effects, memory bandwidth)
9+
- **<100% = Sub-linear**: Parallelization overhead
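The definition above can be sketched in a few lines. This is a minimal illustration, not code from the benchmark scripts: the function name `scaling_efficiency` is hypothetical, and CoreRatio is assumed to mean the larger core count divided by the smaller one.

```python
def scaling_efficiency(speedup: float, core_ratio: float) -> float:
    """Return scaling efficiency in percent: (speedup / core_ratio) * 100."""
    if core_ratio <= 0:
        raise ValueError("core_ratio must be positive")
    return speedup / core_ratio * 100.0


# Going from 16 to 32 cores is a core ratio of 2 (assumed interpretation).
core_ratio = 32 / 16

# A 2.30x speedup is above perfect scaling; a 1.02x speedup is about half of it.
print(f"{scaling_efficiency(2.30, core_ratio):.0f}%")
print(f"{scaling_efficiency(1.02, core_ratio):.0f}%")
```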
10+
11+
**Why can efficiency exceed 100%?**
12+
1. **Cache effects**: More cores = more L3 cache
13+
2. **Memory bandwidth**: Better utilization with more cores
14+
3. **NUMA locality**: Better memory placement
15+
4. **Reduced contention**: Less lock contention per core
16+
17+
## PART 1: CPU Scaling Analysis (16c vs 32c)
18+
19+
### Summary by Step Category
20+
21+
| Step Type | Avg Efficiency | Avg Speedup | CPU% (16c) | CPU% (32c) | Recommendation |
22+
|-----------|---------------|-------------|------------|------------|----------------|
23+
| **match_photos** | 115% ✓ | 2.30x | 57% | 8% | **Use 32c - super-linear!** |
24+
| **build_point_cloud** (buildPointCloud) | 78% | 1.57x | 73% | 60% | Use 32c or bin-pack |
25+
| **build_point_cloud** (classifyGroundPoints) | 85% ✓ | 1.70x | 80% | 71% | **Use 32c - good scaling** |
26+
| **align_cameras** (alignCameras) | 70% | 1.41x | 77% | 69% | Consider 16c or bin-pack |
27+
| **align_cameras** (optimizeCameras) | 62% | 1.23x | 62% | 48% | Consider 16c or bin-pack |
28+
| **build_mesh** (buildModel) | 69% | 1.37x | 52% | 36% | Consider 16c or bin-pack |
29+
| **build_dem_orthomosaic** | 59% | 1.19x | 23% | 15% | **Use 16c or bin-pack** |
30+
| **build_depth_maps** | 51% | 1.02x | 18% | 18% | **Use 16c or bin-pack** |
31+
| **setup** | 47% | 0.94x | 5% | 2% | **Use 16c or bin-pack** |
32+
33+
### Detailed Step-by-Step Findings
34+
35+
#### 1. match_photos / matchPhotos ⭐ **SUPER-LINEAR SCALING**
36+
37+
**Performance:** 115% efficiency (2.30x speedup)
38+
39+
**Key Finding:** This is the ONLY step that shows super-linear scaling!
40+
41+
**Variability:** HIGH (±27% std dev) - performance depends on dataset:
42+
- Best: Project 000195: **137% efficiency** (2300s → 838s)
43+
- Worst: benchmarking-emerald-subset: **56% efficiency** (64s → 57s)
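The per-project figures above can be reproduced from the wall-clock times. The sketch below is illustrative only: `efficiency_from_times` is a hypothetical helper, and it assumes the first time in each pair is the 16c run and the second is the 32c run.

```python
def efficiency_from_times(t_small: float, t_large: float,
                          cores_small: int = 16, cores_large: int = 32) -> float:
    """Scaling efficiency (%) from runtimes on the smaller and larger machine."""
    speedup = t_small / t_large             # how much faster the larger machine is
    core_ratio = cores_large / cores_small  # 2.0 for 16c vs 32c
    return speedup / core_ratio * 100.0


# Timings quoted above (seconds on 16c, seconds on 32c).
best = efficiency_from_times(2300, 838)   # Project 000195
worst = efficiency_from_times(64, 57)     # benchmarking-emerald-subset

print(f"best:  {best:.0f}%")
print(f"worst: {worst:.0f}%")
```

Note that the worst case is also the shortest run (about a minute), where fixed startup costs are likely to dominate.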
44+
45+
**Why super-linear?**
46+
- CPU% drops from 57% to 8% on 32c, indicating this is NOT compute-bound
47+
- Likely memory/cache bound - benefits from 2x more L3 cache on 32-core system
48+
- Algorithm parallelizes extremely well with more cores
49+
50+
**Recommendation:** **Strongly prefer m3.xl (32c) for this step**
51+
52+
---
53+
54+
#### 2. build_point_cloud / classifyGroundPoints ⭐ **EXCELLENT SCALING**
55+
56+
**Performance:** 85% efficiency (1.70x speedup)
57+
58+
**Key Finding:** Near-linear scaling; one project reached 101% efficiency
59+
60+
**Variability:** MODERATE (±12% std dev)
61+
- Best: Project 0068_000434_000440: **101% efficiency** (43243s → 21484s)
62+
- Worst: Project 000192: **62% efficiency** (795s → 478s)
63+
64+
**CPU Utilization:** High on both (80% → 71%)
65+
66+
**Recommendation:** **Use m3.xl (32c) - excellent value**
67+
68+
---
69+
70+
#### 3. build_point_cloud / buildPointCloud - **GOOD SCALING**
71+
72+
**Performance:** 78% efficiency (1.57x speedup)
73+
74+
**Variability:** MODERATE (±9% std dev)
75+
- Best: Project 0131_000015_000013: **91% efficiency**
76+
- Worst: Project 0068_000434_000440: **66% efficiency**
77+
78+
**CPU Utilization:** High (73% → 60%)
79+
80+
**Recommendation:** **m3.xl (32c) is acceptable; otherwise bin-pack**
81+
82+
---
83+
84+
#### 4. align_cameras / alignCameras - **MODERATE SCALING**
85+
86+
**Performance:** 70% efficiency (1.41x speedup)
87+
88+
**Variability:** MODERATE (±7% std dev)
89+
- Range: 58% to 80% efficiency
90+
91+
**CPU Utilization:** Very high (77% → 69%) - this is CPU-intensive
92+
93+
**Recommendation:** **Marginal benefit from 32c. Consider 16c or pack 2 jobs on 32c**
94+
95+
**Key Insight:** High CPU usage (77%) but only 70% efficiency suggests:
96+
- Process is trying to use all cores
97+
- But parallelization overhead limits speedup
98+
- **This is a prime candidate for bin-packing 2 jobs on m3.xl**
99+
100+
---
101+
102+
#### 5. build_depth_maps / buildDepthMaps - **POOR CPU SCALING** (GPU step)
103+
104+
**Performance:** 51% efficiency (1.02x speedup)
105+
106+
**CPU Utilization:** Very LOW (18% on both 16c and 32c)
107+
108+
**Why?** This is GPU-bound, not CPU-bound. Extra CPU cores don't help.
109+
110+
**Recommendation:** **Use 16c or bin-pack - 32 cores are wasted on this step**
111+
112+
---
113+
114+
#### 6. build_dem_orthomosaic (all substeps) - **POOR SCALING**
115+
116+
**Performance:** 59% efficiency (1.19x speedup)
117+
118+
**CPU Utilization:** Very LOW (23% → 15%)
119+
120+
**Key Issue:** These steps don't parallelize well AND don't use many cores
121+
122+
**Recommendation:** **Use 16c or bin-pack 2 jobs on 32c**
123+
124+
---
125+
126+
#### 7. setup / addPhotos - **NO SCALING BENEFIT**
127+
128+
**Performance:** 47% efficiency (0.94x speedup - SLOWER on 32c!)
129+
130+
**CPU Utilization:** Nearly zero (5% → 2%)
131+
132+
**Why?** I/O bound, not compute bound. Just loading data.
133+
134+
**Recommendation:** **Use 16c or bin-pack many jobs**
135+
136+
---
137+
138+
## PART 2: Running 2 Jobs on m3.xl vs 1 Job on m3.large
139+
140+
### Analysis Summary
141+
142+
**Overall CPU utilization on 16c: 34.4%**
143+
**Steps with <50% CPU: 14/20 (70%)**
144+
145+
### **STRONG RECOMMENDATION: Run 2 jobs on m3.xl**
146+
147+
**Why this works:**
148+
1. Most steps use <50% CPU on 16 cores
149+
2. When 2 jobs run on 32 cores, each gets ~16 cores worth of CPU time
150+
3. Linux scheduler distributes fairly between processes
151+
4. Minimal interference expected
152+
153+
### Per-Step Bin-Packing Suitability
154+
155+
| Category | Steps | Can Pack 2 Jobs? | Reasoning |
156+
|----------|-------|-----------------|-----------|
157+
| **Safe** | setup, build_dem_orthomosaic (all), build_depth_maps, build_mesh (export), finalize | **YES** | CPU <50%, plenty of headroom |
158+
| **Caution** | match_photos, align_cameras (optimize), build_mesh (buildModel) | **MAYBE** | CPU 50-70%, some contention risk |
159+
| **Avoid** | align_cameras (align), build_point_cloud (both) | **NO** | CPU >70%, likely contention |
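The categories in this table follow directly from the 16c CPU utilization thresholds (<50%, 50-70%, >70%). A minimal sketch of that rule is shown here. The name `packing_category` is hypothetical, the CPU values are copied from the Part 1 summary table, and treating exactly 70% as "Caution" is an assumption, since the table does not say which side that boundary falls on.

```python
# Average CPU% on 16 cores, taken from the Part 1 summary table.
CPU_PCT_16C = {
    "match_photos": 57,
    "build_point_cloud/buildPointCloud": 73,
    "build_point_cloud/classifyGroundPoints": 80,
    "align_cameras/alignCameras": 77,
    "align_cameras/optimizeCameras": 62,
    "build_mesh/buildModel": 52,
    "build_dem_orthomosaic": 23,
    "build_depth_maps": 18,
    "setup": 5,
}


def packing_category(cpu_pct: float) -> str:
    """Classify a step for 2-job bin-packing by its 16c CPU utilization."""
    if cpu_pct < 50:
        return "Safe"
    if cpu_pct <= 70:  # boundary handling is an assumption
        return "Caution"
    return "Avoid"


for step, cpu in sorted(CPU_PCT_16C.items(), key=lambda kv: kv[1]):
    print(f"{step:45s} {cpu:3d}%  {packing_category(cpu)}")
```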
160+
161+
### Expected Performance Impact
162+
163+
**Conservative estimate:** Each job will perform at **90-95%** of m3.large speed
164+
165+
**Reasoning:**
166+
- 70% of steps will run at 100% speed (CPU <50%)
167+
- 15% of steps may slow 10-20% (CPU 50-70%)
168+
- 15% of steps may slow 20-40% (CPU >70%)
169+
170+
**Weighted average:** ~90-95% performance with **2x throughput** = huge win!
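The weighted average can be checked with a short calculation. This sketch makes two assumptions that the text does not state: "may slow 10-20%" is read as running at 80-90% of m3.large speed (likewise 60-80% for the last group), and the weights are shares of steps rather than shares of total runtime.

```python
# (share of steps, slowest relative speed, fastest relative speed)
GROUPS = [
    (0.70, 1.00, 1.00),  # CPU <50%: no slowdown expected
    (0.15, 0.80, 0.90),  # CPU 50-70%: 10-20% slower
    (0.15, 0.60, 0.80),  # CPU >70%: 20-40% slower
]


def weighted_speed_range(groups):
    """Return (pessimistic, optimistic) relative speed per job when 2 jobs share a node."""
    low = sum(share * slow for share, slow, _ in groups)
    high = sum(share * fast for share, _, fast in groups)
    return low, high


low, high = weighted_speed_range(GROUPS)
print(f"per-job speed: {low:.1%} to {high:.1%} of m3.large")
```

Weighting by runtime instead of step count could shift this range, because the high-CPU steps (point cloud building, camera alignment) are among the longer ones.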
171+
172+
### Cost-Benefit Analysis
173+
174+
| Configuration | Cost | Jobs/Instance | Throughput | Cost per Job |
175+
|--------------|------|---------------|------------|--------------|
176+
| m3.large (16c) | 1.0x | 1 | 1.0x | 1.0x |
177+
| m3.xl single (32c) | 1.5x | 1 | 1.0x | 1.5x ❌ |
178+
| m3.xl dual (32c) | 1.5x | 2 | 1.8-1.9x | **0.79-0.83x** |
179+
180+
**Conclusion:** Running 2 jobs on m3.xl gives roughly **17-21% cost savings** per job!
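The cost-per-job column is relative instance cost divided by relative throughput. A minimal sketch follows; the function name is hypothetical, and the 1.5x price ratio and 1.8-1.9x throughput are the figures from the table, not independently measured.

```python
def cost_per_job(relative_cost: float, relative_throughput: float) -> float:
    """Cost per completed job, relative to one job on m3.large."""
    return relative_cost / relative_throughput


configs = {
    "m3.large (16c)": (1.0, 1.0),
    "m3.xl single (32c)": (1.5, 1.0),
    "m3.xl dual, pessimistic": (1.5, 1.8),
    "m3.xl dual, optimistic": (1.5, 1.9),
}

for name, (cost, throughput) in configs.items():
    per_job = cost_per_job(cost, throughput)
    saving = (1.0 - per_job) * 100.0  # negative means more expensive per job
    print(f"{name:26s} cost/job = {per_job:.2f}x  saving = {saving:+.0f}%")
```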
181+
182+
---
183+
184+
## PART 3: MIG GPU Scaling
185+
186+
### Key Findings by Step
187+
188+
#### build_depth_maps / buildDepthMaps ⭐ **EXCEPTIONAL MIG PERFORMANCE**
189+
190+
**MIG Scaling Efficiency (adding slices):**
191+
- 1g → 2g: **79% efficiency** (1.57x speedup)
192+
- 1g → 3g: **67% efficiency** (2.00x speedup)
193+
194+
**Multiple small vs single large:**
195+
- 2×1g vs 1×2g: **2×1g is 8% faster**
196+
- 3×1g vs 1×3g: **3×1g is 15% faster**
197+
198+
**MIG vs Full GPU - EXCEPTIONAL RESULTS:**
199+
200+
| Config | Expected Slowdown | Actual Slowdown | Efficiency |
201+
|--------|------------------|-----------------|-----------|
202+
| 1×1g (1/7 GPU) | 7.0x | **2.97x** | **236%** |
203+
| 1×2g (2/7 GPU) | 3.5x | **1.89x** | **185%** |
204+
| 1×3g (3/7 GPU) | 2.33x | **1.49x** | **157%** |
205+
| 2×1g (2/7 GPU) | 3.5x | **1.73x** | **202%** |
206+
| 3×1g (3/7 GPU) | 2.33x | **1.26x** | **185%** |
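The efficiency column is the expected slowdown divided by the actual slowdown, where the expected slowdown assumes performance scales linearly with the fraction of the GPU in use. A small sketch is given here. `mig_efficiency` is a hypothetical name, and the 7-slice total is taken from the "1/7 GPU" labels in the table.

```python
TOTAL_SLICES = 7  # the table treats a full GPU as 7 compute slices


def mig_efficiency(slices_used: int, actual_slowdown: float) -> float:
    """Efficiency (%) relative to linear scaling with the fraction of GPU used."""
    expected_slowdown = TOTAL_SLICES / slices_used  # e.g. 7.0x for a single 1g slice
    return expected_slowdown / actual_slowdown * 100.0


# (config, total slices used, measured slowdown vs full GPU) for buildDepthMaps
BUILD_DEPTH_MAPS = [
    ("1x1g", 1, 2.97),
    ("1x2g", 2, 1.89),
    ("1x3g", 3, 1.49),
    ("2x1g", 2, 1.73),
    ("3x1g", 3, 1.26),
]

for name, slices, slowdown in BUILD_DEPTH_MAPS:
    print(f"{name}: {mig_efficiency(slices, slowdown):.0f}%")
```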
207+
208+
**Interpretation:**
209+
- A 1/7 GPU slice is only **3x slower** instead of 7x slower!
210+
- MIG isolation overhead is **minimal to non-existent**
211+
- Workload is NOT memory bandwidth limited
212+
- **All MIG configs reach 157-236% of the performance that linear scaling would predict**
213+
214+
**Variability:** LOW (±5-14% std dev) - very consistent across projects
215+
216+
---
217+
218+
#### match_photos / matchPhotos ⭐ **EVEN BETTER MIG PERFORMANCE**
219+
220+
**MIG vs Full GPU - ASTONISHING RESULTS:**
221+
222+
| Config | Expected Slowdown | Actual Slowdown | Efficiency |
223+
|--------|------------------|-----------------|-----------|
224+
| 1×1g (1/7 GPU) | 7.0x | **1.53x** | **463%** 🚀 |
225+
| 1×2g (2/7 GPU) | 3.5x | **1.16x** | **303%** 🚀 |
226+
| 1×3g (3/7 GPU) | 2.33x | **1.09x** | **215%** 🚀 |
227+
| 2×1g (2/7 GPU) | 3.5x | **0.99x** | **355%** 🚀 |
228+
| 3×1g (3/7 GPU) | 2.33x | **0.89x** | **263%** 🚀 |
229+
230+
**INCREDIBLE:**
231+
- **2×1g is the same speed as a full GPU!** (0.99x)
232+
- **3×1g is actually FASTER than full GPU!** (0.89x)
233+
- This step is NOT very GPU-intensive, so the size of the GPU slice matters little
234+
235+
**Multiple small vs single large:**
236+
- 3×1g vs 1×3g: **3×1g is 18% faster**
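The "X% faster" comparisons can be derived from the slowdown columns of the two MIG tables. This sketch assumes "faster" means the relative reduction in runtime; the function name is hypothetical.

```python
def percent_faster(slowdown_single_large: float, slowdown_multi_small: float) -> float:
    """Runtime reduction (%) of the multi-slice config versus one large slice."""
    return (1.0 - slowdown_multi_small / slowdown_single_large) * 100.0


# Slowdowns versus a full GPU, copied from the tables.
match_photos_3 = percent_faster(1.09, 0.89)   # 1x3g vs 3x1g, matchPhotos
depth_maps_2 = percent_faster(1.89, 1.73)     # 1x2g vs 2x1g, buildDepthMaps
depth_maps_3 = percent_faster(1.49, 1.26)     # 1x3g vs 3x1g, buildDepthMaps

print(f"matchPhotos    3x1g vs 1x3g: {match_photos_3:.0f}% faster")
print(f"buildDepthMaps 2x1g vs 1x2g: {depth_maps_2:.0f}% faster")
print(f"buildDepthMaps 3x1g vs 1x3g: {depth_maps_3:.0f}% faster")
```

Because the table slowdowns are already rounded, a value recomputed this way can differ from the reported one by about a percentage point.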
237+
238+
**Recommendation:** For this step, MIG slicing is **incredibly efficient**
239+
240+
---
241+
242+
#### Other Steps (align_cameras, setup, finalize)
243+
244+
**Performance:** Moderate to poor GPU scaling
245+
- These are CPU-bound, not GPU-bound
246+
- MIG efficiency: 30-50%
247+
- But they don't use much GPU anyway
248+
249+
**Not a concern** - these steps shouldn't be using GPU nodes
250+
251+
---
252+
253+
### Overall MIG Recommendations
254+
255+
#### 1. **3×1g vs 1×3g: Use 3×1g**
256+
257+
**Reasoning:**
258+
- 3×1g is **15% faster** for buildDepthMaps
259+
- 3×1g is **18% faster** for matchPhotos
260+
- Performance is equivalent or better across all steps
261+
- Better scheduling flexibility
262+
263+
#### 2. **2×1g vs 1×2g: Use 2×1g**
264+
265+
**Reasoning:**
266+
- 2×1g is **8% faster** for buildDepthMaps
267+
- 2×1g is **14% faster** for matchPhotos
268+
- Slight performance advantage
269+
270+
#### 3. **MIG Slicing is HIGHLY EFFICIENT**
271+
272+
**Key Finding:** Even 1/7 GPU slices perform well above linear scaling: about **2.4x** better for buildDepthMaps and **4.6x** better for matchPhotos
273+
274+
**Practical Implication:**
275+
- You can run **3x more jobs** with 3×(1g.5gb) slices
276+
- Each job is only **1.5x slower** (not 3x slower!)
277+
- **Net throughput: 2x improvement** with MIG slicing
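The throughput claim is the number of concurrent jobs divided by the per-job slowdown. A minimal sketch under that assumption, with a hypothetical function name:

```python
def net_throughput(concurrent_jobs: int, per_job_slowdown: float) -> float:
    """Jobs completed per unit time, relative to one job running unsliced."""
    if per_job_slowdown <= 0:
        raise ValueError("per_job_slowdown must be positive")
    return concurrent_jobs / per_job_slowdown


# Three 1g slices, each job about 1.5x slower, as stated above.
print(f"{net_throughput(3, 1.5):.1f}x")
```

The 1.5x figure matches the matchPhotos 1×1g row (1.53x). For a step with a larger per-slice slowdown, such as buildDepthMaps at 2.97x, the same calculation gives a smaller gain, so the result is worth recomputing per step.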
278+
279+
**Why this matters:**
280+
- Cost efficiency: Run more jobs per GPU
281+
- Scheduling: Better bin-packing
282+
- Utilization: No wasted GPU capacity
283+
284+
---
285+
286+
## Overall Recommendations
287+
288+
### For CPU Workloads:
289+
290+
1. **For single jobs:**
291+
- Use **m3.large (16c)** for most steps
292+
- Only use m3.xl for matchPhotos step specifically
293+
- Overall: **m3.large is better value**
294+
295+
2. **For maximum throughput:**
296+
- **Run 2 parallel jobs on m3.xl (32c)**
297+
- Expected: 90-95% performance per job
298+
- Benefit: 1.8-1.9x throughput, roughly 17-21% cost savings per job
299+
- **This is the recommended approach**
300+
301+
### For GPU Workloads:
302+
303+
1. **Use MIG slicing aggressively**
304+
- Prefer **multiple small slices over single large slices** (3×1g > 1×3g)
305+
- Even 1/7 slices are highly efficient
306+
- No performance penalty, often performance gain
307+
308+
2. **MIG scaling is exceptional**
309+
- 157-463% efficiency vs linear scaling
310+
- Workload is NOT bandwidth limited
311+
- MIG overhead is negligible
312+
313+
3. **Schedule based on flexibility, not performance**
314+
- All MIG configs perform well
315+
- Choose based on what fits your scheduler best
316+
317+
### Variability Across Projects
318+
319+
**High variability steps** (>15% std dev):
320+
- matchPhotos (CPU): ±27% - depends on image similarity
321+
- build_mesh/buildModel: ±20% - depends on point cloud density
322+
323+
**Low-to-moderate variability steps** (<15% std dev):
324+
- Most steps: ±5-10% - very consistent
325+
- MIG performance: ±5-14% - reliable scaling
326+
327+
**Conclusion:** Results are generally consistent. Plan based on averages.
