|
| 1 | +# Photogrammetry Workflow Scaling Analysis - Key Findings |
| 2 | + |
| 3 | +## Understanding Efficiency Metrics |
| 4 | + |
| 5 | +**Efficiency = (Speedup / CoreRatio) × 100%** |
| 6 | + |
| 7 | +- **100% = Perfect Scaling**: Doubling cores doubles speed |
| 8 | +- **>100% = Super-linear**: Better than expected (cache effects, memory bandwidth) |
| 9 | +- **<100% = Sub-linear**: Parallelization overhead |
| 10 | + |
| 11 | +**Why can efficiency exceed 100%?** |
| 12 | +1. **Cache effects**: More cores = more L3 cache |
| 13 | +2. **Memory bandwidth**: Better utilization with more cores |
| 14 | +3. **NUMA locality**: Better memory placement |
| 15 | +4. **Reduced contention**: Less lock contention per core |
| 16 | + |
| 17 | +## PART 1: CPU Scaling Analysis (16c vs 32c) |
| 18 | + |
| 19 | +### Summary by Step Category |
| 20 | + |
| 21 | +| Step Type | Avg Efficiency | Avg Speedup | CPU% (16c) | CPU% (32c) | Recommendation | |
| 22 | +|-----------|---------------|-------------|------------|------------|----------------| |
| 23 | +| **match_photos** | 115% ✓ | 2.30x | 57% | 8% | **Use 32c - super-linear!** | |
| 24 | +| **build_point_cloud** (buildPointCloud) | 78% | 1.57x | 73% | 60% | Use 32c or bin-pack | |
| 25 | +| **build_point_cloud** (classifyGroundPoints) | 85% ✓ | 1.70x | 80% | 71% | **Use 32c - good scaling** | |
| 26 | +| **align_cameras** (alignCameras) | 70% | 1.41x | 77% | 69% | Consider 16c or bin-pack | |
| 27 | +| **align_cameras** (optimizeCameras) | 62% | 1.23x | 62% | 48% | Consider 16c or bin-pack | |
| 28 | +| **build_mesh** (buildModel) | 69% | 1.37x | 52% | 36% | Consider 16c or bin-pack | |
| 29 | +| **build_dem_orthomosaic** | 59% | 1.19x | 23% | 15% | **Use 16c or bin-pack** | |
| 30 | +| **build_depth_maps** | 51% | 1.02x | 18% | 18% | **Use 16c or bin-pack** | |
| 31 | +| **setup** | 47% | 0.94x | 5% | 2% | **Use 16c or bin-pack** | |
| 32 | + |
| 33 | +### Detailed Step-by-Step Findings |
| 34 | + |
| 35 | +#### 1. match_photos / matchPhotos ⭐ **SUPER-LINEAR SCALING** |
| 36 | + |
| 37 | +**Performance:** 115% efficiency (2.30x speedup) |
| 38 | + |
| 39 | +**Key Finding:** This is the ONLY step that shows super-linear scaling! |
| 40 | + |
| 41 | +**Variability:** HIGH (±27% std dev) - performance depends on dataset: |
| 42 | +- Best: Project 000195: **137% efficiency** (2300s → 838s) |
| 43 | +- Worst: benchmarking-emerald-subset: **56% efficiency** (64s → 57s) |
| 44 | + |
| 45 | +**Why super-linear?** |
| 46 | +- CPU% drops from 57% to 8% on 32c, indicating this is NOT compute-bound |
| 47 | +- Likely memory/cache bound - benefits from 2x more L3 cache on 32-core system |
| 48 | +- Algorithm parallelizes extremely well with more cores |
| 49 | + |
| 50 | +**Recommendation:** ✓ **Strongly use m3.xl (32c) for this step** |
| 51 | + |
| 52 | +--- |
| 53 | + |
| 54 | +#### 2. build_point_cloud / classifyGroundPoints ⭐ **EXCELLENT SCALING** |
| 55 | + |
| 56 | +**Performance:** 85% efficiency (1.70x speedup) |
| 57 | + |
| 58 | +**Key Finding:** Near-linear scaling, one project achieved perfect 100% |
| 59 | + |
| 60 | +**Variability:** MODERATE (±12% std dev) |
| 61 | +- Best: Project 0068_000434_000440: **101% efficiency** (43243s → 21484s) |
| 62 | +- Worst: Project 000192: **62% efficiency** (795s → 478s) |
| 63 | + |
| 64 | +**CPU Utilization:** High on both (80% → 71%) |
| 65 | + |
| 66 | +**Recommendation:** ✓ **Use m3.xl (32c) - excellent value** |
| 67 | + |
| 68 | +--- |
| 69 | + |
| 70 | +#### 3. build_point_cloud / buildPointCloud - **GOOD SCALING** |
| 71 | + |
| 72 | +**Performance:** 78% efficiency (1.57x speedup) |
| 73 | + |
| 74 | +**Variability:** MODERATE (±9% std dev) |
| 75 | +- Best: Project 0131_000015_000013: **91% efficiency** |
| 76 | +- Worst: Project 0068_000434_000440: **66% efficiency** |
| 77 | + |
| 78 | +**CPU Utilization:** High (73% → 60%) |
| 79 | + |
| 80 | +**Recommendation:** → **Use m3.xl (32c) acceptable, or bin-pack** |
| 81 | + |
| 82 | +--- |
| 83 | + |
| 84 | +#### 4. align_cameras / alignCameras - **MODERATE SCALING** |
| 85 | + |
| 86 | +**Performance:** 70% efficiency (1.41x speedup) |
| 87 | + |
| 88 | +**Variability:** MODERATE (±7% std dev) |
| 89 | +- Range: 58% to 80% efficiency |
| 90 | + |
| 91 | +**CPU Utilization:** Very high (77% → 69%) - this is CPU-intensive |
| 92 | + |
| 93 | +**Recommendation:** ⚠ **Marginal benefit from 32c. Consider 16c or pack 2 jobs on 32c** |
| 94 | + |
| 95 | +**Key Insight:** High CPU usage (77%) but only 70% efficiency suggests: |
| 96 | +- Process is trying to use all cores |
| 97 | +- But parallelization overhead limits speedup |
| 98 | +- **This is a prime candidate for bin-packing 2 jobs on m3.xl** |
| 99 | + |
| 100 | +--- |
| 101 | + |
| 102 | +#### 5. build_depth_maps / buildDepthMaps - **POOR CPU SCALING** (GPU step) |
| 103 | + |
| 104 | +**Performance:** 51% efficiency (1.02x speedup) |
| 105 | + |
| 106 | +**CPU Utilization:** Very LOW (18% on both 16c and 32c) |
| 107 | + |
| 108 | +**Why?** This is GPU-bound, not CPU-bound. Extra CPU cores don't help. |
| 109 | + |
| 110 | +**Recommendation:** ✓ **Definitely use 16c or bin-pack - wasting 32c here** |
| 111 | + |
| 112 | +--- |
| 113 | + |
| 114 | +#### 6. build_dem_orthomosaic (all substeps) - **POOR SCALING** |
| 115 | + |
| 116 | +**Performance:** 59% efficiency (1.19x speedup) |
| 117 | + |
| 118 | +**CPU Utilization:** Very LOW (23% → 15%) |
| 119 | + |
| 120 | +**Key Issue:** These steps don't parallelize well AND don't use many cores |
| 121 | + |
| 122 | +**Recommendation:** ✓ **Use 16c or bin-pack 2 jobs on 32c** |
| 123 | + |
| 124 | +--- |
| 125 | + |
| 126 | +#### 7. setup / addPhotos - **NO SCALING BENEFIT** |
| 127 | + |
| 128 | +**Performance:** 47% efficiency (0.94x speedup - SLOWER on 32c!) |
| 129 | + |
| 130 | +**CPU Utilization:** Nearly zero (5% → 2%) |
| 131 | + |
| 132 | +**Why?** I/O bound, not compute bound. Just loading data. |
| 133 | + |
| 134 | +**Recommendation:** ✓ **Use 16c or bin-pack many jobs** |
| 135 | + |
| 136 | +--- |
| 137 | + |
| 138 | +## PART 2: Running 2 Jobs on m3.xl vs 1 Job on m3.large |
| 139 | + |
| 140 | +### Analysis Summary |
| 141 | + |
| 142 | +**Overall CPU utilization on 16c: 34.4%** |
| 143 | +**Steps with <50% CPU: 14/20 (70%)** |
| 144 | + |
| 145 | +### ✓ **STRONG RECOMMENDATION: Run 2 jobs on m3.xl** |
| 146 | + |
| 147 | +**Why this works:** |
| 148 | +1. Most steps use <50% CPU on 16 cores |
| 149 | +2. When 2 jobs run on 32 cores, each gets ~16 cores worth of CPU time |
| 150 | +3. Linux scheduler distributes fairly between processes |
| 151 | +4. Minimal interference expected |
| 152 | + |
| 153 | +### Per-Step Bin-Packing Suitability |
| 154 | + |
| 155 | +| Category | Steps | Can Pack 2 Jobs? | Reasoning | |
| 156 | +|----------|-------|-----------------|-----------| |
| 157 | +| **Safe** | setup, build_dem_orthomosaic (all), build_depth_maps, build_mesh (export), finalize | ✓ **YES** | CPU <50%, plenty headroom | |
| 158 | +| **Caution** | match_photos, align_cameras (optimize), build_mesh (buildModel) | **MAYBE** | CPU 50-70%, some contention risk | |
| 159 | +| **Avoid** | align_cameras (align), build_point_cloud (both) | **NO** | CPU >70%, likely contention | |
| 160 | + |
| 161 | +### Expected Performance Impact |
| 162 | + |
| 163 | +**Conservative estimate:** Each job will perform at **90-95%** of m3.large speed |
| 164 | + |
| 165 | +**Reasoning:** |
| 166 | +- 70% of steps will run at 100% speed (CPU <50%) |
| 167 | +- 15% of steps may slow 10-20% (CPU 50-70%) |
| 168 | +- 15% of steps may slow 20-40% (CPU >70%) |
| 169 | + |
| 170 | +**Weighted average:** ~90-95% performance with **2x throughput** = huge win! |
| 171 | + |
| 172 | +### Cost-Benefit Analysis |
| 173 | + |
| 174 | +| Configuration | Cost | Jobs/Instance | Throughput | Cost per Job | |
| 175 | +|--------------|------|---------------|------------|--------------| |
| 176 | +| m3.large (16c) | 1.0x | 1 | 1.0x | 1.0x | |
| 177 | +| m3.xl single (32c) | 1.5x | 1 | 1.0x | 1.5x ❌ | |
| 178 | +| m3.xl dual (32c) | 1.5x | 2 | 1.8-1.9x | **0.79-0.83x** ✓ | |
| 179 | + |
| 180 | +**Conclusion:** Running 2 jobs on m3.xl gives **15-20% cost savings** per job! |
| 181 | + |
| 182 | +--- |
| 183 | + |
| 184 | +## PART 3: MIG GPU Scaling |
| 185 | + |
| 186 | +### Key Findings by Step |
| 187 | + |
| 188 | +#### build_depth_maps / buildDepthMaps ⭐ **EXCEPTIONAL MIG PERFORMANCE** |
| 189 | + |
| 190 | +**MIG Scaling Efficiency (adding slices):** |
| 191 | +- 1g → 2g: **79% efficiency** (1.57x speedup) |
| 192 | +- 1g → 3g: **67% efficiency** (2.00x speedup) |
| 193 | + |
| 194 | +**Multiple small vs single large:** |
| 195 | +- 2×1g vs 1×2g: **2×1g is 8% faster** ✓ |
| 196 | +- 3×1g vs 1×3g: **3×1g is 15% faster** ✓ |
| 197 | + |
| 198 | +**MIG vs Full GPU - EXCEPTIONAL RESULTS:** |
| 199 | + |
| 200 | +| Config | Expected Slowdown | Actual Slowdown | Efficiency | |
| 201 | +|--------|------------------|-----------------|-----------| |
| 202 | +| 1×1g (1/7 GPU) | 7.0x | **2.97x** | **236%** ⭐ | |
| 203 | +| 1×2g (2/7 GPU) | 3.5x | **1.89x** | **185%** ⭐ | |
| 204 | +| 1×3g (3/7 GPU) | 2.33x | **1.49x** | **157%** ⭐ | |
| 205 | +| 2×1g (2/7 GPU) | 3.5x | **1.73x** | **202%** ⭐ | |
| 206 | +| 3×1g (3/7 GPU) | 2.33x | **1.26x** | **185%** ⭐ | |
| 207 | + |
| 208 | +**Interpretation:** |
| 209 | +- A 1/7 GPU slice is only **3x slower** instead of 7x slower! |
| 210 | +- MIG isolation overhead is **minimal to non-existent** |
| 211 | +- Workload is NOT memory bandwidth limited |
| 212 | +- **All MIG configs perform 157-236% better than expected** |
| 213 | + |
| 214 | +**Variability:** LOW (±5-14% std dev) - very consistent across projects |
| 215 | + |
| 216 | +--- |
| 217 | + |
| 218 | +#### match_photos / matchPhotos ⭐ **EVEN BETTER MIG PERFORMANCE** |
| 219 | + |
| 220 | +**MIG vs Full GPU - ASTONISHING RESULTS:** |
| 221 | + |
| 222 | +| Config | Expected Slowdown | Actual Slowdown | Efficiency | |
| 223 | +|--------|------------------|-----------------|-----------| |
| 224 | +| 1×1g (1/7 GPU) | 7.0x | **1.53x** | **463%** 🚀 | |
| 225 | +| 1×2g (2/7 GPU) | 3.5x | **1.16x** | **303%** 🚀 | |
| 226 | +| 1×3g (3/7 GPU) | 2.33x | **1.09x** | **215%** 🚀 | |
| 227 | +| 2×1g (2/7 GPU) | 3.5x | **0.99x** | **355%** 🚀 | |
| 228 | +| 3×1g (3/7 GPU) | 2.33x | **0.89x** | **263%** 🚀 | |
| 229 | + |
| 230 | +**INCREDIBLE:** |
| 231 | +- **2×1g is same speed as full GPU!** (0.99x) |
| 232 | +- **3×1g is actually FASTER than full GPU!** (0.89x) |
| 233 | +- This step is NOT very GPU-intensive, minimal GPU slicing matters |
| 234 | + |
| 235 | +**Multiple small vs single large:** |
| 236 | +- 3×1g vs 1×3g: **3×1g is 18% faster** ✓ |
| 237 | + |
| 238 | +**Recommendation:** For this step, MIG slicing is **incredibly efficient** |
| 239 | + |
| 240 | +--- |
| 241 | + |
| 242 | +#### Other Steps (align_cameras, setup, finalize) |
| 243 | + |
| 244 | +**Performance:** Moderate to poor GPU scaling |
| 245 | +- These are CPU-bound, not GPU-bound |
| 246 | +- MIG efficiency: 30-50% |
| 247 | +- But they don't use much GPU anyway |
| 248 | + |
| 249 | +**Not a concern** - these steps shouldn't be using GPU nodes |
| 250 | + |
| 251 | +--- |
| 252 | + |
| 253 | +### Overall MIG Recommendations |
| 254 | + |
| 255 | +#### 1. **3×1g vs 1×3g: Use 3×1g** ✓ |
| 256 | + |
| 257 | +**Reasoning:** |
| 258 | +- 3×1g is **15% faster** for buildDepthMaps |
| 259 | +- 3×1g is **18% faster** for matchPhotos |
| 260 | +- Performance is equivalent or better across all steps |
| 261 | +- Better scheduling flexibility |
| 262 | + |
| 263 | +#### 2. **2×1g vs 1×2g: Use 2×1g** ✓ |
| 264 | + |
| 265 | +**Reasoning:** |
| 266 | +- 2×1g is **8% faster** for buildDepthMaps |
| 267 | +- 2×1g is **14% faster** for matchPhotos |
| 268 | +- Slight performance advantage |
| 269 | + |
| 270 | +#### 3. **MIG Slicing is HIGHLY EFFICIENT** ⭐ |
| 271 | + |
| 272 | +**Key Finding:** Even 1/7 GPU slices perform **2-3x better than linear scaling** |
| 273 | + |
| 274 | +**Practical Implication:** |
| 275 | +- You can run **3x more jobs** with 3×(1g.5gb) slices |
| 276 | +- Each job is only **1.5x slower** (not 3x slower!) |
| 277 | +- **Net throughput: 2x improvement** with MIG slicing |
| 278 | + |
| 279 | +**Why this matters:** |
| 280 | +- Cost efficiency: Run more jobs per GPU |
| 281 | +- Scheduling: Better bin-packing |
| 282 | +- Utilization: No wasted GPU capacity |
| 283 | + |
| 284 | +--- |
| 285 | + |
| 286 | +## Overall Recommendations |
| 287 | + |
| 288 | +### For CPU Workloads: |
| 289 | + |
| 290 | +1. **For single jobs:** |
| 291 | + - Use **m3.large (16c)** for most steps |
| 292 | + - Only use m3.xl for matchPhotos step specifically |
| 293 | + - Overall: **m3.large is better value** |
| 294 | + |
| 295 | +2. **For maximum throughput:** |
| 296 | + - ✓ **Run 2 parallel jobs on m3.xl (32c)** |
| 297 | + - Expected: 90-95% performance per job |
| 298 | + - Benefit: 1.8x throughput, 15-20% cost savings per job |
| 299 | + - **This is the recommended approach** |
| 300 | + |
| 301 | +### For GPU Workloads: |
| 302 | + |
| 303 | +1. **Use MIG slicing aggressively** |
| 304 | + - Prefer **multiple small slices over single large slices** (3×1g > 1×3g) |
| 305 | + - Even 1/7 slices are highly efficient |
| 306 | + - No performance penalty, often performance gain |
| 307 | + |
| 308 | +2. **MIG scaling is exceptional** |
| 309 | + - 150-463% efficiency vs linear scaling |
| 310 | + - Workload is NOT bandwidth limited |
| 311 | + - MIG overhead is negligible |
| 312 | + |
| 313 | +3. **Schedule based on flexibility, not performance** |
| 314 | + - All MIG configs perform well |
| 315 | + - Choose based on what fits your scheduler best |
| 316 | + |
| 317 | +### Variability Across Projects |
| 318 | + |
| 319 | +**High variability steps** (>15% std dev): |
| 320 | +- matchPhotos (CPU): ±27% - depends on image similarity |
| 321 | +- build_mesh/buildModel: ±20% - depends on point cloud density |
| 322 | + |
| 323 | +**Low variability steps** (<10% std dev): |
| 324 | +- Most steps: ±5-10% - very consistent |
| 325 | +- MIG performance: ±5-14% - reliable scaling |
| 326 | + |
| 327 | +**Conclusion:** Results are generally consistent. Plan based on averages. |
0 commit comments