Commit 2269160
Script for benchmark stability assessment (#10982)
### Summary
Adds a custom script for assessing the stability of ET benchmarks. Install its dependencies first:
```
pip install openpyxl tabulate matplotlib
```
Then run the script on the primary dataset, passing the second dataset as the reference for comparison:
```
python .ci/scripts/analyze_benchmark_stability.py \
Benchmark\ Dataset\ with\ Private\ AWS\ Devices.xlsx \
--reference_file Benchmark\ Dataset\ with\ Public\ AWS\ Devices.xlsx
```
Datasets:
- [Benchmark Dataset with Private AWS Devices.xlsx](https://github.com/user-attachments/files/20596657/Benchmark.Dataset.with.Private.AWS.Devices.xlsx)
- [Benchmark Dataset with Public AWS Devices.xlsx](https://github.com/user-attachments/files/20596666/Benchmark.Dataset.with.Public.AWS.Devices.xlsx)
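For each model and device pair, the report generated from these datasets lists dispersion and inter-run jitter metrics: coefficient of variation (CV), interquartile range, max/min ratio, P99/P50 ratio, and a rolling standard deviation with a window of 5. The following is a minimal sketch of how such metrics could be computed for a single latency series. It is not the code from `analyze_benchmark_stability.py`: the names `percentile` and `stability_metrics`, the linear-interpolation percentile, and the choice of the sample standard deviation are assumptions made for illustration.

```python
import statistics


def percentile(sorted_vals, q):
    """Percentile q (0-100) of pre-sorted values, using linear interpolation.

    The interpolation rule is an assumption; the real script may use another.
    """
    if not sorted_vals:
        raise ValueError("need at least one sample")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def stability_metrics(latencies_ms, window=5):
    """Summarize one latency series (hypothetical helper, not the script's API).

    latencies_ms must be in run order and contain at least two samples.
    """
    vals = sorted(latencies_ms)
    mean = statistics.fmean(latencies_ms)
    std = statistics.stdev(latencies_ms)  # sample std (n - 1); an assumption
    p50 = percentile(vals, 50)
    p99 = percentile(vals, 99)
    # Standard deviation of each consecutive window of runs, then averaged.
    rolling = [
        statistics.stdev(latencies_ms[i:i + window])
        for i in range(len(latencies_ms) - window + 1)
    ]
    return {
        "mean": mean,
        "median": p50,
        "std": std,
        "cv_pct": std / mean * 100.0,
        "iqr": percentile(vals, 75) - percentile(vals, 25),
        "max_min_ratio": vals[-1] / vals[0],
        "p99_p50_ratio": p99 / p50,
        "mean_rolling_std": statistics.fmean(rolling) if rolling else float("nan"),
    }
```

Dividing the standard deviation by the mean agrees with the CV values in the report (for example, 595.01 / 22502.10 ≈ 2.64%), and the P99/P50 ratio agrees as well (23910.11 / 22447.56 ≈ 1.0652). The script's exact percentile and rolling-window conventions are not shown in this commit message, so other values from this sketch may differ slightly from the report.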
The generated analysis:
```
Analyzing latency stability from primary file: /Users/guangyang/Desktop/Benchmark Dataset with Private AWS Devices.xlsx
Using reference file for comparison: /Users/guangyang/Desktop/Benchmark Dataset with Public AWS Devices.xlsx
====================================================================================================
===== LOADING PRIMARY DATASETS (Private) ==========================================================
====================================================================================================
Loading dataset: llama3_qlora+s22_android13
Loading dataset: llama3_spinq+s22_android13
Loading dataset: mv3_qnn+s22_android13
Loading dataset: mv3_xnnq8+s22_android13
Loading dataset: llama3_qlora+s22ultra_android14
Loading dataset: llama3_spinq+s22ultra_android14
Loading dataset: mv3_qnn+s22ultra_android14
Loading dataset: mv3_xnnq8+s22ultra_android14
Loading dataset: mv3_xnnq8+pixel3_rooted_android
Loading dataset: llama3_qlora+iphone15max_ios17
Loading dataset: llama3_spinq+iphone15max_ios17
Loading dataset: mv3_xnnq8+iphone15max_ios17
Loading dataset: mv3_coreml+iphone15max_ios17
Loading dataset: mv3_mps+iphone15max_ios17
Loading dataset: llama3_qlora+iphone15_ios18
Loading dataset: llama3_spinq+iphone15_ios18
Loading dataset: mv3_xnnq8+iphone15_ios18
Loading dataset: mv3_coreml+iphone15_ios18
Loading dataset: mv3_mps+iphone15_ios18
====================================================================================================
===== LOADING REFERENCE DATASETS (Public) =========================================================
====================================================================================================
Loading reference dataset: llama3_qlora+s22_android13
Loading reference dataset: llama3_spinq+s22_android13
Loading reference dataset: mv3_qnn+s22_android13
Loading reference dataset: mv3_xnnq8+s22_android13
Loading reference dataset: llama3_spinq+s22_android12
Loading reference dataset: llama3_qlora+s22Ultra5G_android
Loading reference dataset: llama3_spinq+s22ultra_android12
Loading reference dataset: mv3_xnnq8+s22ultra_android12
Loading reference dataset: mv3_qnn+s22ultra_android12
Loading reference dataset: llama3_qlora+iphone15max_ios17
Loading reference dataset: llama3_spinq+iphone15max_ios17
Loading reference dataset: mv3_xnnq8+iphone15max_ios17
Loading reference dataset: mv3_coreml+iphone15max_ios17
Loading reference dataset: mv3_mps+iphone15max_ios17
Loading reference dataset: llama3_qlora+iphone15_ios18
Loading reference dataset: llama3_spinq+iphone15_ios18
Loading reference dataset: mv3_xnnq8+iphone15_ios18
Loading reference dataset: mv3_coreml+iphone15_ios18
Loading reference dataset: mv3_mps+iphone15_ios18
====================================================================================================
===== ANALYZING PRIMARY DATASETS ==================================================================
====================================================================================================
Latency Stability Analysis: llama3_qlora+s22_android13 (Primary)
================================================================================
Model: llama3_qlora
Device: s22_android13
Dataset Overview:
- Number of samples: 88
- Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00
Central Tendency Metrics:
- Mean latency: 22502.10 ms
- Median latency (P50): 22447.56 ms
- Mean trimmed latency: 22388.87 ms
- Median trimmed latency: 22343.47 ms
Dispersion Metrics:
- Standard deviation: 595.01 ms
- Coefficient of variation (CV): 2.64%
- Interquartile range (IQR): 858.26 ms
- Trimmed standard deviation: 596.25 ms
- Trimmed coefficient of variation: 2.66%
Percentile Metrics:
- P50 (median): 22447.56 ms
- P90: 23231.99 ms
- P95: 23518.35 ms
- P99: 23910.11 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.1423
- P99/P50 ratio: 1.0652
- Mean rolling std (window=5): 539.36 ms
Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 0.50%
- Max trimming effect ratio: 0.81%
Throughput Metrics:
- Mean TPS: 33.07
- TPS coefficient of variation: 6.92%
Stability Assessment:
- Overall stability score: 83.4/100
- Overall stability rating: Good
Interpretation:
The benchmark shows good stability (score: 83.4/100) with low
variation between runs (CV: 2.64%).
Performance is consistent and predictable for most use cases.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+s22_android13_primary_time_series.png
Latency Stability Analysis: llama3_spinq+s22_android13 (Primary)
================================================================================
Model: llama3_spinq
Device: s22_android13
Dataset Overview:
- Number of samples: 88
- Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00
Central Tendency Metrics:
- Mean latency: 21771.59 ms
- Median latency (P50): 21668.24 ms
- Mean trimmed latency: 21662.53 ms
- Median trimmed latency: 21559.89 ms
Dispersion Metrics:
- Standard deviation: 514.89 ms
- Coefficient of variation (CV): 2.36%
- Interquartile range (IQR): 602.75 ms
- Trimmed standard deviation: 515.03 ms
- Trimmed coefficient of variation: 2.38%
Percentile Metrics:
- P50 (median): 21668.24 ms
- P90: 22438.74 ms
- P95: 22542.42 ms
- P99: 23104.76 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.1452
- P99/P50 ratio: 1.0663
- Mean rolling std (window=5): 449.10 ms
Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 0.50%
- Max trimming effect ratio: 0.89%
Throughput Metrics:
- Mean TPS: 33.76
- TPS coefficient of variation: 4.70%
Stability Assessment:
- Overall stability score: 84.7/100
- Overall stability rating: Good
Interpretation:
The benchmark shows good stability (score: 84.7/100) with low
variation between runs (CV: 2.36%).
Performance is consistent and predictable for most use cases.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22_android13_primary_time_series.png
Latency Stability Analysis: mv3_qnn+s22_android13 (Primary)
================================================================================
Model: mv3_qnn
Device: s22_android13
Dataset Overview:
- Number of samples: 100
- Date range: 2025-04-29 09:48:57+00:00 to 2025-05-15 21:14:41+00:00
Central Tendency Metrics:
- Mean latency: 1.01 ms
- Median latency (P50): 1.00 ms
- Mean trimmed latency: 1.00 ms
- Median trimmed latency: 1.00 ms
Dispersion Metrics:
- Standard deviation: 0.02 ms
- Coefficient of variation (CV): 2.34%
- Interquartile range (IQR): 0.01 ms
- Trimmed standard deviation: 0.02 ms
- Trimmed coefficient of variation: 2.27%
Percentile Metrics:
- P50 (median): 1.00 ms
- P90: 1.01 ms
- P95: 1.01 ms
- P99: 1.14 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.1919
- P99/P50 ratio: 1.1404
- Mean rolling std (window=5): 0.01 ms
Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 0.19%
- Max trimming effect ratio: 1.00%
Stability Assessment:
- Overall stability score: 82.4/100
- Overall stability rating: Good
Interpretation:
The benchmark shows good stability (score: 82.4/100) with low
variation between runs (CV: 2.34%).
Performance is consistent and predictable for most use cases.
The P99/P50 ratio of 1.14 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_qnn+s22_android13_primary_time_series.png
Latency Stability Analysis: mv3_xnnq8+s22_android13 (Primary)
================================================================================
Model: mv3_xnnq8
Device: s22_android13
Dataset Overview:
- Number of samples: 88
- Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00
Central Tendency Metrics:
- Mean latency: 2.73 ms
- Median latency (P50): 2.65 ms
- Mean trimmed latency: 2.22 ms
- Median trimmed latency: 2.10 ms
Dispersion Metrics:
- Standard deviation: 0.63 ms
- Coefficient of variation (CV): 23.03%
- Interquartile range (IQR): 0.95 ms
- Trimmed standard deviation: 0.36 ms
- Trimmed coefficient of variation: 15.98%
Percentile Metrics:
- P50 (median): 2.65 ms
- P90: 3.59 ms
- P95: 3.74 ms
- P99: 4.46 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 2.4427
- P99/P50 ratio: 1.6812
- Mean rolling std (window=5): 0.60 ms
Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 16.52%
- Max trimming effect ratio: 36.96%
Stability Assessment:
- Overall stability score: 14.9/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 14.9/100) with significant
variation between runs (CV: 23.03%).
Performance is unpredictable and may lead to inconsistent user experience.
The significant difference between raw and trimmed means suggests
considerable intra-run jitter (16.5%) with occasional outliers within benchmark runs.
The max/min ratio of 2.44 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.68 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+s22_android13_primary_time_series.png
Latency Stability Analysis: llama3_qlora+s22ultra_android14 (Primary)
================================================================================
Model: llama3_qlora
Device: s22ultra_android14
Dataset Overview:
- Number of samples: 88
- Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00
Central Tendency Metrics:
- Mean latency: 25022.84 ms
- Median latency (P50): 25427.33 ms
- Mean trimmed latency: 24748.06 ms
- Median trimmed latency: 25062.01 ms
Dispersion Metrics:
- Standard deviation: 1545.62 ms
- Coefficient of variation (CV): 6.18%
- Interquartile range (IQR): 2844.11 ms
- Trimmed standard deviation: 1467.60 ms
- Trimmed coefficient of variation: 5.93%
Percentile Metrics:
- P50 (median): 25427.33 ms
- P90: 26581.31 ms
- P95: 27184.07 ms
- P99: 28668.97 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.2710
- P99/P50 ratio: 1.1275
- Mean rolling std (window=5): 1560.71 ms
Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 1.08%
- Max trimming effect ratio: 4.80%
Throughput Metrics:
- Mean TPS: 28.35
- TPS coefficient of variation: 7.88%
Stability Assessment:
- Overall stability score: 62.5/100
- Overall stability rating: Moderate
Interpretation:
The benchmark shows moderate stability (score: 62.5/100) with noticeable
variation between runs (CV: 6.18%).
While average performance is acceptable, occasional latency spikes may occur.
The max/min ratio of 1.27 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.13 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+s22ultra_android14_primary_time_series.png
Latency Stability Analysis: llama3_spinq+s22ultra_android14 (Primary)
================================================================================
Model: llama3_spinq
Device: s22ultra_android14
Dataset Overview:
- Number of samples: 88
- Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00
Central Tendency Metrics:
- Mean latency: 24761.78 ms
- Median latency (P50): 25043.89 ms
- Mean trimmed latency: 24466.21 ms
- Median trimmed latency: 24731.04 ms
Dispersion Metrics:
- Standard deviation: 1552.25 ms
- Coefficient of variation (CV): 6.27%
- Interquartile range (IQR): 1931.42 ms
- Trimmed standard deviation: 1466.19 ms
- Trimmed coefficient of variation: 5.99%
Percentile Metrics:
- P50 (median): 25043.89 ms
- P90: 26163.60 ms
- P95: 26948.68 ms
- P99: 28868.51 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.3648
- P99/P50 ratio: 1.1527
- Mean rolling std (window=5): 1451.05 ms
Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 1.17%
- Max trimming effect ratio: 4.90%
Throughput Metrics:
- Mean TPS: 29.85
- TPS coefficient of variation: 8.24%
Stability Assessment:
- Overall stability score: 60.3/100
- Overall stability rating: Moderate
Interpretation:
The benchmark shows moderate stability (score: 60.3/100) with noticeable
variation between runs (CV: 6.27%).
While average performance is acceptable, occasional latency spikes may occur.
The max/min ratio of 1.36 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.15 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22ultra_android14_primary_time_series.png
Latency Stability Analysis: mv3_qnn+s22ultra_android14 (Primary)
================================================================================
Model: mv3_qnn
Device: s22ultra_android14
Dataset Overview:
- Number of samples: 100
- Date range: 2025-04-29 09:48:57+00:00 to 2025-05-15 21:14:41+00:00
Central Tendency Metrics:
- Mean latency: 1.01 ms
- Median latency (P50): 1.01 ms
- Mean trimmed latency: 1.01 ms
- Median trimmed latency: 1.01 ms
Dispersion Metrics:
- Standard deviation: 0.01 ms
- Coefficient of variation (CV): 0.91%
- Interquartile range (IQR): 0.01 ms
- Trimmed standard deviation: 0.01 ms
- Trimmed coefficient of variation: 0.70%
Percentile Metrics:
- P50 (median): 1.01 ms
- P90: 1.02 ms
- P95: 1.02 ms
- P99: 1.03 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.0900
- P99/P50 ratio: 1.0204
- Mean rolling std (window=5): 0.01 ms
Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 0.19%
- Max trimming effect ratio: 1.94%
Stability Assessment:
- Overall stability score: 93.8/100
- Overall stability rating: Excellent
Interpretation:
The benchmark shows excellent stability (score: 93.8/100) with very low
variation between runs (CV: 0.91%).
This indicates highly consistent performance suitable for latency-sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_qnn+s22ultra_android14_primary_time_series.png
Latency Stability Analysis: mv3_xnnq8+s22ultra_android14 (Primary)
================================================================================
Model: mv3_xnnq8
Device: s22ultra_android14
Dataset Overview:
- Number of samples: 88
- Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00
Central Tendency Metrics:
- Mean latency: 2.91 ms
- Median latency (P50): 2.54 ms
- Mean trimmed latency: 2.41 ms
- Median trimmed latency: 2.15 ms
Dispersion Metrics:
- Standard deviation: 1.14 ms
- Coefficient of variation (CV): 39.08%
- Interquartile range (IQR): 0.82 ms
- Trimmed standard deviation: 0.76 ms
- Trimmed coefficient of variation: 31.60%
Percentile Metrics:
- P50 (median): 2.54 ms
- P90: 3.88 ms
- P95: 4.60 ms
- P99: 5.91 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 5.6103
- P99/P50 ratio: 2.3319
- Mean rolling std (window=5): 0.79 ms
Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 15.37%
- Max trimming effect ratio: 38.83%
Stability Assessment:
- Overall stability score: 0.0/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 0.0/100) with significant
variation between runs (CV: 39.08%).
Performance is unpredictable and may lead to inconsistent user experience.
The significant difference between raw and trimmed means suggests
considerable intra-run jitter (15.4%) with occasional outliers within benchmark runs.
The max/min ratio of 5.61 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 2.33 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+s22ultra_android14_primary_time_series.png
Latency Stability Analysis: mv3_xnnq8+pixel3_rooted_android (Primary)
================================================================================
Model: mv3_xnnq8
Device: pixel3_rooted_android
Dataset Overview:
- Number of samples: 148
- Date range: 2025-04-16 02:47:21+00:00 to 2025-04-29 01:17:49+00:00
Central Tendency Metrics:
- Mean latency: 5.93 ms
- Median latency (P50): 5.87 ms
- Mean trimmed latency: 5.51 ms
- Median trimmed latency: 5.45 ms
Dispersion Metrics:
- Standard deviation: 0.46 ms
- Coefficient of variation (CV): 7.68%
- Interquartile range (IQR): 0.56 ms
- Trimmed standard deviation: 0.27 ms
- Trimmed coefficient of variation: 4.84%
Percentile Metrics:
- P50 (median): 5.87 ms
- P90: 6.44 ms
- P95: 6.57 ms
- P99: 7.26 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.6964
- P99/P50 ratio: 1.2386
- Mean rolling std (window=5): 0.41 ms
Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 6.66%
- Max trimming effect ratio: 26.67%
Stability Assessment:
- Overall stability score: 46.9/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 46.9/100) with significant
variation between runs (CV: 7.68%).
Performance is unpredictable and may lead to inconsistent user experience.
The significant difference between raw and trimmed means suggests
considerable intra-run jitter (6.7%) with occasional outliers within benchmark runs.
The max/min ratio of 1.70 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.24 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+pixel3_rooted_android_primary_time_series.png
Latency Stability Analysis: llama3_qlora+iphone15max_ios17 (Primary)
================================================================================
Model: llama3_qlora
Device: iphone15max_ios17
Dataset Overview:
- Number of samples: 54
- Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00
Central Tendency Metrics:
- Mean latency: 12972.80 ms
- Median latency (P50): 12774.50 ms
Dispersion Metrics:
- Standard deviation: 483.26 ms
- Coefficient of variation (CV): 3.73%
- Interquartile range (IQR): 624.00 ms
Percentile Metrics:
- P50 (median): 12774.50 ms
- P90: 13389.70 ms
- P95: 13736.05 ms
- P99: 14730.49 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.1916
- P99/P50 ratio: 1.1531
- Mean rolling std (window=5): 431.32 ms
Throughput Metrics:
- Mean TPS: 10.18
- TPS coefficient of variation: 11.47%
Stability Assessment:
- Overall stability score: 75.2/100
- Overall stability rating: Moderate
Interpretation:
The benchmark shows moderate stability (score: 75.2/100) with noticeable
variation between runs (CV: 3.73%).
While average performance is acceptable, occasional latency spikes may occur.
The P99/P50 ratio of 1.15 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+iphone15max_ios17_primary_time_series.png
Latency Stability Analysis: llama3_spinq+iphone15max_ios17 (Primary)
================================================================================
Model: llama3_spinq
Device: iphone15max_ios17
Dataset Overview:
- Number of samples: 54
- Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00
Central Tendency Metrics:
- Mean latency: 12195.41 ms
- Median latency (P50): 12104.50 ms
Dispersion Metrics:
- Standard deviation: 461.27 ms
- Coefficient of variation (CV): 3.78%
- Interquartile range (IQR): 154.25 ms
Percentile Metrics:
- P50 (median): 12104.50 ms
- P90: 12567.20 ms
- P95: 12760.05 ms
- P99: 14052.31 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.3331
- P99/P50 ratio: 1.1609
- Mean rolling std (window=5): 365.79 ms
Throughput Metrics:
- Mean TPS: 13.89
- TPS coefficient of variation: 16.58%
Stability Assessment:
- Overall stability score: 72.9/100
- Overall stability rating: Moderate
Interpretation:
The benchmark shows moderate stability (score: 72.9/100) with noticeable
variation between runs (CV: 3.78%).
While average performance is acceptable, occasional latency spikes may occur.
The max/min ratio of 1.33 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.16 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+iphone15max_ios17_primary_time_series.png
Latency Stability Analysis: mv3_xnnq8+iphone15max_ios17 (Primary)
================================================================================
Model: mv3_xnnq8
Device: iphone15max_ios17
Dataset Overview:
- Number of samples: 54
- Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00
Central Tendency Metrics:
- Mean latency: 13.98 ms
- Median latency (P50): 14.00 ms
Dispersion Metrics:
- Standard deviation: 3.44 ms
- Coefficient of variation (CV): 24.60%
- Interquartile range (IQR): 4.00 ms
Percentile Metrics:
- P50 (median): 14.00 ms
- P90: 18.00 ms
- P95: 20.00 ms
- P99: 21.94 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 3.2857
- P99/P50 ratio: 1.5671
- Mean rolling std (window=5): 3.40 ms
Stability Assessment:
- Overall stability score: 10.8/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 10.8/100) with significant
variation between runs (CV: 24.60%).
Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 3.29 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.57 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+iphone15max_ios17_primary_time_series.png
Latency Stability Analysis: mv3_coreml+iphone15max_ios17 (Primary)
================================================================================
Model: mv3_coreml
Device: iphone15max_ios17
Dataset Overview:
- Number of samples: 50
- Date range: 2025-04-30 05:23:09+00:00 to 2025-05-10 09:24:40+00:00
Central Tendency Metrics:
- Mean latency: 1.00 ms
- Median latency (P50): 1.00 ms
Dispersion Metrics:
- Standard deviation: 0.00 ms
- Coefficient of variation (CV): 0.00%
- Interquartile range (IQR): 0.00 ms
Percentile Metrics:
- P50 (median): 1.00 ms
- P90: 1.00 ms
- P95: 1.00 ms
- P99: 1.00 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.0000
- P99/P50 ratio: 1.0000
- Mean rolling std (window=5): 0.00 ms
Stability Assessment:
- Overall stability score: 100.0/100
- Overall stability rating: Excellent
Interpretation:
The benchmark shows excellent stability (score: 100.0/100) with very low
variation between runs (CV: 0.00%).
This indicates highly consistent performance suitable for latency-sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_coreml+iphone15max_ios17_primary_time_series.png
Latency Stability Analysis: mv3_mps+iphone15max_ios17 (Primary)
================================================================================
Model: mv3_mps
Device: iphone15max_ios17
Dataset Overview:
- Number of samples: 51
- Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00
Central Tendency Metrics:
- Mean latency: 1.25 ms
- Median latency (P50): 1.00 ms
Dispersion Metrics:
- Standard deviation: 0.44 ms
- Coefficient of variation (CV): 35.07%
- Interquartile range (IQR): 0.50 ms
Percentile Metrics:
- P50 (median): 1.00 ms
- P90: 2.00 ms
- P95: 2.00 ms
- P99: 2.00 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 2.0000
- P99/P50 ratio: 2.0000
- Mean rolling std (window=5): 0.39 ms
Stability Assessment:
- Overall stability score: 12.5/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 12.5/100) with significant
variation between runs (CV: 35.07%).
Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 2.00 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 2.00 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_mps+iphone15max_ios17_primary_time_series.png
Latency Stability Analysis: llama3_qlora+iphone15_ios18 (Primary)
================================================================================
Model: llama3_qlora
Device: iphone15_ios18
Dataset Overview:
- Number of samples: 121
- Date range: 2025-04-29 21:26:38+00:00 to 2025-05-22 22:41:19+00:00
Central Tendency Metrics:
- Mean latency: 23169.07 ms
- Median latency (P50): 21328.00 ms
Dispersion Metrics:
- Standard deviation: 5889.20 ms
- Coefficient of variation (CV): 25.42%
- Interquartile range (IQR): 8558.00 ms
Percentile Metrics:
- P50 (median): 21328.00 ms
- P90: 31324.00 ms
- P95: 33057.00 ms
- P99: 40256.40 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 3.0072
- P99/P50 ratio: 1.8875
- Mean rolling std (window=5): 4851.03 ms
Throughput Metrics:
- Mean TPS: 3.32
- TPS coefficient of variation: 34.24%
Stability Assessment:
- Overall stability score: 2.8/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 2.8/100) with significant
variation between runs (CV: 25.42%).
Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 3.01 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.89 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+iphone15_ios18_primary_time_series.png
Latency Stability Analysis: llama3_spinq+iphone15_ios18 (Primary)
================================================================================
Model: llama3_spinq
Device: iphone15_ios18
Dataset Overview:
- Number of samples: 116
- Date range: 2025-04-29 21:26:38+00:00 to 2025-05-22 22:41:19+00:00
Central Tendency Metrics:
- Mean latency: 22076.03 ms
- Median latency (P50): 20174.00 ms
Dispersion Metrics:
- Standard deviation: 6076.94 ms
- Coefficient of variation (CV): 27.53%
- Interquartile range (IQR): 7826.00 ms
Percentile Metrics:
- P50 (median): 20174.00 ms
- P90: 32507.00 ms
- P95: 34673.00 ms
- P99: 37690.75 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 2.7320
- P99/P50 ratio: 1.8683
- Mean rolling std (window=5): 4837.19 ms
Throughput Metrics:
- Mean TPS: 4.90
- TPS coefficient of variation: 35.91%
Stability Assessment:
- Overall stability score: 6.6/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 6.6/100) with significant
variation between runs (CV: 27.53%).
Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 2.73 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.87 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+iphone15_ios18_primary_time_series.png
Latency Stability Analysis: mv3_xnnq8+iphone15_ios18 (Primary)
================================================================================
Model: mv3_xnnq8
Device: iphone15_ios18
Dataset Overview:
- Number of samples: 121
- Date range: 2025-04-29 21:26:38+00:00 to 2025-05-22 22:41:19+00:00
Central Tendency Metrics:
- Mean latency: 48.23 ms
- Median latency (P50): 47.00 ms
Dispersion Metrics:
- Standard deviation: 6.19 ms
- Coefficient of variation (CV): 12.84%
- Interquartile range (IQR): 6.00 ms
Percentile Metrics:
- P50 (median): 47.00 ms
- P90: 55.00 ms
- P95: 57.00 ms
- P99: 64.40 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 2.2973
- P99/P50 ratio: 1.3702
- Mean rolling std (window=5): 5.53 ms
Stability Assessment:
- Overall stability score: 24.5/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 24.5/100) with significant
variation between runs (CV: 12.84%).
Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 2.30 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.37 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+iphone15_ios18_primary_time_series.png
Latency Stability Analysis: mv3_coreml+iphone15_ios18 (Primary)
================================================================================
Model: mv3_coreml
Device: iphone15_ios18
Dataset Overview:
- Number of samples: 114
- Date range: 2025-04-30 05:23:09+00:00 to 2025-05-22 22:41:19+00:00
Central Tendency Metrics:
- Mean latency: 1.00 ms
- Median latency (P50): 1.00 ms
Dispersion Metrics:
- Standard deviation: 0.00 ms
- Coefficient of variation (CV): 0.00%
- Interquartile range (IQR): 0.00 ms
Percentile Metrics:
- P50 (median): 1.00 ms
- P90: 1.00 ms
- P95: 1.00 ms
- P99: 1.00 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.0000
- P99/P50 ratio: 1.0000
- Mean rolling std (window=5): 0.00 ms
Stability Assessment:
- Overall stability score: 100.0/100
- Overall stability rating: Excellent
Interpretation:
The benchmark shows excellent stability (score: 100.0/100) with very low
variation between runs (CV: 0.00%).
This indicates highly consistent performance suitable for latency-sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_coreml+iphone15_ios18_primary_time_series.png
Latency Stability Analysis: mv3_mps+iphone15_ios18 (Primary)
================================================================================
Model: mv3_mps
Device: iphone15_ios18
Dataset Overview:
- Number of samples: 118
- Date range: 2025-04-29 21:26:38+00:00 to 2025-05-22 22:41:19+00:00
Central Tendency Metrics:
- Mean latency: 4.01 ms
- Median latency (P50): 4.00 ms
Dispersion Metrics:
- Standard deviation: 0.16 ms
- Coefficient of variation (CV): 3.99%
- Interquartile range (IQR): 0.00 ms
Percentile Metrics:
- P50 (median): 4.00 ms
- P90: 4.00 ms
- P95: 4.00 ms
- P99: 4.83 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.6667
- P99/P50 ratio: 1.2075
- Mean rolling std (window=5): 0.06 ms
Stability Assessment:
- Overall stability score: 66.5/100
- Overall stability rating: Moderate
Interpretation:
The benchmark shows moderate stability (score: 66.5/100) with noticeable
variation between runs (CV: 3.99%).
While average performance is acceptable, occasional latency spikes may occur.
The max/min ratio of 1.67 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.21 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_mps+iphone15_ios18_primary_time_series.png
====================================================================================================
===== ANALYZING REFERENCE DATASETS ================================================================
====================================================================================================
Latency Stability Analysis: llama3_qlora+s22_android13 (Reference)
================================================================================
Model: llama3_qlora
Device: s22_android13
Dataset Overview:
- Number of samples: 48
- Date range: 2025-04-29 09:14:21+00:00 to 2025-05-16 01:48:22+00:00
Central Tendency Metrics:
- Mean latency: 23841.98 ms
- Median latency (P50): 23381.83 ms
- Mean trimmed latency: 23727.32 ms
- Median trimmed latency: 23286.98 ms
Dispersion Metrics:
- Standard deviation: 2079.97 ms
- Coefficient of variation (CV): 8.72%
- Interquartile range (IQR): 3183.16 ms
- Trimmed standard deviation: 2068.95 ms
- Trimmed coefficient of variation: 8.72%
Percentile Metrics:
- P50 (median): 23381.83 ms
- P90: 26530.88 ms
- P95: 27370.45 ms
- P99: 28001.62 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.4300
- P99/P50 ratio: 1.1976
- Mean rolling std (window=5): 1967.20 ms
Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 0.48%
- Max trimming effect ratio: 1.00%
Throughput Metrics:
- Mean TPS: 32.18
- TPS coefficient of variation: 7.85%
Stability Assessment:
- Overall stability score: 46.1/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 46.1/100) with significant
variation between runs (CV: 8.72%).
Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 1.43 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.20 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+s22_android13_reference_time_series.png
Latency Stability Analysis: llama3_spinq+s22_android13 (Reference)
================================================================================
Model: llama3_spinq
Device: s22_android13
Dataset Overview:
- Number of samples: 48
- Date range: 2025-04-29 09:14:21+00:00 to 2025-05-16 01:48:22+00:00
Central Tendency Metrics:
- Mean latency: 22774.60 ms
- Median latency (P50): 22491.89 ms
- Mean trimmed latency: 22648.15 ms
- Median trimmed latency: 22393.30 ms
Dispersion Metrics:
- Standard deviation: 1947.04 ms
- Coefficient of variation (CV): 8.55%
- Interquartile range (IQR): 3455.61 ms
- Trimmed standard deviation: 1930.79 ms
- Trimmed coefficient of variation: 8.53%
Percentile Metrics:
- P50 (median): 22491.89 ms
- P90: 25323.67 ms
- P95: 25925.82 ms
- P99: 26148.53 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.3483
- P99/P50 ratio: 1.1626
- Mean rolling std (window=5): 1745.98 ms
Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 0.55%
- Max trimming effect ratio: 2.26%
Throughput Metrics:
- Mean TPS: 32.96
- TPS coefficient of variation: 8.16%
Stability Assessment:
- Overall stability score: 48.8/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 48.8/100) with significant
variation between runs (CV: 8.55%).
Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 1.35 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.16 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22_android13_reference_time_series.png
Latency Stability Analysis: mv3_qnn+s22_android13 (Reference)
================================================================================
Model: mv3_qnn
Device: s22_android13
Dataset Overview:
- Number of samples: 175
- Date range: 2025-04-16 01:35:32+00:00 to 2025-05-15 17:15:03+00:00
Central Tendency Metrics:
- Mean latency: 1.44 ms
- Median latency (P50): 1.00 ms
- Mean trimmed latency: 1.35 ms
- Median trimmed latency: 1.00 ms
Dispersion Metrics:
- Standard deviation: 0.83 ms
- Coefficient of variation (CV): 57.29%
- Interquartile range (IQR): 0.06 ms
- Trimmed standard deviation: 0.65 ms
- Trimmed coefficient of variation: 48.32%
Percentile Metrics:
- P50 (median): 1.00 ms
- P90: 2.71 ms
- P95: 3.25 ms
- P99: 3.95 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 4.5354
- P99/P50 ratio: 3.9482
- Mean rolling std (window=5): 0.70 ms
Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 3.01%
- Max trimming effect ratio: 32.04%
Stability Assessment:
- Overall stability score: 0.0/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 0.0/100) with significant
variation between runs (CV: 57.29%).
Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 4.54 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 3.95 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_qnn+s22_android13_reference_time_series.png
Latency Stability Analysis: mv3_xnnq8+s22_android13 (Reference)
================================================================================
Model: mv3_xnnq8
Device: s22_android13
Dataset Overview:
- Number of samples: 175
- Date range: 2025-04-16 01:35:32+00:00 to 2025-05-15 17:15:03+00:00
Central Tendency Metrics:
- Mean latency: 1.92 ms
- Median latency (P50): 1.06 ms
- Mean trimmed latency: 1.74 ms
- Median trimmed latency: 1.06 ms
Dispersion Metrics:
- Standard deviation: 1.06 ms
- Coefficient of variation (CV): 55.09%
- Interquartile range (IQR): 1.63 ms
- Trimmed standard deviation: 0.85 ms
- Trimmed coefficient of variation: 48.75%
Percentile Metrics:
- P50 (median): 1.06 ms
- P90: 3.45 ms
- P95: 3.85 ms
- P99: 4.63 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 6.1313
- P99/P50 ratio: 4.3683
- Mean rolling std (window=5): 1.08 ms
Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 5.85%
- Max trimming effect ratio: 32.08%
Stability Assessment:
- Overall stability score: 0.0/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 0.0/100) with significant
variation between runs (CV: 55.09%).
Performance is unpredictable and may lead to inconsistent user experience.
The significant difference between raw and trimmed means suggests
considerable intra-run jitter (5.8%) with occasional outliers within benchmark runs.
The max/min ratio of 6.13 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 4.37 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+s22_android13_reference_time_series.png
Latency Stability Analysis: llama3_spinq+s22_android12 (Reference)
================================================================================
Model: llama3_spinq
Device: s22_android12
Dataset Overview:
- Number of samples: 48
- Date range: 2025-04-29 09:14:21+00:00 to 2025-05-16 01:48:22+00:00
Central Tendency Metrics:
- Mean latency: 23902.04 ms
- Median latency (P50): 22762.35 ms
- Mean trimmed latency: 23743.12 ms
- Median trimmed latency: 22590.46 ms
Dispersion Metrics:
- Standard deviation: 2609.94 ms
- Coefficient of variation (CV): 10.92%
- Interquartile range (IQR): 4958.35 ms
- Trimmed standard deviation: 2588.36 ms
- Trimmed coefficient of variation: 10.90%
Percentile Metrics:
- P50 (median): 22762.35 ms
- P90: 27325.35 ms
- P95: 27425.17 ms
- P99: 27527.28 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.3689
- P99/P50 ratio: 1.2093
- Mean rolling std (window=5): 2739.23 ms
Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 0.66%
- Max trimming effect ratio: 1.58%
Throughput Metrics:
- Mean TPS: 30.86
- TPS coefficient of variation: 10.84%
Stability Assessment:
- Overall stability score: 40.2/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 40.2/100) with significant
variation between runs (CV: 10.92%).
Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 1.37 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.21 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22_android12_reference_time_series.png
Latency Stability Analysis: llama3_qlora+s22Ultra5G_android (Reference)
================================================================================
Model: llama3_qlora
Device: s22Ultra5G_android
Dataset Overview:
- Number of samples: 50
- Date range: 2025-04-29 09:14:21+00:00 to 2025-05-16 17:28:34+00:00
Central Tendency Metrics:
- Mean latency: 24685.50 ms
- Median latency (P50): 23145.09 ms
- Mean trimmed latency: 24531.08 ms
- Median trimmed latency: 22945.87 ms
Dispersion Metrics:
- Standard deviation: 2677.07 ms
- Coefficient of variation (CV): 10.84%
- Interquartile range (IQR): 5112.26 ms
- Trimmed standard deviation: 2657.25 ms
- Trimmed coefficient of variation: 10.83%
Percentile Metrics:
- P50 (median): 23145.09 ms
- P90: 28096.67 ms
- P95: 28195.43 ms
- P99: 29486.39 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.4421
- P99/P50 ratio: 1.2740
- Mean rolling std (window=5): 2527.53 ms
Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 0.62%
- Max trimming effect ratio: 1.43%
Throughput Metrics:
- Mean TPS: 30.61
- TPS coefficient of variation: 10.01%
Stability Assessment:
- Overall stability score: 37.6/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 37.6/100) with significant
variation between runs (CV: 10.84%).
Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 1.44 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.27 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+s22Ultra5G_android_reference_time_series.png
Latency Stability Analysis: llama3_spinq+s22ultra_android12 (Reference)
================================================================================
Model: llama3_spinq
Device: s22ultra_android12
Dataset Overview:
- Number of samples: 41
- Date range: 2025-04-30 01:33:50+00:00 to 2025-05-13 17:16:32+00:00
Central Tendency Metrics:
- Mean latency: 24769.21 ms
- Median latency (P50): 23249.93 ms
- Mean trimmed latency: 24611.41 ms
- Median trimmed latency: 22998.15 ms
Dispersion Metrics:
- Standard deviation: 2714.46 ms
- Coefficient of variation (CV): 10.96%
- Interquartile range (IQR): 5002.67 ms
- Trimmed standard deviation: 2691.09 ms
- Trimmed coefficient of variation: 10.93%
Percentile Metrics:
- P50 (median): 23249.93 ms
- P90: 28126.42 ms
- P95: 28225.43 ms
- P99: 29591.36 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.4421
- P99/P50 ratio: 1.2728
- Mean rolling std (window=5): 2490.40 ms
Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 0.63%
- Max trimming effect ratio: 1.43%
Throughput Metrics:
- Mean TPS: 30.58
- TPS coefficient of variation: 10.08%
Stability Assessment:
- Overall stability score: 37.7/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 37.7/100) with significant
variation between runs (CV: 10.96%).
Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 1.44 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.27 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22ultra_android12_reference_time_series.png
Latency Stability Analysis: mv3_xnnq8+s22ultra_android12 (Reference)
================================================================================
Model: mv3_xnnq8
Device: s22ultra_android12
Dataset Overview:
- Number of samples: 87
- Date range: 2025-04-16 01:35:32+00:00 to 2025-05-15 17:15:03+00:00
Central Tendency Metrics:
- Mean latency: 3.63 ms
- Median latency (P50): 3.62 ms
- Mean trimmed latency: 2.94 ms
- Median trimmed latency: 2.87 ms
Dispersion Metrics:
- Standard deviation: 0.81 ms
- Coefficient of variation (CV): 22.35%
- Interquartile range (IQR): 0.94 ms
- Trimmed standard deviation: 0.60 ms
- Trimmed coefficient of variation: 20.24%
Percentile Metrics:
- P50 (median): 3.62 ms
- P90: 4.87 ms
- P95: 5.15 ms
- P99: 5.50 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 2.7228
- P99/P50 ratio: 1.5193
- Mean rolling std (window=5): 0.77 ms
Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 17.69%
- Max trimming effect ratio: 45.14%
Stability Assessment:
- Overall stability score: 15.5/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 15.5/100) with significant
variation between runs (CV: 22.35%).
Performance is unpredictable and may lead to inconsistent user experience.
The significant difference between raw and trimmed means suggests
considerable intra-run jitter (17.7%) with occasional outliers within benchmark runs.
The max/min ratio of 2.72 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.52 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+s22ultra_android12_reference_time_series.png
Latency Stability Analysis: mv3_qnn+s22ultra_android12 (Reference)
================================================================================
Model: mv3_qnn
Device: s22ultra_android12
Dataset Overview:
- Number of samples: 88
- Date range: 2025-04-16 01:35:32+00:00 to 2025-05-15 17:15:03+00:00
Central Tendency Metrics:
- Mean latency: 1.02 ms
- Median latency (P50): 1.01 ms
- Mean trimmed latency: 1.01 ms
- Median trimmed latency: 1.01 ms
Dispersion Metrics:
- Standard deviation: 0.01 ms
- Coefficient of variation (CV): 1.35%
- Interquartile range (IQR): 0.01 ms
- Trimmed standard deviation: 0.01 ms
- Trimmed coefficient of variation: 1.15%
Percentile Metrics:
- P50 (median): 1.01 ms
- P90: 1.02 ms
- P95: 1.03 ms
- P99: 1.08 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.0990
- P99/P50 ratio: 1.0646
- Mean rolling std (window=5): 0.01 ms
Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 0.16%
- Max trimming effect ratio: 1.94%
Stability Assessment:
- Overall stability score: 90.4/100
- Overall stability rating: Excellent
Interpretation:
The benchmark shows excellent stability (score: 90.4/100) with very low
variation between runs (CV: 1.35%).
This indicates highly consistent performance suitable for latency-sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_qnn+s22ultra_android12_reference_time_series.png
Latency Stability Analysis: llama3_qlora+iphone15max_ios17 (Reference)
================================================================================
Model: llama3_qlora
Device: iphone15max_ios17
Dataset Overview:
- Number of samples: 74
- Date range: 2025-02-21 03:12:32+00:00 to 2025-05-15 02:43:34+00:00
Central Tendency Metrics:
- Mean latency: 14133.01 ms
- Median latency (P50): 13132.50 ms
Dispersion Metrics:
- Standard deviation: 3019.85 ms
- Coefficient of variation (CV): 21.37%
- Interquartile range (IQR): 527.50 ms
Percentile Metrics:
- P50 (median): 13132.50 ms
- P90: 17308.70 ms
- P95: 21197.30 ms
- P99: 25167.92 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 2.3216
- P99/P50 ratio: 1.9165
- Mean rolling std (window=5): 1535.43 ms
Throughput Metrics:
- Mean TPS: 8.81
- TPS coefficient of variation: 27.97%
Stability Assessment:
- Overall stability score: 10.6/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 10.6/100) with significant
variation between runs (CV: 21.37%).
Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 2.32 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.92 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+iphone15max_ios17_reference_time_series.png
Latency Stability Analysis: llama3_spinq+iphone15max_ios17 (Reference)
================================================================================
Model: llama3_spinq
Device: iphone15max_ios17
Dataset Overview:
- Number of samples: 72
- Date range: 2025-02-21 03:12:32+00:00 to 2025-05-15 02:43:34+00:00
Central Tendency Metrics:
- Mean latency: 13118.40 ms
- Median latency (P50): 12382.50 ms
Dispersion Metrics:
- Standard deviation: 2853.94 ms
- Coefficient of variation (CV): 21.76%
- Interquartile range (IQR): 680.50 ms
Percentile Metrics:
- P50 (median): 12382.50 ms
- P90: 14481.00 ms
- P95: 15865.05 ms
- P99: 26265.08 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 2.7878
- P99/P50 ratio: 2.1211
- Mean rolling std (window=5): 1464.57 ms
Throughput Metrics:
- Mean TPS: 12.30
- TPS coefficient of variation: 21.24%
Stability Assessment:
- Overall stability score: 2.7/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 2.7/100) with significant
variation between runs (CV: 21.76%).
Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 2.79 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 2.12 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+iphone15max_ios17_reference_time_series.png
Latency Stability Analysis: mv3_xnnq8+iphone15max_ios17 (Reference)
================================================================================
Model: mv3_xnnq8
Device: iphone15max_ios17
Dataset Overview:
- Number of samples: 73
- Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00
Central Tendency Metrics:
- Mean latency: 13.97 ms
- Median latency (P50): 13.00 ms
Dispersion Metrics:
- Standard deviation: 4.74 ms
- Coefficient of variation (CV): 33.93%
- Interquartile range (IQR): 7.00 ms
Percentile Metrics:
- P50 (median): 13.00 ms
- P90: 21.80 ms
- P95: 22.00 ms
- P99: 25.40 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 4.1429
- P99/P50 ratio: 1.9538
- Mean rolling std (window=5): 4.51 ms
Stability Assessment:
- Overall stability score: 1.2/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 1.2/100) with significant
variation between runs (CV: 33.93%).
Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 4.14 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.95 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+iphone15max_ios17_reference_time_series.png
Latency Stability Analysis: mv3_coreml+iphone15max_ios17 (Reference)
================================================================================
Model: mv3_coreml
Device: iphone15max_ios17
Dataset Overview:
- Number of samples: 21
- Date range: 2025-05-01 03:29:21+00:00 to 2025-05-22 02:53:58+00:00
Central Tendency Metrics:
- Mean latency: 1.00 ms
- Median latency (P50): 1.00 ms
Dispersion Metrics:
- Standard deviation: 0.00 ms
- Coefficient of variation (CV): 0.00%
- Interquartile range (IQR): 0.00 ms
Percentile Metrics:
- P50 (median): 1.00 ms
- P90: 1.00 ms
- P95: 1.00 ms
- P99: 1.00 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.0000
- P99/P50 ratio: 1.0000
- Mean rolling std (window=5): 0.00 ms
Stability Assessment:
- Overall stability score: 100.0/100
- Overall stability rating: Excellent
Interpretation:
The benchmark shows excellent stability (score: 100.0/100) with very low
variation between runs (CV: 0.00%).
This indicates highly consistent performance suitable for latency-sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_coreml+iphone15max_ios17_reference_time_series.png
Latency Stability Analysis: mv3_mps+iphone15max_ios17 (Reference)
================================================================================
Model: mv3_mps
Device: iphone15max_ios17
Dataset Overview:
- Number of samples: 72
- Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00
Central Tendency Metrics:
- Mean latency: 1.03 ms
- Median latency (P50): 1.00 ms
Dispersion Metrics:
- Standard deviation: 0.17 ms
- Coefficient of variation (CV): 16.10%
- Interquartile range (IQR): 0.00 ms
Percentile Metrics:
- P50 (median): 1.00 ms
- P90: 1.00 ms
- P95: 1.00 ms
- P99: 2.00 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 2.0000
- P99/P50 ratio: 2.0000
- Mean rolling std (window=5): 0.07 ms
Stability Assessment:
- Overall stability score: 12.5/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 12.5/100) with significant
variation between runs (CV: 16.10%).
Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 2.00 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 2.00 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_mps+iphone15max_ios17_reference_time_series.png
Latency Stability Analysis: llama3_qlora+iphone15_ios18 (Reference)
================================================================================
Model: llama3_qlora
Device: iphone15_ios18
Dataset Overview:
- Number of samples: 70
- Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00
Central Tendency Metrics:
- Mean latency: 14429.20 ms
- Median latency (P50): 14401.00 ms
Dispersion Metrics:
- Standard deviation: 593.06 ms
- Coefficient of variation (CV): 4.11%
- Interquartile range (IQR): 637.25 ms
Percentile Metrics:
- P50 (median): 14401.00 ms
- P90: 14970.00 ms
- P95: 15441.85 ms
- P99: 16444.58 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.2195
- P99/P50 ratio: 1.1419
- Mean rolling std (window=5): 540.47 ms
Throughput Metrics:
- Mean TPS: 5.47
- TPS coefficient of variation: 13.24%
Stability Assessment:
- Overall stability score: 73.2/100
- Overall stability rating: Moderate
Interpretation:
The benchmark shows moderate stability (score: 73.2/100) with noticeable
variation between runs (CV: 4.11%).
While average performance is acceptable, occasional latency spikes may occur.
The max/min ratio of 1.22 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.14 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+iphone15_ios18_reference_time_series.png
Latency Stability Analysis: llama3_spinq+iphone15_ios18 (Reference)
================================================================================
Model: llama3_spinq
Device: iphone15_ios18
Dataset Overview:
- Number of samples: 74
- Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00
Central Tendency Metrics:
- Mean latency: 13820.34 ms
- Median latency (P50): 13724.00 ms
Dispersion Metrics:
- Standard deviation: 662.49 ms
- Coefficient of variation (CV): 4.79%
- Interquartile range (IQR): 683.50 ms
Percentile Metrics:
- P50 (median): 13724.00 ms
- P90: 14527.80 ms
- P95: 14992.20 ms
- P99: 15822.16 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.3302
- P99/P50 ratio: 1.1529
- Mean rolling std (window=5): 542.03 ms
Throughput Metrics:
- Mean TPS: 7.96
- TPS coefficient of variation: 14.45%
Stability Assessment:
- Overall stability score: 68.1/100
- Overall stability rating: Moderate
Interpretation:
The benchmark shows moderate stability (score: 68.1/100) with noticeable
variation between runs (CV: 4.79%).
While average performance is acceptable, occasional latency spikes may occur.
The max/min ratio of 1.33 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.15 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+iphone15_ios18_reference_time_series.png
Latency Stability Analysis: mv3_xnnq8+iphone15_ios18 (Reference)
================================================================================
Model: mv3_xnnq8
Device: iphone15_ios18
Dataset Overview:
- Number of samples: 73
- Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00
Central Tendency Metrics:
- Mean latency: 49.85 ms
- Median latency (P50): 44.00 ms
Dispersion Metrics:
- Standard deviation: 20.47 ms
- Coefficient of variation (CV): 41.06%
- Interquartile range (IQR): 12.00 ms
Percentile Metrics:
- P50 (median): 44.00 ms
- P90: 82.00 ms
- P95: 100.20 ms
- P99: 121.28 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 3.9355
- P99/P50 ratio: 2.7564
- Mean rolling std (window=5): 16.45 ms
Stability Assessment:
- Overall stability score: 0.0/100
- Overall stability rating: Poor
Interpretation:
The benchmark shows poor stability (score: 0.0/100) with significant
variation between runs (CV: 41.06%).
Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 3.94 indicates
substantial performance differences between the best and worst runs.
The P99/P50 ratio of 2.76 suggests
occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+iphone15_ios18_reference_time_series.png
Latency Stability Analysis: mv3_coreml+iphone15_ios18 (Reference)
================================================================================
Model: mv3_coreml
Device: iphone15_ios18
Dataset Overview:
- Number of samples: 21
- Date range: 2025-05-01 03:29:21+00:00 to 2025-05-22 02:53:58+00:00
Central Tendency Metrics:
- Mean latency: 1.00 ms
- Median latency (P50): 1.00 ms
Dispersion Metrics:
- Standard deviation: 0.00 ms
- Coefficient of variation (CV): 0.00%
- Interquartile range (IQR): 0.00 ms
Percentile Metrics:
- P50 (median): 1.00 ms
- P90: 1.00 ms
- P95: 1.00 ms
- P99: 1.00 ms
Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.0000
- P99/P50 ratio: 1.0000
- Mean rolling std (window=5): 0.00 ms
Stability Assessment:
- Overall stability score: 100.0/100
- Overall stability rating: Excellent
Interpretation:
The benchmark shows excellent stability (score: 100.0/100) with very low
variation between runs (CV: 0.00%).
This indicates highly consistent performance suitable for latency-sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_coreml+iphone15_ios18_reference_time_series.png
Latency Stability Analysis: mv3_mps+iphone15_ios18 (Reference)
================================================================================
Model: mv3_mps
Device: iphone15_ios18
Dataset Overview:
- Number of samples: 72
- Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00
Central Tendency Metrics:
- Mean latency: 3.75 ms
- Median latency (P50): 4.00 ms
Dispersion Metrics:
- Standard deviation:…
1 parent 8aab7d0 commit 2269160
1 file changed under .ci/scripts: +1523, -0 lines
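The reports above list the same dispersion and inter-jitter metrics for every dataset (CV, IQR, Max/Min ratio, P99/P50 ratio, mean rolling std with window=5). As a rough illustration of how such numbers can be derived from a series of per-run latencies, here is a minimal sketch. It is not the code from `analyze_benchmark_stability.py`: the function names `percentile` and `stability_metrics`, the use of sample standard deviation, and the linear-interpolation percentile are all assumptions made for this example, and the overall stability score is deliberately left out because its formula is not shown in the output.

```python
import statistics


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list.

    Assumption: linear interpolation between neighbouring ranks; the real
    script may use a different percentile definition.
    """
    if not sorted_vals:
        raise ValueError("need at least one sample")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def stability_metrics(latencies_ms, window=5):
    """Hypothetical helper: summarise run-to-run latency variation.

    latencies_ms: positive per-run latencies in chronological order.
    window: number of consecutive runs per rolling-std window (>= 2).
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    vals = sorted(latencies_ms)
    mean = statistics.fmean(latencies_ms)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    std = statistics.stdev(latencies_ms) if len(latencies_ms) > 1 else 0.0
    p50 = percentile(vals, 50)
    p99 = percentile(vals, 99)
    # Rolling std is taken over consecutive runs in their original order.
    rolling = [
        statistics.stdev(latencies_ms[i:i + window])
        for i in range(len(latencies_ms) - window + 1)
    ]
    return {
        "mean": mean,
        "std": std,
        "cv_pct": 100.0 * std / mean,
        "iqr": percentile(vals, 75) - percentile(vals, 25),
        "p50": p50,
        "p99": p99,
        "max_min_ratio": vals[-1] / vals[0],
        "p99_p50_ratio": p99 / p50,
        "mean_rolling_std": statistics.fmean(rolling) if rolling else 0.0,
    }


if __name__ == "__main__":
    # Made-up series: mostly 1 ms runs with two 2 ms spikes.
    sample = [1.0] * 35 + [2.0] + [1.0] * 35 + [2.0]
    for name, value in stability_metrics(sample).items():
        print(f"{name}: {value:.4f}")
```

With a perfectly constant series, this sketch returns a CV of 0 and both ratios equal to 1.0, which is the shape of the `mv3_coreml` reports above. A few isolated spikes leave the median and IQR unchanged but raise the CV, Max/Min ratio, and P99/P50 ratio.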