Commit 2269160

guangy10 and Guang Yang authored
Script for benchmark stability assessment (#10982)
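The analysis output in this commit message reports dispersion and jitter metrics such as CV, IQR, Max/Min ratio, P99/P50 ratio, and a mean rolling std (window=5). As a rough illustration of what those numbers mean, here is a minimal sketch of how they could be computed from a list of per-run latencies. It is not taken from `analyze_benchmark_stability.py`: the function names `percentile` and `stability_metrics`, the use of the sample standard deviation, and the linear-interpolation percentile are all assumptions made for this example, and the script's overall stability score is deliberately left out because its formula is not stated here.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list.

    Hypothetical helper; the interpolation rule is an assumption.
    """
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def stability_metrics(latencies_ms, window=5):
    """Summarize run-to-run variability for per-run latencies (needs >= 2 runs).

    Hypothetical helper, not the script's API.
    """
    mean = statistics.fmean(latencies_ms)
    std = statistics.stdev(latencies_ms)  # sample std (n - 1); an assumption
    p50 = percentile(latencies_ms, 50)
    p99 = percentile(latencies_ms, 99)
    # Std of each consecutive window of runs; empty if there are too few runs.
    rolling = [
        statistics.stdev(latencies_ms[i:i + window])
        for i in range(len(latencies_ms) - window + 1)
    ]
    return {
        "mean": mean,
        "std": std,
        "cv_pct": 100.0 * std / mean,  # coefficient of variation
        "iqr": percentile(latencies_ms, 75) - percentile(latencies_ms, 25),
        "p50": p50,
        "p99": p99,
        "max_min_ratio": max(latencies_ms) / min(latencies_ms),
        "p99_p50_ratio": p99 / p50,
        "mean_rolling_std": statistics.fmean(rolling) if rolling else 0.0,
    }


if __name__ == "__main__":
    # Made-up latencies for illustration only, not values from the datasets.
    for name, value in stability_metrics([100.0, 102.0, 98.0, 101.0, 99.0, 103.0]).items():
        print(f"{name}: {value:.4f}")
```

With these definitions, a series of identical latencies gives a CV of 0% and both ratios equal to 1, which is the pattern the `mv3_coreml` results show in the output.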
### Summary

The custom script for ET benchmark stability assessment.

```
pip install openpyxl tabulate matplotlib
```

Then

```
python .ci/scripts/analyze_benchmark_stability.py \
  Benchmark\ Dataset\ with\ Private\ AWS\ Devices.xlsx \
  --reference_file Benchmark\ Dataset\ with\ Public\ AWS\ Devices.xlsx
```

Datasets:
- [Benchmark Dataset with Private AWS Devices.xlsx](https://github.com/user-attachments/files/20596657/Benchmark.Dataset.with.Private.AWS.Devices.xlsx)
- [Benchmark Dataset with Public AWS Devices.xlsx](https://github.com/user-attachments/files/20596666/Benchmark.Dataset.with.Public.AWS.Devices.xlsx)

The generated analysis:

```
Analyzing latency stability from primary file: /Users/guangyang/Desktop/Benchmark Dataset with Private AWS Devices.xlsx
Using reference file for comparison: /Users/guangyang/Desktop/Benchmark Dataset with Public AWS Devices.xlsx

====================================================================================================
===== LOADING PRIMARY DATASETS (Private) ==========================================================
====================================================================================================
Loading dataset: llama3_qlora+s22_android13
Loading dataset: llama3_spinq+s22_android13
Loading dataset: mv3_qnn+s22_android13
Loading dataset: mv3_xnnq8+s22_android13
Loading dataset: llama3_qlora+s22ultra_android14
Loading dataset: llama3_spinq+s22ultra_android14
Loading dataset: mv3_qnn+s22ultra_android14
Loading dataset: mv3_xnnq8+s22ultra_android14
Loading dataset: mv3_xnnq8+pixel3_rooted_android
Loading dataset: llama3_qlora+iphone15max_ios17
Loading dataset: llama3_spinq+iphone15max_ios17
Loading dataset: mv3_xnnq8+iphone15max_ios17
Loading dataset: mv3_coreml+iphone15max_ios17
Loading dataset: mv3_mps+iphone15max_ios17
Loading dataset: llama3_qlora+iphone15_ios18
Loading dataset: llama3_spinq+iphone15_ios18
Loading dataset: mv3_xnnq8+iphone15_ios18
Loading dataset: mv3_coreml+iphone15_ios18
Loading dataset: mv3_mps+iphone15_ios18

====================================================================================================
===== LOADING REFERENCE DATASETS (Public) =========================================================
====================================================================================================
Loading reference dataset: llama3_qlora+s22_android13
Loading reference dataset: llama3_spinq+s22_android13
Loading reference dataset: mv3_qnn+s22_android13
Loading reference dataset: mv3_xnnq8+s22_android13
Loading reference dataset: llama3_spinq+s22_android12
Loading reference dataset: llama3_qlora+s22Ultra5G_android
Loading reference dataset: llama3_spinq+s22ultra_android12
Loading reference dataset: mv3_xnnq8+s22ultra_android12
Loading reference dataset: mv3_qnn+s22ultra_android12
Loading reference dataset: llama3_qlora+iphone15max_ios17
Loading reference dataset: llama3_spinq+iphone15max_ios17
Loading reference dataset: mv3_xnnq8+iphone15max_ios17
Loading reference dataset: mv3_coreml+iphone15max_ios17
Loading reference dataset: mv3_mps+iphone15max_ios17
Loading reference dataset: llama3_qlora+iphone15_ios18
Loading reference dataset: llama3_spinq+iphone15_ios18
Loading reference dataset: mv3_xnnq8+iphone15_ios18
Loading reference dataset: mv3_coreml+iphone15_ios18
Loading reference dataset: mv3_mps+iphone15_ios18

====================================================================================================
===== ANALYZING PRIMARY DATASETS ==================================================================
====================================================================================================

Latency Stability Analysis: llama3_qlora+s22_android13 (Primary)
================================================================================
Model: llama3_qlora
Device: s22_android13

Dataset Overview:
- Number of samples: 88
- Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
- Mean latency: 22502.10 ms
- Median latency (P50): 22447.56 ms
- Mean trimmed latency: 22388.87 ms
- Median trimmed latency: 22343.47 ms

Dispersion Metrics:
- Standard deviation: 595.01 ms
- Coefficient of variation (CV): 2.64%
- Interquartile range (IQR): 858.26 ms
- Trimmed standard deviation: 596.25 ms
- Trimmed coefficient of variation: 2.66%

Percentile Metrics:
- P50 (median): 22447.56 ms
- P90: 23231.99 ms
- P95: 23518.35 ms
- P99: 23910.11 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.1423
- P99/P50 ratio: 1.0652
- Mean rolling std (window=5): 539.36 ms

Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 0.50%
- Max trimming effect ratio: 0.81%

Throughput Metrics:
- Mean TPS: 33.07
- TPS coefficient of variation: 6.92%

Stability Assessment:
- Overall stability score: 83.4/100
- Overall stability rating: Good

Interpretation:
The benchmark shows good stability (score: 83.4/100) with low variation between runs (CV: 2.64%). Performance is consistent and predictable for most use cases.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+s22_android13_primary_time_series.png

Latency Stability Analysis: llama3_spinq+s22_android13 (Primary)
================================================================================
Model: llama3_spinq
Device: s22_android13

Dataset Overview:
- Number of samples: 88
- Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
- Mean latency: 21771.59 ms
- Median latency (P50): 21668.24 ms
- Mean trimmed latency: 21662.53 ms
- Median trimmed latency: 21559.89 ms

Dispersion Metrics:
- Standard deviation: 514.89 ms
- Coefficient of variation (CV): 2.36%
- Interquartile range (IQR): 602.75 ms
- Trimmed standard deviation: 515.03 ms
- Trimmed coefficient of variation: 2.38%

Percentile Metrics:
- P50 (median): 21668.24 ms
- P90: 22438.74 ms
- P95: 22542.42 ms
- P99: 23104.76 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.1452
- P99/P50 ratio: 1.0663
- Mean rolling std (window=5): 449.10 ms

Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 0.50%
- Max trimming effect ratio: 0.89%

Throughput Metrics:
- Mean TPS: 33.76
- TPS coefficient of variation: 4.70%

Stability Assessment:
- Overall stability score: 84.7/100
- Overall stability rating: Good

Interpretation:
The benchmark shows good stability (score: 84.7/100) with low variation between runs (CV: 2.36%). Performance is consistent and predictable for most use cases.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22_android13_primary_time_series.png

Latency Stability Analysis: mv3_qnn+s22_android13 (Primary)
================================================================================
Model: mv3_qnn
Device: s22_android13

Dataset Overview:
- Number of samples: 100
- Date range: 2025-04-29 09:48:57+00:00 to 2025-05-15 21:14:41+00:00

Central Tendency Metrics:
- Mean latency: 1.01 ms
- Median latency (P50): 1.00 ms
- Mean trimmed latency: 1.00 ms
- Median trimmed latency: 1.00 ms

Dispersion Metrics:
- Standard deviation: 0.02 ms
- Coefficient of variation (CV): 2.34%
- Interquartile range (IQR): 0.01 ms
- Trimmed standard deviation: 0.02 ms
- Trimmed coefficient of variation: 2.27%

Percentile Metrics:
- P50 (median): 1.00 ms
- P90: 1.01 ms
- P95: 1.01 ms
- P99: 1.14 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.1919
- P99/P50 ratio: 1.1404
- Mean rolling std (window=5): 0.01 ms

Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 0.19%
- Max trimming effect ratio: 1.00%

Stability Assessment:
- Overall stability score: 82.4/100
- Overall stability rating: Good

Interpretation:
The benchmark shows good stability (score: 82.4/100) with low variation between runs (CV: 2.34%). Performance is consistent and predictable for most use cases. The P99/P50 ratio of 1.14 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_qnn+s22_android13_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+s22_android13 (Primary)
================================================================================
Model: mv3_xnnq8
Device: s22_android13

Dataset Overview:
- Number of samples: 88
- Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
- Mean latency: 2.73 ms
- Median latency (P50): 2.65 ms
- Mean trimmed latency: 2.22 ms
- Median trimmed latency: 2.10 ms

Dispersion Metrics:
- Standard deviation: 0.63 ms
- Coefficient of variation (CV): 23.03%
- Interquartile range (IQR): 0.95 ms
- Trimmed standard deviation: 0.36 ms
- Trimmed coefficient of variation: 15.98%

Percentile Metrics:
- P50 (median): 2.65 ms
- P90: 3.59 ms
- P95: 3.74 ms
- P99: 4.46 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 2.4427
- P99/P50 ratio: 1.6812
- Mean rolling std (window=5): 0.60 ms

Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 16.52%
- Max trimming effect ratio: 36.96%

Stability Assessment:
- Overall stability score: 14.9/100
- Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 14.9/100) with significant variation between runs (CV: 23.03%). Performance is unpredictable and may lead to inconsistent user experience. The significant difference between raw and trimmed means suggests considerable intra-run jitter (16.5%) with occasional outliers within benchmark runs. The max/min ratio of 2.44 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.68 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+s22_android13_primary_time_series.png

Latency Stability Analysis: llama3_qlora+s22ultra_android14 (Primary)
================================================================================
Model: llama3_qlora
Device: s22ultra_android14

Dataset Overview:
- Number of samples: 88
- Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
- Mean latency: 25022.84 ms
- Median latency (P50): 25427.33 ms
- Mean trimmed latency: 24748.06 ms
- Median trimmed latency: 25062.01 ms

Dispersion Metrics:
- Standard deviation: 1545.62 ms
- Coefficient of variation (CV): 6.18%
- Interquartile range (IQR): 2844.11 ms
- Trimmed standard deviation: 1467.60 ms
- Trimmed coefficient of variation: 5.93%

Percentile Metrics:
- P50 (median): 25427.33 ms
- P90: 26581.31 ms
- P95: 27184.07 ms
- P99: 28668.97 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.2710
- P99/P50 ratio: 1.1275
- Mean rolling std (window=5): 1560.71 ms

Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 1.08%
- Max trimming effect ratio: 4.80%

Throughput Metrics:
- Mean TPS: 28.35
- TPS coefficient of variation: 7.88%

Stability Assessment:
- Overall stability score: 62.5/100
- Overall stability rating: Moderate

Interpretation:
The benchmark shows moderate stability (score: 62.5/100) with noticeable variation between runs (CV: 6.18%). While average performance is acceptable, occasional latency spikes may occur. The max/min ratio of 1.27 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.13 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+s22ultra_android14_primary_time_series.png

Latency Stability Analysis: llama3_spinq+s22ultra_android14 (Primary)
================================================================================
Model: llama3_spinq
Device: s22ultra_android14

Dataset Overview:
- Number of samples: 88
- Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
- Mean latency: 24761.78 ms
- Median latency (P50): 25043.89 ms
- Mean trimmed latency: 24466.21 ms
- Median trimmed latency: 24731.04 ms

Dispersion Metrics:
- Standard deviation: 1552.25 ms
- Coefficient of variation (CV): 6.27%
- Interquartile range (IQR): 1931.42 ms
- Trimmed standard deviation: 1466.19 ms
- Trimmed coefficient of variation: 5.99%

Percentile Metrics:
- P50 (median): 25043.89 ms
- P90: 26163.60 ms
- P95: 26948.68 ms
- P99: 28868.51 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.3648
- P99/P50 ratio: 1.1527
- Mean rolling std (window=5): 1451.05 ms

Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 1.17%
- Max trimming effect ratio: 4.90%

Throughput Metrics:
- Mean TPS: 29.85
- TPS coefficient of variation: 8.24%

Stability Assessment:
- Overall stability score: 60.3/100
- Overall stability rating: Moderate

Interpretation:
The benchmark shows moderate stability (score: 60.3/100) with noticeable variation between runs (CV: 6.27%). While average performance is acceptable, occasional latency spikes may occur. The max/min ratio of 1.36 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.15 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22ultra_android14_primary_time_series.png

Latency Stability Analysis: mv3_qnn+s22ultra_android14 (Primary)
================================================================================
Model: mv3_qnn
Device: s22ultra_android14

Dataset Overview:
- Number of samples: 100
- Date range: 2025-04-29 09:48:57+00:00 to 2025-05-15 21:14:41+00:00

Central Tendency Metrics:
- Mean latency: 1.01 ms
- Median latency (P50): 1.01 ms
- Mean trimmed latency: 1.01 ms
- Median trimmed latency: 1.01 ms

Dispersion Metrics:
- Standard deviation: 0.01 ms
- Coefficient of variation (CV): 0.91%
- Interquartile range (IQR): 0.01 ms
- Trimmed standard deviation: 0.01 ms
- Trimmed coefficient of variation: 0.70%

Percentile Metrics:
- P50 (median): 1.01 ms
- P90: 1.02 ms
- P95: 1.02 ms
- P99: 1.03 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.0900
- P99/P50 ratio: 1.0204
- Mean rolling std (window=5): 0.01 ms

Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 0.19%
- Max trimming effect ratio: 1.94%

Stability Assessment:
- Overall stability score: 93.8/100
- Overall stability rating: Excellent

Interpretation:
The benchmark shows excellent stability (score: 93.8/100) with very low variation between runs (CV: 0.91%). This indicates highly consistent performance suitable for latency-sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_qnn+s22ultra_android14_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+s22ultra_android14 (Primary)
================================================================================
Model: mv3_xnnq8
Device: s22ultra_android14

Dataset Overview:
- Number of samples: 88
- Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
- Mean latency: 2.91 ms
- Median latency (P50): 2.54 ms
- Mean trimmed latency: 2.41 ms
- Median trimmed latency: 2.15 ms

Dispersion Metrics:
- Standard deviation: 1.14 ms
- Coefficient of variation (CV): 39.08%
- Interquartile range (IQR): 0.82 ms
- Trimmed standard deviation: 0.76 ms
- Trimmed coefficient of variation: 31.60%

Percentile Metrics:
- P50 (median): 2.54 ms
- P90: 3.88 ms
- P95: 4.60 ms
- P99: 5.91 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 5.6103
- P99/P50 ratio: 2.3319
- Mean rolling std (window=5): 0.79 ms

Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 15.37%
- Max trimming effect ratio: 38.83%

Stability Assessment:
- Overall stability score: 0.0/100
- Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 0.0/100) with significant variation between runs (CV: 39.08%). Performance is unpredictable and may lead to inconsistent user experience. The significant difference between raw and trimmed means suggests considerable intra-run jitter (15.4%) with occasional outliers within benchmark runs. The max/min ratio of 5.61 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 2.33 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+s22ultra_android14_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+pixel3_rooted_android (Primary)
================================================================================
Model: mv3_xnnq8
Device: pixel3_rooted_android

Dataset Overview:
- Number of samples: 148
- Date range: 2025-04-16 02:47:21+00:00 to 2025-04-29 01:17:49+00:00

Central Tendency Metrics:
- Mean latency: 5.93 ms
- Median latency (P50): 5.87 ms
- Mean trimmed latency: 5.51 ms
- Median trimmed latency: 5.45 ms

Dispersion Metrics:
- Standard deviation: 0.46 ms
- Coefficient of variation (CV): 7.68%
- Interquartile range (IQR): 0.56 ms
- Trimmed standard deviation: 0.27 ms
- Trimmed coefficient of variation: 4.84%

Percentile Metrics:
- P50 (median): 5.87 ms
- P90: 6.44 ms
- P95: 6.57 ms
- P99: 7.26 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.6964
- P99/P50 ratio: 1.2386
- Mean rolling std (window=5): 0.41 ms

Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 6.66%
- Max trimming effect ratio: 26.67%

Stability Assessment:
- Overall stability score: 46.9/100
- Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 46.9/100) with significant variation between runs (CV: 7.68%). Performance is unpredictable and may lead to inconsistent user experience. The significant difference between raw and trimmed means suggests considerable intra-run jitter (6.7%) with occasional outliers within benchmark runs. The max/min ratio of 1.70 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.24 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+pixel3_rooted_android_primary_time_series.png

Latency Stability Analysis: llama3_qlora+iphone15max_ios17 (Primary)
================================================================================
Model: llama3_qlora
Device: iphone15max_ios17

Dataset Overview:
- Number of samples: 54
- Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00

Central Tendency Metrics:
- Mean latency: 12972.80 ms
- Median latency (P50): 12774.50 ms

Dispersion Metrics:
- Standard deviation: 483.26 ms
- Coefficient of variation (CV): 3.73%
- Interquartile range (IQR): 624.00 ms

Percentile Metrics:
- P50 (median): 12774.50 ms
- P90: 13389.70 ms
- P95: 13736.05 ms
- P99: 14730.49 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.1916
- P99/P50 ratio: 1.1531
- Mean rolling std (window=5): 431.32 ms

Throughput Metrics:
- Mean TPS: 10.18
- TPS coefficient of variation: 11.47%

Stability Assessment:
- Overall stability score: 75.2/100
- Overall stability rating: Moderate

Interpretation:
The benchmark shows moderate stability (score: 75.2/100) with noticeable variation between runs (CV: 3.73%). While average performance is acceptable, occasional latency spikes may occur. The P99/P50 ratio of 1.15 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+iphone15max_ios17_primary_time_series.png

Latency Stability Analysis: llama3_spinq+iphone15max_ios17 (Primary)
================================================================================
Model: llama3_spinq
Device: iphone15max_ios17

Dataset Overview:
- Number of samples: 54
- Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00

Central Tendency Metrics:
- Mean latency: 12195.41 ms
- Median latency (P50): 12104.50 ms

Dispersion Metrics:
- Standard deviation: 461.27 ms
- Coefficient of variation (CV): 3.78%
- Interquartile range (IQR): 154.25 ms

Percentile Metrics:
- P50 (median): 12104.50 ms
- P90: 12567.20 ms
- P95: 12760.05 ms
- P99: 14052.31 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.3331
- P99/P50 ratio: 1.1609
- Mean rolling std (window=5): 365.79 ms

Throughput Metrics:
- Mean TPS: 13.89
- TPS coefficient of variation: 16.58%

Stability Assessment:
- Overall stability score: 72.9/100
- Overall stability rating: Moderate

Interpretation:
The benchmark shows moderate stability (score: 72.9/100) with noticeable variation between runs (CV: 3.78%). While average performance is acceptable, occasional latency spikes may occur. The max/min ratio of 1.33 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.16 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+iphone15max_ios17_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+iphone15max_ios17 (Primary)
================================================================================
Model: mv3_xnnq8
Device: iphone15max_ios17

Dataset Overview:
- Number of samples: 54
- Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00

Central Tendency Metrics:
- Mean latency: 13.98 ms
- Median latency (P50): 14.00 ms

Dispersion Metrics:
- Standard deviation: 3.44 ms
- Coefficient of variation (CV): 24.60%
- Interquartile range (IQR): 4.00 ms

Percentile Metrics:
- P50 (median): 14.00 ms
- P90: 18.00 ms
- P95: 20.00 ms
- P99: 21.94 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 3.2857
- P99/P50 ratio: 1.5671
- Mean rolling std (window=5): 3.40 ms

Stability Assessment:
- Overall stability score: 10.8/100
- Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 10.8/100) with significant variation between runs (CV: 24.60%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 3.29 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.57 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+iphone15max_ios17_primary_time_series.png

Latency Stability Analysis: mv3_coreml+iphone15max_ios17 (Primary)
================================================================================
Model: mv3_coreml
Device: iphone15max_ios17

Dataset Overview:
- Number of samples: 50
- Date range: 2025-04-30 05:23:09+00:00 to 2025-05-10 09:24:40+00:00

Central Tendency Metrics:
- Mean latency: 1.00 ms
- Median latency (P50): 1.00 ms

Dispersion Metrics:
- Standard deviation: 0.00 ms
- Coefficient of variation (CV): 0.00%
- Interquartile range (IQR): 0.00 ms

Percentile Metrics:
- P50 (median): 1.00 ms
- P90: 1.00 ms
- P95: 1.00 ms
- P99: 1.00 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.0000
- P99/P50 ratio: 1.0000
- Mean rolling std (window=5): 0.00 ms

Stability Assessment:
- Overall stability score: 100.0/100
- Overall stability rating: Excellent

Interpretation:
The benchmark shows excellent stability (score: 100.0/100) with very low variation between runs (CV: 0.00%). This indicates highly consistent performance suitable for latency-sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_coreml+iphone15max_ios17_primary_time_series.png

Latency Stability Analysis: mv3_mps+iphone15max_ios17 (Primary)
================================================================================
Model: mv3_mps
Device: iphone15max_ios17

Dataset Overview:
- Number of samples: 51
- Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00

Central Tendency Metrics:
- Mean latency: 1.25 ms
- Median latency (P50): 1.00 ms

Dispersion Metrics:
- Standard deviation: 0.44 ms
- Coefficient of variation (CV): 35.07%
- Interquartile range (IQR): 0.50 ms

Percentile Metrics:
- P50 (median): 1.00 ms
- P90: 2.00 ms
- P95: 2.00 ms
- P99: 2.00 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 2.0000
- P99/P50 ratio: 2.0000
- Mean rolling std (window=5): 0.39 ms

Stability Assessment:
- Overall stability score: 12.5/100
- Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 12.5/100) with significant variation between runs (CV: 35.07%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 2.00 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 2.00 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_mps+iphone15max_ios17_primary_time_series.png

Latency Stability Analysis: llama3_qlora+iphone15_ios18 (Primary)
================================================================================
Model: llama3_qlora
Device: iphone15_ios18

Dataset Overview:
- Number of samples: 121
- Date range: 2025-04-29 21:26:38+00:00 to 2025-05-22 22:41:19+00:00

Central Tendency Metrics:
- Mean latency: 23169.07 ms
- Median latency (P50): 21328.00 ms

Dispersion Metrics:
- Standard deviation: 5889.20 ms
- Coefficient of variation (CV): 25.42%
- Interquartile range (IQR): 8558.00 ms

Percentile Metrics:
- P50 (median): 21328.00 ms
- P90: 31324.00 ms
- P95: 33057.00 ms
- P99: 40256.40 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 3.0072
- P99/P50 ratio: 1.8875
- Mean rolling std (window=5): 4851.03 ms

Throughput Metrics:
- Mean TPS: 3.32
- TPS coefficient of variation: 34.24%

Stability Assessment:
- Overall stability score: 2.8/100
- Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 2.8/100) with significant variation between runs (CV: 25.42%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 3.01 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.89 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+iphone15_ios18_primary_time_series.png

Latency Stability Analysis: llama3_spinq+iphone15_ios18 (Primary)
================================================================================
Model: llama3_spinq
Device: iphone15_ios18

Dataset Overview:
- Number of samples: 116
- Date range: 2025-04-29 21:26:38+00:00 to 2025-05-22 22:41:19+00:00

Central Tendency Metrics:
- Mean latency: 22076.03 ms
- Median latency (P50): 20174.00 ms

Dispersion Metrics:
- Standard deviation: 6076.94 ms
- Coefficient of variation (CV): 27.53%
- Interquartile range (IQR): 7826.00 ms

Percentile Metrics:
- P50 (median): 20174.00 ms
- P90: 32507.00 ms
- P95: 34673.00 ms
- P99: 37690.75 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 2.7320
- P99/P50 ratio: 1.8683
- Mean rolling std (window=5): 4837.19 ms

Throughput Metrics:
- Mean TPS: 4.90
- TPS coefficient of variation: 35.91%

Stability Assessment:
- Overall stability score: 6.6/100
- Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 6.6/100) with significant variation between runs (CV: 27.53%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 2.73 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.87 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+iphone15_ios18_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+iphone15_ios18 (Primary)
================================================================================
Model: mv3_xnnq8
Device: iphone15_ios18

Dataset Overview:
- Number of samples: 121
- Date range: 2025-04-29 21:26:38+00:00 to 2025-05-22 22:41:19+00:00

Central Tendency Metrics:
- Mean latency: 48.23 ms
- Median latency (P50): 47.00 ms

Dispersion Metrics:
- Standard deviation: 6.19 ms
- Coefficient of variation (CV): 12.84%
- Interquartile range (IQR): 6.00 ms

Percentile Metrics:
- P50 (median): 47.00 ms
- P90: 55.00 ms
- P95: 57.00 ms
- P99: 64.40 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 2.2973
- P99/P50 ratio: 1.3702
- Mean rolling std (window=5): 5.53 ms

Stability Assessment:
- Overall stability score: 24.5/100
- Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 24.5/100) with significant variation between runs (CV: 12.84%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 2.30 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.37 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+iphone15_ios18_primary_time_series.png

Latency Stability Analysis: mv3_coreml+iphone15_ios18 (Primary)
================================================================================
Model: mv3_coreml
Device: iphone15_ios18

Dataset Overview:
- Number of samples: 114
- Date range: 2025-04-30 05:23:09+00:00 to 2025-05-22 22:41:19+00:00

Central Tendency Metrics:
- Mean latency: 1.00 ms
- Median latency (P50): 1.00 ms

Dispersion Metrics:
- Standard deviation: 0.00 ms
- Coefficient of variation (CV): 0.00%
- Interquartile range (IQR): 0.00 ms

Percentile Metrics:
- P50 (median): 1.00 ms
- P90: 1.00 ms
- P95: 1.00 ms
- P99: 1.00 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.0000
- P99/P50 ratio: 1.0000
- Mean rolling std (window=5): 0.00 ms

Stability Assessment:
- Overall stability score: 100.0/100
- Overall stability rating: Excellent

Interpretation:
The benchmark shows excellent stability (score: 100.0/100) with very low variation between runs (CV: 0.00%). This indicates highly consistent performance suitable for latency-sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_coreml+iphone15_ios18_primary_time_series.png

Latency Stability Analysis: mv3_mps+iphone15_ios18 (Primary)
================================================================================
Model: mv3_mps
Device: iphone15_ios18

Dataset Overview:
- Number of samples: 118
- Date range: 2025-04-29 21:26:38+00:00 to 2025-05-22 22:41:19+00:00

Central Tendency Metrics:
- Mean latency: 4.01 ms
- Median latency (P50): 4.00 ms

Dispersion Metrics:
- Standard deviation: 0.16 ms
- Coefficient of variation (CV): 3.99%
- Interquartile range (IQR): 0.00 ms

Percentile Metrics:
- P50 (median): 4.00 ms
- P90: 4.00 ms
- P95: 4.00 ms
- P99: 4.83 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 1.6667
- P99/P50 ratio: 1.2075
- Mean rolling std (window=5): 0.06 ms

Stability Assessment:
- Overall stability score: 66.5/100
- Overall stability rating: Moderate

Interpretation:
The benchmark shows moderate stability (score: 66.5/100) with noticeable variation between runs (CV: 3.99%). While average performance is acceptable, occasional latency spikes may occur. The max/min ratio of 1.67 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.21 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================ Generated time series plot: stability_analysis_results/mv3_mps+iphone15_ios18_primary_time_series.png ==================================================================================================== ===== ANALYZING REFERENCE DATASETS ================================================================ ==================================================================================================== Latency Stability Analysis: llama3_qlora+s22_android13 (Reference) ================================================================================ Model: llama3_qlora Device: s22_android13 Dataset Overview: - Number of samples: 48 - Date range: 2025-04-29 09:14:21+00:00 to 2025-05-16 01:48:22+00:00 Central Tendency Metrics: - Mean latency: 23841.98 ms - Median latency (P50): 23381.83 ms - Mean trimmed latency: 23727.32 ms - Median trimmed latency: 23286.98 ms Dispersion Metrics: - Standard deviation: 2079.97 ms - Coefficient of variation (CV): 8.72% - Interquartile range (IQR): 3183.16 ms - Trimmed standard deviation: 2068.95 ms - Trimmed coefficient of variation: 8.72% Percentile Metrics: - P50 (median): 23381.83 ms - P90: 26530.88 ms - P95: 27370.45 ms - P99: 28001.62 ms Inter-Jitter Metrics (variability between runs): - Max/Min ratio: 1.4300 - P99/P50 ratio: 1.1976 - Mean rolling std (window=5): 1967.20 ms Intra-Jitter Metrics (variability within runs): - Mean trimming effect ratio: 0.48% - Max trimming effect ratio: 1.00% Throughput Metrics: - Mean TPS: 32.18 - TPS coefficient of variation: 7.85% Stability Assessment: - Overall stability score: 46.1/100 - Overall stability rating: Poor Interpretation: The benchmark shows poor stability (score: 46.1/100) with significant variation between runs (CV: 8.72%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 1.43 indicates substantial performance differences between the best and worst runs. 
The P99/P50 ratio of 1.20 suggests occasional latency spikes that could affect tail latency sensitive applications. ================================================================================ Generated time series plot: stability_analysis_results/llama3_qlora+s22_android13_reference_time_series.png Latency Stability Analysis: llama3_spinq+s22_android13 (Reference) ================================================================================ Model: llama3_spinq Device: s22_android13 Dataset Overview: - Number of samples: 48 - Date range: 2025-04-29 09:14:21+00:00 to 2025-05-16 01:48:22+00:00 Central Tendency Metrics: - Mean latency: 22774.60 ms - Median latency (P50): 22491.89 ms - Mean trimmed latency: 22648.15 ms - Median trimmed latency: 22393.30 ms Dispersion Metrics: - Standard deviation: 1947.04 ms - Coefficient of variation (CV): 8.55% - Interquartile range (IQR): 3455.61 ms - Trimmed standard deviation: 1930.79 ms - Trimmed coefficient of variation: 8.53% Percentile Metrics: - P50 (median): 22491.89 ms - P90: 25323.67 ms - P95: 25925.82 ms - P99: 26148.53 ms Inter-Jitter Metrics (variability between runs): - Max/Min ratio: 1.3483 - P99/P50 ratio: 1.1626 - Mean rolling std (window=5): 1745.98 ms Intra-Jitter Metrics (variability within runs): - Mean trimming effect ratio: 0.55% - Max trimming effect ratio: 2.26% Throughput Metrics: - Mean TPS: 32.96 - TPS coefficient of variation: 8.16% Stability Assessment: - Overall stability score: 48.8/100 - Overall stability rating: Poor Interpretation: The benchmark shows poor stability (score: 48.8/100) with significant variation between runs (CV: 8.55%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 1.35 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.16 suggests occasional latency spikes that could affect tail latency sensitive applications. 
================================================================================ Generated time series plot: stability_analysis_results/llama3_spinq+s22_android13_reference_time_series.png Latency Stability Analysis: mv3_qnn+s22_android13 (Reference) ================================================================================ Model: mv3_qnn Device: s22_android13 Dataset Overview: - Number of samples: 175 - Date range: 2025-04-16 01:35:32+00:00 to 2025-05-15 17:15:03+00:00 Central Tendency Metrics: - Mean latency: 1.44 ms - Median latency (P50): 1.00 ms - Mean trimmed latency: 1.35 ms - Median trimmed latency: 1.00 ms Dispersion Metrics: - Standard deviation: 0.83 ms - Coefficient of variation (CV): 57.29% - Interquartile range (IQR): 0.06 ms - Trimmed standard deviation: 0.65 ms - Trimmed coefficient of variation: 48.32% Percentile Metrics: - P50 (median): 1.00 ms - P90: 2.71 ms - P95: 3.25 ms - P99: 3.95 ms Inter-Jitter Metrics (variability between runs): - Max/Min ratio: 4.5354 - P99/P50 ratio: 3.9482 - Mean rolling std (window=5): 0.70 ms Intra-Jitter Metrics (variability within runs): - Mean trimming effect ratio: 3.01% - Max trimming effect ratio: 32.04% Stability Assessment: - Overall stability score: 0.0/100 - Overall stability rating: Poor Interpretation: The benchmark shows poor stability (score: 0.0/100) with significant variation between runs (CV: 57.29%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 4.54 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 3.95 suggests occasional latency spikes that could affect tail latency sensitive applications. 
================================================================================ Generated time series plot: stability_analysis_results/mv3_qnn+s22_android13_reference_time_series.png Latency Stability Analysis: mv3_xnnq8+s22_android13 (Reference) ================================================================================ Model: mv3_xnnq8 Device: s22_android13 Dataset Overview: - Number of samples: 175 - Date range: 2025-04-16 01:35:32+00:00 to 2025-05-15 17:15:03+00:00 Central Tendency Metrics: - Mean latency: 1.92 ms - Median latency (P50): 1.06 ms - Mean trimmed latency: 1.74 ms - Median trimmed latency: 1.06 ms Dispersion Metrics: - Standard deviation: 1.06 ms - Coefficient of variation (CV): 55.09% - Interquartile range (IQR): 1.63 ms - Trimmed standard deviation: 0.85 ms - Trimmed coefficient of variation: 48.75% Percentile Metrics: - P50 (median): 1.06 ms - P90: 3.45 ms - P95: 3.85 ms - P99: 4.63 ms Inter-Jitter Metrics (variability between runs): - Max/Min ratio: 6.1313 - P99/P50 ratio: 4.3683 - Mean rolling std (window=5): 1.08 ms Intra-Jitter Metrics (variability within runs): - Mean trimming effect ratio: 5.85% - Max trimming effect ratio: 32.08% Stability Assessment: - Overall stability score: 0.0/100 - Overall stability rating: Poor Interpretation: The benchmark shows poor stability (score: 0.0/100) with significant variation between runs (CV: 55.09%). Performance is unpredictable and may lead to inconsistent user experience. The significant difference between raw and trimmed means suggests considerable intra-run jitter (5.8%) with occasional outliers within benchmark runs. The max/min ratio of 6.13 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 4.37 suggests occasional latency spikes that could affect tail latency sensitive applications. 
================================================================================ Generated time series plot: stability_analysis_results/mv3_xnnq8+s22_android13_reference_time_series.png Latency Stability Analysis: llama3_spinq+s22_android12 (Reference) ================================================================================ Model: llama3_spinq Device: s22_android12 Dataset Overview: - Number of samples: 48 - Date range: 2025-04-29 09:14:21+00:00 to 2025-05-16 01:48:22+00:00 Central Tendency Metrics: - Mean latency: 23902.04 ms - Median latency (P50): 22762.35 ms - Mean trimmed latency: 23743.12 ms - Median trimmed latency: 22590.46 ms Dispersion Metrics: - Standard deviation: 2609.94 ms - Coefficient of variation (CV): 10.92% - Interquartile range (IQR): 4958.35 ms - Trimmed standard deviation: 2588.36 ms - Trimmed coefficient of variation: 10.90% Percentile Metrics: - P50 (median): 22762.35 ms - P90: 27325.35 ms - P95: 27425.17 ms - P99: 27527.28 ms Inter-Jitter Metrics (variability between runs): - Max/Min ratio: 1.3689 - P99/P50 ratio: 1.2093 - Mean rolling std (window=5): 2739.23 ms Intra-Jitter Metrics (variability within runs): - Mean trimming effect ratio: 0.66% - Max trimming effect ratio: 1.58% Throughput Metrics: - Mean TPS: 30.86 - TPS coefficient of variation: 10.84% Stability Assessment: - Overall stability score: 40.2/100 - Overall stability rating: Poor Interpretation: The benchmark shows poor stability (score: 40.2/100) with significant variation between runs (CV: 10.92%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 1.37 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.21 suggests occasional latency spikes that could affect tail latency sensitive applications. 
================================================================================ Generated time series plot: stability_analysis_results/llama3_spinq+s22_android12_reference_time_series.png Latency Stability Analysis: llama3_qlora+s22Ultra5G_android (Reference) ================================================================================ Model: llama3_qlora Device: s22Ultra5G_android Dataset Overview: - Number of samples: 50 - Date range: 2025-04-29 09:14:21+00:00 to 2025-05-16 17:28:34+00:00 Central Tendency Metrics: - Mean latency: 24685.50 ms - Median latency (P50): 23145.09 ms - Mean trimmed latency: 24531.08 ms - Median trimmed latency: 22945.87 ms Dispersion Metrics: - Standard deviation: 2677.07 ms - Coefficient of variation (CV): 10.84% - Interquartile range (IQR): 5112.26 ms - Trimmed standard deviation: 2657.25 ms - Trimmed coefficient of variation: 10.83% Percentile Metrics: - P50 (median): 23145.09 ms - P90: 28096.67 ms - P95: 28195.43 ms - P99: 29486.39 ms Inter-Jitter Metrics (variability between runs): - Max/Min ratio: 1.4421 - P99/P50 ratio: 1.2740 - Mean rolling std (window=5): 2527.53 ms Intra-Jitter Metrics (variability within runs): - Mean trimming effect ratio: 0.62% - Max trimming effect ratio: 1.43% Throughput Metrics: - Mean TPS: 30.61 - TPS coefficient of variation: 10.01% Stability Assessment: - Overall stability score: 37.6/100 - Overall stability rating: Poor Interpretation: The benchmark shows poor stability (score: 37.6/100) with significant variation between runs (CV: 10.84%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 1.44 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.27 suggests occasional latency spikes that could affect tail latency sensitive applications. 
================================================================================ Generated time series plot: stability_analysis_results/llama3_qlora+s22Ultra5G_android_reference_time_series.png Latency Stability Analysis: llama3_spinq+s22ultra_android12 (Reference) ================================================================================ Model: llama3_spinq Device: s22ultra_android12 Dataset Overview: - Number of samples: 41 - Date range: 2025-04-30 01:33:50+00:00 to 2025-05-13 17:16:32+00:00 Central Tendency Metrics: - Mean latency: 24769.21 ms - Median latency (P50): 23249.93 ms - Mean trimmed latency: 24611.41 ms - Median trimmed latency: 22998.15 ms Dispersion Metrics: - Standard deviation: 2714.46 ms - Coefficient of variation (CV): 10.96% - Interquartile range (IQR): 5002.67 ms - Trimmed standard deviation: 2691.09 ms - Trimmed coefficient of variation: 10.93% Percentile Metrics: - P50 (median): 23249.93 ms - P90: 28126.42 ms - P95: 28225.43 ms - P99: 29591.36 ms Inter-Jitter Metrics (variability between runs): - Max/Min ratio: 1.4421 - P99/P50 ratio: 1.2728 - Mean rolling std (window=5): 2490.40 ms Intra-Jitter Metrics (variability within runs): - Mean trimming effect ratio: 0.63% - Max trimming effect ratio: 1.43% Throughput Metrics: - Mean TPS: 30.58 - TPS coefficient of variation: 10.08% Stability Assessment: - Overall stability score: 37.7/100 - Overall stability rating: Poor Interpretation: The benchmark shows poor stability (score: 37.7/100) with significant variation between runs (CV: 10.96%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 1.44 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.27 suggests occasional latency spikes that could affect tail latency sensitive applications. 
================================================================================ Generated time series plot: stability_analysis_results/llama3_spinq+s22ultra_android12_reference_time_series.png Latency Stability Analysis: mv3_xnnq8+s22ultra_android12 (Reference) ================================================================================ Model: mv3_xnnq8 Device: s22ultra_android12 Dataset Overview: - Number of samples: 87 - Date range: 2025-04-16 01:35:32+00:00 to 2025-05-15 17:15:03+00:00 Central Tendency Metrics: - Mean latency: 3.63 ms - Median latency (P50): 3.62 ms - Mean trimmed latency: 2.94 ms - Median trimmed latency: 2.87 ms Dispersion Metrics: - Standard deviation: 0.81 ms - Coefficient of variation (CV): 22.35% - Interquartile range (IQR): 0.94 ms - Trimmed standard deviation: 0.60 ms - Trimmed coefficient of variation: 20.24% Percentile Metrics: - P50 (median): 3.62 ms - P90: 4.87 ms - P95: 5.15 ms - P99: 5.50 ms Inter-Jitter Metrics (variability between runs): - Max/Min ratio: 2.7228 - P99/P50 ratio: 1.5193 - Mean rolling std (window=5): 0.77 ms Intra-Jitter Metrics (variability within runs): - Mean trimming effect ratio: 17.69% - Max trimming effect ratio: 45.14% Stability Assessment: - Overall stability score: 15.5/100 - Overall stability rating: Poor Interpretation: The benchmark shows poor stability (score: 15.5/100) with significant variation between runs (CV: 22.35%). Performance is unpredictable and may lead to inconsistent user experience. The significant difference between raw and trimmed means suggests considerable intra-run jitter (17.7%) with occasional outliers within benchmark runs. The max/min ratio of 2.72 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.52 suggests occasional latency spikes that could affect tail latency sensitive applications. 
================================================================================ Generated time series plot: stability_analysis_results/mv3_xnnq8+s22ultra_android12_reference_time_series.png Latency Stability Analysis: mv3_qnn+s22ultra_android12 (Reference) ================================================================================ Model: mv3_qnn Device: s22ultra_android12 Dataset Overview: - Number of samples: 88 - Date range: 2025-04-16 01:35:32+00:00 to 2025-05-15 17:15:03+00:00 Central Tendency Metrics: - Mean latency: 1.02 ms - Median latency (P50): 1.01 ms - Mean trimmed latency: 1.01 ms - Median trimmed latency: 1.01 ms Dispersion Metrics: - Standard deviation: 0.01 ms - Coefficient of variation (CV): 1.35% - Interquartile range (IQR): 0.01 ms - Trimmed standard deviation: 0.01 ms - Trimmed coefficient of variation: 1.15% Percentile Metrics: - P50 (median): 1.01 ms - P90: 1.02 ms - P95: 1.03 ms - P99: 1.08 ms Inter-Jitter Metrics (variability between runs): - Max/Min ratio: 1.0990 - P99/P50 ratio: 1.0646 - Mean rolling std (window=5): 0.01 ms Intra-Jitter Metrics (variability within runs): - Mean trimming effect ratio: 0.16% - Max trimming effect ratio: 1.94% Stability Assessment: - Overall stability score: 90.4/100 - Overall stability rating: Excellent Interpretation: The benchmark shows excellent stability (score: 90.4/100) with very low variation between runs (CV: 1.35%). This indicates highly consistent performance suitable for latency-sensitive applications. 
================================================================================ Generated time series plot: stability_analysis_results/mv3_qnn+s22ultra_android12_reference_time_series.png Latency Stability Analysis: llama3_qlora+iphone15max_ios17 (Reference) ================================================================================ Model: llama3_qlora Device: iphone15max_ios17 Dataset Overview: - Number of samples: 74 - Date range: 2025-02-21 03:12:32+00:00 to 2025-05-15 02:43:34+00:00 Central Tendency Metrics: - Mean latency: 14133.01 ms - Median latency (P50): 13132.50 ms Dispersion Metrics: - Standard deviation: 3019.85 ms - Coefficient of variation (CV): 21.37% - Interquartile range (IQR): 527.50 ms Percentile Metrics: - P50 (median): 13132.50 ms - P90: 17308.70 ms - P95: 21197.30 ms - P99: 25167.92 ms Inter-Jitter Metrics (variability between runs): - Max/Min ratio: 2.3216 - P99/P50 ratio: 1.9165 - Mean rolling std (window=5): 1535.43 ms Throughput Metrics: - Mean TPS: 8.81 - TPS coefficient of variation: 27.97% Stability Assessment: - Overall stability score: 10.6/100 - Overall stability rating: Poor Interpretation: The benchmark shows poor stability (score: 10.6/100) with significant variation between runs (CV: 21.37%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 2.32 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.92 suggests occasional latency spikes that could affect tail latency sensitive applications. 
================================================================================ Generated time series plot: stability_analysis_results/llama3_qlora+iphone15max_ios17_reference_time_series.png Latency Stability Analysis: llama3_spinq+iphone15max_ios17 (Reference) ================================================================================ Model: llama3_spinq Device: iphone15max_ios17 Dataset Overview: - Number of samples: 72 - Date range: 2025-02-21 03:12:32+00:00 to 2025-05-15 02:43:34+00:00 Central Tendency Metrics: - Mean latency: 13118.40 ms - Median latency (P50): 12382.50 ms Dispersion Metrics: - Standard deviation: 2853.94 ms - Coefficient of variation (CV): 21.76% - Interquartile range (IQR): 680.50 ms Percentile Metrics: - P50 (median): 12382.50 ms - P90: 14481.00 ms - P95: 15865.05 ms - P99: 26265.08 ms Inter-Jitter Metrics (variability between runs): - Max/Min ratio: 2.7878 - P99/P50 ratio: 2.1211 - Mean rolling std (window=5): 1464.57 ms Throughput Metrics: - Mean TPS: 12.30 - TPS coefficient of variation: 21.24% Stability Assessment: - Overall stability score: 2.7/100 - Overall stability rating: Poor Interpretation: The benchmark shows poor stability (score: 2.7/100) with significant variation between runs (CV: 21.76%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 2.79 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 2.12 suggests occasional latency spikes that could affect tail latency sensitive applications. 
================================================================================ Generated time series plot: stability_analysis_results/llama3_spinq+iphone15max_ios17_reference_time_series.png Latency Stability Analysis: mv3_xnnq8+iphone15max_ios17 (Reference) ================================================================================ Model: mv3_xnnq8 Device: iphone15max_ios17 Dataset Overview: - Number of samples: 73 - Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00 Central Tendency Metrics: - Mean latency: 13.97 ms - Median latency (P50): 13.00 ms Dispersion Metrics: - Standard deviation: 4.74 ms - Coefficient of variation (CV): 33.93% - Interquartile range (IQR): 7.00 ms Percentile Metrics: - P50 (median): 13.00 ms - P90: 21.80 ms - P95: 22.00 ms - P99: 25.40 ms Inter-Jitter Metrics (variability between runs): - Max/Min ratio: 4.1429 - P99/P50 ratio: 1.9538 - Mean rolling std (window=5): 4.51 ms Stability Assessment: - Overall stability score: 1.2/100 - Overall stability rating: Poor Interpretation: The benchmark shows poor stability (score: 1.2/100) with significant variation between runs (CV: 33.93%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 4.14 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.95 suggests occasional latency spikes that could affect tail latency sensitive applications. 
================================================================================ Generated time series plot: stability_analysis_results/mv3_xnnq8+iphone15max_ios17_reference_time_series.png Latency Stability Analysis: mv3_coreml+iphone15max_ios17 (Reference) ================================================================================ Model: mv3_coreml Device: iphone15max_ios17 Dataset Overview: - Number of samples: 21 - Date range: 2025-05-01 03:29:21+00:00 to 2025-05-22 02:53:58+00:00 Central Tendency Metrics: - Mean latency: 1.00 ms - Median latency (P50): 1.00 ms Dispersion Metrics: - Standard deviation: 0.00 ms - Coefficient of variation (CV): 0.00% - Interquartile range (IQR): 0.00 ms Percentile Metrics: - P50 (median): 1.00 ms - P90: 1.00 ms - P95: 1.00 ms - P99: 1.00 ms Inter-Jitter Metrics (variability between runs): - Max/Min ratio: 1.0000 - P99/P50 ratio: 1.0000 - Mean rolling std (window=5): 0.00 ms Stability Assessment: - Overall stability score: 100.0/100 - Overall stability rating: Excellent Interpretation: The benchmark shows excellent stability (score: 100.0/100) with very low variation between runs (CV: 0.00%). This indicates highly consistent performance suitable for latency-sensitive applications. 
================================================================================ Generated time series plot: stability_analysis_results/mv3_coreml+iphone15max_ios17_reference_time_series.png Latency Stability Analysis: mv3_mps+iphone15max_ios17 (Reference) ================================================================================ Model: mv3_mps Device: iphone15max_ios17 Dataset Overview: - Number of samples: 72 - Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00 Central Tendency Metrics: - Mean latency: 1.03 ms - Median latency (P50): 1.00 ms Dispersion Metrics: - Standard deviation: 0.17 ms - Coefficient of variation (CV): 16.10% - Interquartile range (IQR): 0.00 ms Percentile Metrics: - P50 (median): 1.00 ms - P90: 1.00 ms - P95: 1.00 ms - P99: 2.00 ms Inter-Jitter Metrics (variability between runs): - Max/Min ratio: 2.0000 - P99/P50 ratio: 2.0000 - Mean rolling std (window=5): 0.07 ms Stability Assessment: - Overall stability score: 12.5/100 - Overall stability rating: Poor Interpretation: The benchmark shows poor stability (score: 12.5/100) with significant variation between runs (CV: 16.10%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 2.00 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 2.00 suggests occasional latency spikes that could affect tail latency sensitive applications. 
================================================================================ Generated time series plot: stability_analysis_results/mv3_mps+iphone15max_ios17_reference_time_series.png Latency Stability Analysis: llama3_qlora+iphone15_ios18 (Reference) ================================================================================ Model: llama3_qlora Device: iphone15_ios18 Dataset Overview: - Number of samples: 70 - Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00 Central Tendency Metrics: - Mean latency: 14429.20 ms - Median latency (P50): 14401.00 ms Dispersion Metrics: - Standard deviation: 593.06 ms - Coefficient of variation (CV): 4.11% - Interquartile range (IQR): 637.25 ms Percentile Metrics: - P50 (median): 14401.00 ms - P90: 14970.00 ms - P95: 15441.85 ms - P99: 16444.58 ms Inter-Jitter Metrics (variability between runs): - Max/Min ratio: 1.2195 - P99/P50 ratio: 1.1419 - Mean rolling std (window=5): 540.47 ms Throughput Metrics: - Mean TPS: 5.47 - TPS coefficient of variation: 13.24% Stability Assessment: - Overall stability score: 73.2/100 - Overall stability rating: Moderate Interpretation: The benchmark shows moderate stability (score: 73.2/100) with noticeable variation between runs (CV: 4.11%). While average performance is acceptable, occasional latency spikes may occur. The max/min ratio of 1.22 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.14 suggests occasional latency spikes that could affect tail latency sensitive applications. 
================================================================================ Generated time series plot: stability_analysis_results/llama3_qlora+iphone15_ios18_reference_time_series.png Latency Stability Analysis: llama3_spinq+iphone15_ios18 (Reference) ================================================================================ Model: llama3_spinq Device: iphone15_ios18 Dataset Overview: - Number of samples: 74 - Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00 Central Tendency Metrics: - Mean latency: 13820.34 ms - Median latency (P50): 13724.00 ms Dispersion Metrics: - Standard deviation: 662.49 ms - Coefficient of variation (CV): 4.79% - Interquartile range (IQR): 683.50 ms Percentile Metrics: - P50 (median): 13724.00 ms - P90: 14527.80 ms - P95: 14992.20 ms - P99: 15822.16 ms Inter-Jitter Metrics (variability between runs): - Max/Min ratio: 1.3302 - P99/P50 ratio: 1.1529 - Mean rolling std (window=5): 542.03 ms Throughput Metrics: - Mean TPS: 7.96 - TPS coefficient of variation: 14.45% Stability Assessment: - Overall stability score: 68.1/100 - Overall stability rating: Moderate Interpretation: The benchmark shows moderate stability (score: 68.1/100) with noticeable variation between runs (CV: 4.79%). While average performance is acceptable, occasional latency spikes may occur. The max/min ratio of 1.33 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.15 suggests occasional latency spikes that could affect tail latency sensitive applications. 
================================================================================ Generated time series plot: stability_analysis_results/llama3_spinq+iphone15_ios18_reference_time_series.png Latency Stability Analysis: mv3_xnnq8+iphone15_ios18 (Reference) ================================================================================ Model: mv3_xnnq8 Device: iphone15_ios18 Dataset Overview: - Number of samples: 73 - Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00 Central Tendency Metrics: - Mean latency: 49.85 ms - Median latency (P50): 44.00 ms Dispersion Metrics: - Standard deviation: 20.47 ms - Coefficient of variation (CV): 41.06% - Interquartile range (IQR): 12.00 ms Percentile Metrics: - P50 (median): 44.00 ms - P90: 82.00 ms - P95: 100.20 ms - P99: 121.28 ms Inter-Jitter Metrics (variability between runs): - Max/Min ratio: 3.9355 - P99/P50 ratio: 2.7564 - Mean rolling std (window=5): 16.45 ms Stability Assessment: - Overall stability score: 0.0/100 - Overall stability rating: Poor Interpretation: The benchmark shows poor stability (score: 0.0/100) with significant variation between runs (CV: 41.06%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 3.94 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 2.76 suggests occasional latency spikes that could affect tail latency sensitive applications. 
================================================================================ Generated time series plot: stability_analysis_results/mv3_xnnq8+iphone15_ios18_reference_time_series.png Latency Stability Analysis: mv3_coreml+iphone15_ios18 (Reference) ================================================================================ Model: mv3_coreml Device: iphone15_ios18 Dataset Overview: - Number of samples: 21 - Date range: 2025-05-01 03:29:21+00:00 to 2025-05-22 02:53:58+00:00 Central Tendency Metrics: - Mean latency: 1.00 ms - Median latency (P50): 1.00 ms Dispersion Metrics: - Standard deviation: 0.00 ms - Coefficient of variation (CV): 0.00% - Interquartile range (IQR): 0.00 ms Percentile Metrics: - P50 (median): 1.00 ms - P90: 1.00 ms - P95: 1.00 ms - P99: 1.00 ms Inter-Jitter Metrics (variability between runs): - Max/Min ratio: 1.0000 - P99/P50 ratio: 1.0000 - Mean rolling std (window=5): 0.00 ms Stability Assessment: - Overall stability score: 100.0/100 - Overall stability rating: Excellent Interpretation: The benchmark shows excellent stability (score: 100.0/100) with very low variation between runs (CV: 0.00%). This indicates highly consistent performance suitable for latency-sensitive applications. ================================================================================ Generated time series plot: stability_analysis_results/mv3_coreml+iphone15_ios18_reference_time_series.png Latency Stability Analysis: mv3_mps+iphone15_ios18 (Reference) ================================================================================ Model: mv3_mps Device: iphone15_ios18 Dataset Overview: - Number of samples: 72 - Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00 Central Tendency Metrics: - Mean latency: 3.75 ms - Median latency (P50): 4.00 ms Dispersion Metrics: - Standard deviation:…
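The dispersion and inter-jitter figures in the log above (CV, IQR, percentiles, max/min ratio, P99/P50 ratio, mean rolling std) can be approximated with a short sketch like the one below. It is not the code from `analyze_benchmark_stability.py`: the helper names `percentile` and `stability_metrics`, the use of sample standard deviation, and linear percentile interpolation are all assumptions made for illustration. The overall stability score is left out because its formula does not appear in the log.

```python
import statistics


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list.

    Hypothetical helper; the real script may interpolate differently.
    """
    if not sorted_vals:
        raise ValueError("need at least one sample")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def stability_metrics(latencies_ms, window=5):
    """Compute run-to-run variability metrics for latencies in time order.

    Assumes sample standard deviation (n - 1 denominator) and window >= 2.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    vals = sorted(latencies_ms)
    mean = statistics.fmean(latencies_ms)
    std = statistics.stdev(latencies_ms) if len(latencies_ms) > 1 else 0.0
    p50 = percentile(vals, 50)
    p99 = percentile(vals, 99)
    # Standard deviation over each sliding window, in the original run order.
    rolling = [
        statistics.stdev(latencies_ms[i:i + window])
        for i in range(len(latencies_ms) - window + 1)
    ]
    return {
        "mean": mean,
        "std": std,
        "cv_pct": 100.0 * std / mean if mean else 0.0,
        "iqr": percentile(vals, 75) - percentile(vals, 25),
        "p50": p50,
        "p99": p99,
        "max_min_ratio": vals[-1] / vals[0] if vals[0] else float("inf"),
        "p99_p50_ratio": p99 / p50 if p50 else float("inf"),
        "mean_rolling_std": statistics.fmean(rolling) if rolling else 0.0,
    }


if __name__ == "__main__":
    # A perfectly flat series: every dispersion metric collapses to 0 or 1.
    for name, value in stability_metrics([1.0] * 10).items():
        print(f"{name}: {value:.4f}")
```

For a constant series, this sketch yields a CV of 0% and both ratios equal to 1, which is the same pattern the log reports for the `mv3_coreml` datasets.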
1 parent 8aab7d0 commit 2269160

1 file changed

+1523
-0
lines changed

