Replies: 5 comments 8 replies
-
This is a good analysis. Let me dive into it a bit.
-
Thanks for compiling this. Looking at the next steps, are we intending to analyze the impact of warmup iterations or high iteration counts? It would be very interesting to see the data on runtime per iteration over, say, 1000 iterations run back to back. Does it stabilize? Or spike randomly? Can we rely on the median latency over a large number of iterations?
-
Amazing how we can do such analysis in OSS. ❤️ A couple of random thoughts while reading the post:
-
FYI, now that more data from public Android devices has been found, I just updated the post to incorporate the private vs. public comparison. The metrics from the new data strengthen the conclusions, indicating that private AWS devices can provide decent stability for Android benchmarking. cc: @cbilgin @kimishpatel @digantdesai
-
@guangy10 the overall conclusion makes sense. I would like to offer my views on the way forward.
-
Benchmark Infra Stability Assessment with private AWS devices
TL;DR
The analysis reveals that private AWS devices can provide acceptable stability across all tested platforms (Android, iOS), delegates (QNN, XNNPACK, CoreML, MPS), and models (Llama3.2-1b and MobileNetV3), demonstrating that our private AWS infrastructure can deliver consistent benchmarking results.
Understanding Stability Metrics
To properly assess the stability of ML model inference latency, I use several key statistical metrics:
A composite stability score (0-100 scale) is then calculated from a weighted combination of the CV, the Max/Min ratio, and the P99/P50 ratio.
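To make the metric definitions concrete, here is a minimal Python sketch of how the per-dataset statistics and the composite score could be computed. The weights and penalty scaling in `composite_stability_score` are illustrative assumptions, not the exact formula behind the numbers reported below.

```python
import numpy as np

def stability_metrics(latencies_ms):
    """Key statistics used in this analysis, from per-run latency samples (ms)."""
    x = np.asarray(latencies_ms, dtype=float)
    mean, std = x.mean(), x.std(ddof=1)
    p50, p99 = np.percentile(x, [50, 99])
    return {
        "cv_pct": 100.0 * std / mean,        # coefficient of variation
        "max_min_ratio": x.max() / x.min(),  # spread between extremes
        "p99_p50_ratio": p99 / p50,          # tail latency vs. median
    }

def composite_stability_score(m, w_cv=0.5, w_mm=0.25, w_tail=0.25):
    """Map the three metrics onto a 0-100 scale (100 = perfectly stable).
    The weights and scaling here are placeholders for illustration only."""
    penalty = (w_cv * m["cv_pct"]
               + w_mm * 100.0 * (m["max_min_ratio"] - 1.0)
               + w_tail * 100.0 * (m["p99_p50_ratio"] - 1.0))
    return max(0.0, 100.0 - penalty)
```

With this formulation, a perfectly steady run (CV = 0, Max/Min = 1, P99/P50 = 1) scores 100, and increasing jitter or tail spikes pull the score toward 0.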
Intra-primary (private) Dataset Stability Comparison
I will begin the analysis by examining the key metrics for the primary (private) dataset. This section focuses on assessing the inherent stability of our benchmarking environment before making any comparison to public infrastructure. By analyzing the key statistical metrics described above across different model and device combinations, we can establish a baseline understanding of performance consistency.
Overall Stability Summary:
Device-based Comparison:
My insights and recommendations
The analysis of latency stability across private AWS devices reveals certain patterns in performance consistency:
The intra-private analysis reveals that the private iPhone and S22 devices provide acceptable stability across all tested delegates (QNN, CoreML, MPS, XNNPACK) and models (Llama3.2-1b and MobileNetV3), demonstrating that our private AWS infrastructure can deliver consistent benchmarking results.
Inter-dataset (private & public) Stability Comparison
To assess whether private AWS devices provide better stability than their public counterparts, I conducted a detailed comparison between matching datasets from both environments. This section presents an apples-to-apples comparison of benchmark stability for identical model-device combinations, allowing us to directly evaluate the benefits of moving to private infrastructure. A scripted version of this pairwise comparison is sketched below.
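The sketch below shows one way to script the pairwise comparison, reusing the `stability_metrics()` helper from the earlier sketch; the dataset variable names in the usage comment are hypothetical placeholders.

```python
import pandas as pd

def compare_environments(private_ms, public_ms):
    """Side-by-side metric comparison for one model+device combination.
    Relies on stability_metrics() from the sketch above."""
    df = pd.DataFrame({
        "private": stability_metrics(private_ms),
        "public": stability_metrics(public_ms),
    }).T
    # Positive improvement => private is more stable on that metric (lower is better).
    df.loc["improvement_pct"] = 100.0 * (df.loc["public"] - df.loc["private"]) / df.loc["public"]
    return df

# e.g. compare_environments(llama3_spinq_s22_private, llama3_spinq_s22_public)
```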
1. llama3_spinq+s22_android13 (Private) vs. llama3_spinq+s22_android13 (Public)
Metrics Comparison:
Interpretation:
2. mv3_qnn+s22_android13 (Private) vs. mv3_qnn+s22_android13 (Public)
Metrics Comparison:
Interpretation:
3. mv3_xnnq8+s22_android13 (Private) vs. mv3_xnnq8+s22_android13 (Public)
Metrics Comparison:
Interpretation:
4. llama3_qlora+iphone15max_ios17 (Private) vs. llama3_qlora+iphone15max_ios17 (Public)
Metrics Comparison:
Interpretation:
5. mv3_xnnq8+iphone15max_ios17 (Private) vs. mv3_xnnq8+iphone15max_ios17 (Public)
Metrics Comparison:
Interpretation:
Though neither environment is ideal, the private environment shows better stability, with an 837.9% higher stability score (Private: 10.8/100 vs. Public: 1.2/100) and a 27.5% lower coefficient of variation, indicating more consistent performance than public devices.
6. mv3_coreml+iphone15max_ios17 (Private) vs. mv3_coreml+iphone15max_ios17 (Public)
Metrics Comparison:
Interpretation:
Both environments show perfect and identical stability scores.
7. mv3_mps+iphone15max_ios17 (Private) vs. mv3_mps+iphone15max_ios17 (Public)
Metrics Comparison:
Interpretation:
Overall Private vs Public Comparison:
Summary:
Private devices consistently outperform public devices on both platforms, with Android showing slightly larger performance gains and more dramatic stability improvements.
Detailed Stability Analysis on Individual Dataset - Primary (Private)
The full set of individual dataset analyses can be downloaded here. In this section I highlight detailed statistical metrics for a few selected datasets.
1. Latency Stability Analysis: llama3_spinq+s22_android13 (Primary)
2. Latency Stability Analysis: mv3_qnn+s22_android13 (Primary)
3. Latency Stability Analysis: mv3_xnnq8+s22_android13 (Primary)
4. Latency Stability Analysis: llama3_spinq+iphone15max_ios17 (Primary)
5. Latency Stability Analysis: mv3_xnnq8+iphone15max_ios17 (Primary)
6. Latency Stability Analysis: mv3_mps+iphone15max_ios17 (Primary)
7. Latency Stability Analysis: mv3_coreml+iphone15max_ios17 (Primary)
Summary of Conclusions and Next Steps
ExecuTorch Benchmarking
The analysis shows that private AWS devices provide significantly better stability for both Android and iOS benchmarking, with Android showing slightly larger performance gains and more dramatic stability improvements.
As next steps, I would suggest:
DevX Improvements
Our current benchmarking infrastructure has critical gaps that limit our ability to understand and address stability issues. These limitations are particularly problematic when trying to diagnose the root causes of performance variations we've observed across devices.
Current Gaps
Addressing these gaps is urgent for establishing a reliable benchmarking infrastructure. Without these improvements, we risk being unable to make timely decisions and basing conclusions on misleading or incomplete data.
References
Here I have attached the source data and my script in case anyone wants to reproduce the work. Please also use them as a reference when filling the infra gaps above.
The script used for analysis
Data source:
Datasets from Primary/Private AWS devices:
Benchmark Dataset with Private AWS Devices.xlsx
Datasets from Reference/Public AWS devices:
Benchmark Dataset with Public AWS Devices.xlsx
Each tab represents one dataset collected for one model+config+device combination. The data is copied from the ExecuTorch benchmark dashboard.
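For anyone re-running the analysis, a minimal sketch of loading the spreadsheets with pandas is shown below; the latency column name is an assumption and should be adjusted to match the actual sheets.

```python
import pandas as pd

# Each workbook tab holds one model+config+device dataset; sheet_name=None
# loads every tab into a {tab_name: DataFrame} dict.
private_sheets = pd.read_excel("Benchmark Dataset with Private AWS Devices.xlsx", sheet_name=None)
public_sheets = pd.read_excel("Benchmark Dataset with Public AWS Devices.xlsx", sheet_name=None)

for name, df in private_sheets.items():
    # "latency_ms" is a placeholder column name; stability_metrics() is the
    # helper sketched earlier in this post.
    print(name, stability_metrics(df["latency_ms"]))
```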