Basic information
- Board URL (official): https://www.supermicro.com/en/products/system/megadc/2u/ars-211me-fnr
- Board purchased from: (Provided by Ampere/Supermicro)
- Board purchase date: Oct 14, 2024
- Board specs (as tested): A192-32X, 512GB DDR5 (5200 MT/s)
- Board price (as tested): (If you have to ask...)
Linux/system information
```
# output of `screenfetch`
ubuntu@ubuntu:~$ screenfetch
./+o+- ubuntu@ubuntu
yyyyy- -yyyyyy+ OS: Ubuntu 24.04 noble
://+//////-yyyyyyo Kernel: aarch64 Linux 6.8.0-39-generic-64k
.++ .:/++++++/-.+sss/` Uptime: 23m
.:++o: /++++++++/:--:/- Packages: 810
o:+o+:++.`..```.-/oo+++++/ Shell: bash 5.2.21
.:+o:+o/. `+sssoo+/ Disk: 19G / 101G (20%)
.++/+:+oo+o:` /sssooo. CPU: Ampere Ampere-1a @ 192x 3.2GHz
/+++//+:`oo+o /::--:. GPU: ASPEED Technology, Inc. ASPEED Graphics Family (rev 52)
\+/+o+++`o++o ++////. RAM: 31390MiB / 522867MiB
.++.o+++oo+:` /dddhhh.
.+.o+oo:. `oddhhhh+
\+.++o+o``-````.:ohdhhhhh+
`:o+++ `ohhhhhhhhyo++os:
.o:`.syhhhhhhh/.oo++o`
/osyyyyyyo++ooo+++/
````` +oo+++o\:
`oo++.
```
```
# output of `uname -a`
Linux ubuntu 6.8.0-39-generic-64k #39-Ubuntu SMP PREEMPT_DYNAMIC Sat Jul 6 11:08:16 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
```
Benchmark results
CPU
- Geekbench 6: 1309 single / 15160 multi (result; lower performance than expected, see Geekbench w/ page sizes above 4K and the STH article)
- Geekbench 5: 958 single / 80639 multi (result)
- 3,027 Gflops at 692W / 4.37 Gflops/W (geerlingguy/top500-benchmark HPL result)
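The HPL number above comes from the geerlingguy/top500-benchmark playbook. A minimal sketch of a local run is below; it assumes Ansible and git are installed and that the repo still uses the example.config.yml / main.yml layout, so check its README before copying this verbatim:

```
# Clone the HPL benchmark playbook and copy the example config.
git clone https://github.com/geerlingguy/top500-benchmark.git
cd top500-benchmark
cp example.config.yml config.yml

# Edit config.yml for this system (SSH target, RAM to dedicate), then run it.
ansible-playbook main.yml
```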
Power
- Idle power draw (at wall): 199 W (30 W CPU / 78 W IO, 108 W SoC package power from `sensors`)
- Maximum simulated power draw (`stress-ng --matrix 0`): 500 W
- During Geekbench multicore benchmark: 300-600 W (depending on Geekbench version)
- During `top500` HPL benchmark: 692 W
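For reproducibility, this is roughly how those figures are gathered: the wall readings come from an external power meter, the package/rail numbers from lm-sensors, and the 60-second stress duration below is just an arbitrary choice for this sketch:

```
# SoC package / CPU / IO power rails (requires lm-sensors).
sensors

# Load all cores with the matrix stressor while watching the wall meter.
stress-ng --matrix 0 --timeout 60s --metrics-brief
```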
Disk
Samsung NVMe SSD - 983 DCT M.2 960GB
Benchmark | Result |
---|---|
iozone 4K random read | 50.35 MB/s |
iozone 4K random write | 216.04 MB/s |
iozone 1M random read | 2067.82 MB/s |
iozone 1M random write | 1295.13 MB/s |
iozone 1M sequential read | 2098.31 MB/s |
iozone 1M sequential write | 1291.07 MB/s |
```
wget https://raw.githubusercontent.com/geerlingguy/pi-cluster/master/benchmarks/disk-benchmark.sh
chmod +x disk-benchmark.sh
sudo MOUNT_PATH=/ TEST_SIZE=1g ./disk-benchmark.sh
```
Samsung NVMe SSD - MZQL21T9HCJR-00A07
Specs: https://semiconductor.samsung.com/ssd/datacenter-ssd/pm9a3/mzql21t9hcjr-00a07/
Single disk
Benchmark | Result |
---|---|
iozone 4K random read | 60.19 MB/s |
iozone 4K random write | 284.72 MB/s |
iozone 1M random read | 3777.29 MB/s |
iozone 1M random write | 2686.80 MB/s |
iozone 1M sequential read | 3773.44 MB/s |
iozone 1M sequential write | 2680.90 MB/s |
RAID 0 (mdadm)
Benchmark | Result |
---|---|
iozone 4K random read | 58.05 MB/s |
iozone 4K random write | 250.06 MB/s |
iozone 1M random read | 5444.03 MB/s |
iozone 1M random write | 4411.07 MB/s |
iozone 1M sequential read | 7120.75 MB/s |
iozone 1M sequential write | 4458.30 MB/s |
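For the RAID 0 run, the two PM9A3 drives were striped with mdadm; a minimal sketch of that setup is below. The device names (/dev/nvme0n1, /dev/nvme1n1) and mount point are placeholders for illustration, not necessarily the exact ones used:

```
# Stripe the two NVMe drives into a single RAID 0 array.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1

# Format, mount, and point disk-benchmark.sh at the array.
sudo mkfs.ext4 /dev/md0
sudo mkdir -p /mnt/raid0
sudo mount /dev/md0 /mnt/raid0
sudo MOUNT_PATH=/mnt/raid0 TEST_SIZE=1g ./disk-benchmark.sh
```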
Network
`iperf3` results:

- `iperf3 -c $SERVER_IP`: 21.4 Gbps
- `iperf3 -c $SERVER_IP --reverse`: 18.8 Gbps
- `iperf3 -c $SERVER_IP --bidir`: 8.08 Gbps up, 22.2 Gbps down
Tested on one of the two built-in Broadcom BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
interfaces, to my HL15 Arm NAS (see: geerlingguy/arm-nas#16), routed through a Mikrotik 25G Cloud Router.
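If you want to reproduce the network numbers, this is the basic sequence (the server side runs on the remote NAS; `$SERVER_IP` is its address):

```
# On the remote NAS: start an iperf3 listener.
iperf3 -s

# On this server: forward, reverse, and bidirectional runs.
iperf3 -c $SERVER_IP
iperf3 -c $SERVER_IP --reverse
iperf3 -c $SERVER_IP --bidir
```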
GPU
Did not test - this server doesn't have a GPU, just the ASPEED integrated BMC VGA graphics, which are not suitable for much GPU-accelerated gaming or LLMs, lol. Just render it on CPU!
Memory
`tinymembench` results:
Click to expand memory benchmark result
```
tinymembench v0.4.10 (simple benchmark for memory throughput and latency)
==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==
==========================================================================
C copy backwards : 14199.7 MB/s (0.3%)
C copy backwards (32 byte blocks) : 13871.7 MB/s
C copy backwards (64 byte blocks) : 13879.6 MB/s (0.2%)
C copy : 13890.6 MB/s (0.2%)
C copy prefetched (32 bytes step) : 14581.4 MB/s
C copy prefetched (64 bytes step) : 14613.8 MB/s
C 2-pass copy : 10819.4 MB/s
C 2-pass copy prefetched (32 bytes step) : 11313.6 MB/s
C 2-pass copy prefetched (64 bytes step) : 11417.4 MB/s
C fill : 31260.2 MB/s
C fill (shuffle within 16 byte blocks) : 31257.1 MB/s
C fill (shuffle within 32 byte blocks) : 31263.1 MB/s
C fill (shuffle within 64 byte blocks) : 31260.9 MB/s
NEON 64x2 COPY : 14464.3 MB/s (0.9%)
NEON 64x2x4 COPY : 13694.9 MB/s
NEON 64x1x4_x2 COPY : 12444.6 MB/s
NEON 64x2 COPY prefetch x2 : 14886.9 MB/s
NEON 64x2x4 COPY prefetch x1 : 14954.4 MB/s
NEON 64x2 COPY prefetch x1 : 14892.3 MB/s
NEON 64x2x4 COPY prefetch x1 : 14955.5 MB/s
---
standard memcpy : 14141.9 MB/s
standard memset : 31268.0 MB/s
---
NEON LDP/STP copy : 13775.1 MB/s (0.7%)
NEON LDP/STP copy pldl2strm (32 bytes step) : 14267.3 MB/s
NEON LDP/STP copy pldl2strm (64 bytes step) : 14340.9 MB/s
NEON LDP/STP copy pldl1keep (32 bytes step) : 14670.0 MB/s
NEON LDP/STP copy pldl1keep (64 bytes step) : 14644.7 MB/s
NEON LD1/ST1 copy : 13756.1 MB/s
NEON STP fill : 31262.2 MB/s
NEON STNP fill : 31265.7 MB/s
ARM LDP/STP copy : 14454.0 MB/s (0.6%)
ARM STP fill : 31265.6 MB/s
ARM STNP fill : 31266.0 MB/s
==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================
block size : single random read / dual random read, [MADV_NOHUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 1.1 ns / 1.6 ns
262144 : 1.7 ns / 2.0 ns
524288 : 1.9 ns / 2.2 ns
1048576 : 2.1 ns / 2.2 ns
2097152 : 3.0 ns / 3.3 ns
4194304 : 22.6 ns / 33.9 ns
8388608 : 33.7 ns / 44.3 ns
16777216 : 39.3 ns / 48.0 ns
33554432 : 42.1 ns / 49.4 ns
67108864 : 49.0 ns / 60.2 ns
block size : single random read / dual random read, [MADV_HUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 1.1 ns / 1.6 ns
262144 : 1.7 ns / 2.0 ns
524288 : 1.9 ns / 2.2 ns
1048576 : 2.1 ns / 2.2 ns
2097152 : 3.0 ns / 3.3 ns
4194304 : 22.6 ns / 33.9 ns
8388608 : 33.7 ns / 44.3 ns
16777216 : 39.3 ns / 47.9 ns
33554432 : 42.1 ns / 49.4 ns
67108864 : 49.9 ns / 61.9 ns
```
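tinymembench is a quick build from source; a sketch of the invocation, assuming gcc and make are installed:

```
# Build and run tinymembench from source.
git clone https://github.com/ssvb/tinymembench.git
cd tinymembench
make
./tinymembench
```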
`sbc-bench` results
Run sbc-bench and paste a link to the results here: https://0x0.st/X0gc.bin
See: ThomasKaiser/sbc-bench#105
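The invocation was roughly the standard sbc-bench review-mode run; flags can change between releases, so treat this as a sketch and check the upstream README:

```
# Fetch and run sbc-bench in review mode (-r); the result link above is the uploaded output.
wget https://raw.githubusercontent.com/ThomasKaiser/sbc-bench/master/sbc-bench.sh
sudo /bin/bash ./sbc-bench.sh -r
```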
Phoronix Test Suite
Results from pi-general-benchmark.sh:
- pts/encode-mp3: 11.248 sec
- pts/x264 4K: 69.49 fps
- pts/x264 1080p: 160.75 fps
- pts/phpbench: 567108
- pts/build-linux-kernel (defconfig): 50.101 sec
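These come from the pi-general-benchmark.sh wrapper around the Phoronix Test Suite; a sketch of fetching and running it, assuming it still lives alongside disk-benchmark.sh in the pi-cluster repo:

```
# Fetch and run the Phoronix Test Suite wrapper script.
wget https://raw.githubusercontent.com/geerlingguy/pi-cluster/master/benchmarks/pi-general-benchmark.sh
chmod +x pi-general-benchmark.sh
./pi-general-benchmark.sh
```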
Additional benchmarks
QEMU Coremark
The Ampere team suggested running this; it emulates running a large number of Arm VMs, each running CoreMark inside, which is a good proxy for the kind of performance you can get from VMs/containers on this system: https://github.com/AmpereComputing/qemu-coremark
```
ubuntu@ubuntu:~/qemu-coremark$ ./run_pts.sh 2
47 instances of pts/coremark running in parallel in arm64 VMs!
Round 1 - Total CoreMark Score is: 4697344
Round 2 - Total CoreMark Score is: 4684524
```
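Setup is roughly just cloning Ampere's repo and calling the script; the meaning of the numeric argument isn't documented here, so this simply mirrors the run shown above:

```
# Clone Ampere's QEMU CoreMark harness and repeat the run above.
git clone https://github.com/AmpereComputing/qemu-coremark.git
cd qemu-coremark
./run_pts.sh 2
```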
llama.cpp (Ampere-optimized)
See: https://github.com/AmpereComputingAI/llama.cpp (I also have an email from Ampere with some testing notes).
Ollama (generic LLMs)
See: https://github.com/geerlingguy/ollama-benchmark?tab=readme-ov-file#findings
System | CPU/GPU | Model | Eval Rate |
---|---|---|---|
AmpereOne A192-32X (192 core - 512GB) | CPU | llama3.2:3b | 23.52 Tokens/s |
AmpereOne A192-32X (192 core - 512GB) | CPU | llama3.1:8b | 17.47 Tokens/s |
AmpereOne A192-32X (192 core - 512GB) | CPU | llama3.1:70b | 3.86 Tokens/s |
AmpereOne A192-32X (192 core - 512GB) | CPU | llama3.1:405b | 0.90 Tokens/s |
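These follow the methodology in geerlingguy/ollama-benchmark; the quick manual way to get the same eval-rate figure is Ollama's `--verbose` output (the install one-liner is Ollama's standard script, and the model tag matches the table above):

```
# Install Ollama, then run a model with --verbose to print the eval rate.
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.2:3b --verbose "Why is the sky blue?"
```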
yolo-v5
See: https://github.com/AmpereComputingAI/yolov5-demo (maybe test it on a 4K60 video, see how it fares).