AmpereOne A192-32X (Supermicro) #52

Description

[Image: DSC01611]

Basic information

Linux/system information

# output of `screenfetch`
ubuntu@ubuntu:~$ screenfetch 
                          ./+o+-       ubuntu@ubuntu
                  yyyyy- -yyyyyy+      OS: Ubuntu 24.04 noble
               ://+//////-yyyyyyo      Kernel: aarch64 Linux 6.8.0-39-generic-64k
           .++ .:/++++++/-.+sss/`      Uptime: 23m
         .:++o:  /++++++++/:--:/-      Packages: 810
        o:+o+:++.`..```.-/oo+++++/     Shell: bash 5.2.21
       .:+o:+o/.          `+sssoo+/    Disk: 19G / 101G (20%)
  .++/+:+oo+o:`             /sssooo.   CPU: Ampere Ampere-1a @ 192x 3.2GHz
 /+++//+:`oo+o               /::--:.   GPU: ASPEED Technology, Inc. ASPEED Graphics Family (rev 52)
 \+/+o+++`o++o               ++////.   RAM: 31390MiB / 522867MiB
  .++.o+++oo+:`             /dddhhh.  
       .+.o+oo:.          `oddhhhh+   
        \+.++o+o``-````.:ohdhhhhh+    
         `:o+++ `ohhhhhhhhyo++os:     
           .o:`.syhhhhhhh/.oo++o`     
               /osyyyyyyo++ooo+++/    
                   ````` +oo+++o\:    
                          `oo++.     

# output of `uname -a`
Linux ubuntu 6.8.0-39-generic-64k #39-Ubuntu SMP PREEMPT_DYNAMIC Sat Jul  6 11:08:16 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Benchmark results

CPU

Power

  • Idle power draw (at wall): 199 W (sensors report 30 W CPU + 78 W IO = 108 W SoC package power)
  • Maximum simulated power draw (stress-ng --matrix 0): 500 W (see the example invocation after this list)
  • During Geekbench multicore benchmark: 300-600 W (depending on Geekbench version)
  • During top500 HPL benchmark: 692 W
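
For reference, a minimal way to reproduce the synthetic max-power test while watching SoC package power, a sketch using stress-ng and lm-sensors (sensor labels vary by platform/BMC):

# one matrix stress worker per core (0 = all CPUs), run for 5 minutes
sudo apt install -y stress-ng lm-sensors
stress-ng --matrix 0 --timeout 5m &

# watch reported package power while the stressor runs
watch -n 2 sensors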

Disk

Samsung NVMe SSD - 983 DCT M.2 960GB

Benchmark                    Result
iozone 4K random read        50.35 MB/s
iozone 4K random write       216.04 MB/s
iozone 1M random read        2067.82 MB/s
iozone 1M random write       1295.13 MB/s
iozone 1M sequential read    2098.31 MB/s
iozone 1M sequential write   1291.07 MB/s

Benchmarks run with:
wget https://raw.githubusercontent.com/geerlingguy/pi-cluster/master/benchmarks/disk-benchmark.sh
chmod +x disk-benchmark.sh
sudo MOUNT_PATH=/ TEST_SIZE=1g ./disk-benchmark.sh
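
The script wraps iozone; a roughly equivalent direct invocation (an approximation of the script's flags, not copied from it) testing 4K and 1M records with O_DIRECT on a 1 GB file would be:

iozone -e -I -a -s 1g -r 4k -r 1024k -i 0 -i 1 -i 2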

Samsung NVMe SSD - MZQL21T9HCJR-00A07

Specs: https://semiconductor.samsung.com/ssd/datacenter-ssd/pm9a3/mzql21t9hcjr-00a07/

Single disk
Benchmark                    Result
iozone 4K random read        60.19 MB/s
iozone 4K random write       284.72 MB/s
iozone 1M random read        3777.29 MB/s
iozone 1M random write       2686.80 MB/s
iozone 1M sequential read    3773.44 MB/s
iozone 1M sequential write   2680.90 MB/s

RAID 0 (mdadm)
Benchmark                    Result
iozone 4K random read        58.05 MB/s
iozone 4K random write       250.06 MB/s
iozone 1M random read        5444.03 MB/s
iozone 1M random write       4411.07 MB/s
iozone 1M sequential read    7120.75 MB/s
iozone 1M sequential write   4458.30 MB/s
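
The RAID 0 numbers are from an mdadm stripe across the PM9A3 drives; a sketch of how such an array could be assembled (device names and drive count here are placeholders, not necessarily what was used on this system):

# stripe two NVMe namespaces into one md device, then format and mount it
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
sudo mkfs.ext4 /dev/md0
sudo mount /dev/md0 /mnt/raid0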

Network

iperf3 results:

  • iperf3 -c $SERVER_IP: 21.4 Gbps
  • iperf3 -c $SERVER_IP --reverse: 18.8 Gbps
  • iperf3 -c $SERVER_IP --bidir: 8.08 Gbps up, 22.2 Gbps down

Tested on one of the two built-in Broadcom BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller interfaces, connected to my HL15 Arm NAS (see: geerlingguy/arm-nas#16) and routed through a MikroTik 25G Cloud Router.
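
Basic test layout, for reference (server side runs on the NAS; the client flags are the ones listed above, plus -P for parallel streams if you want to push closer to 25 GbE line rate):

# on the NAS (server side)
iperf3 -s

# on the AmpereOne (client side)
iperf3 -c $SERVER_IP
iperf3 -c $SERVER_IP --reverse
iperf3 -c $SERVER_IP --bidir
iperf3 -c $SERVER_IP -P 4    # multiple parallel streams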

GPU

Did not test - this server doesn't have a discrete GPU, just the ASPEED integrated BMC VGA graphics, which aren't suitable for GPU-accelerated gaming or LLMs, lol. Just render it on CPU!

Memory

tinymembench results:

tinymembench v0.4.10 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :  14199.7 MB/s (0.3%)
 C copy backwards (32 byte blocks)                    :  13871.7 MB/s
 C copy backwards (64 byte blocks)                    :  13879.6 MB/s (0.2%)
 C copy                                               :  13890.6 MB/s (0.2%)
 C copy prefetched (32 bytes step)                    :  14581.4 MB/s
 C copy prefetched (64 bytes step)                    :  14613.8 MB/s
 C 2-pass copy                                        :  10819.4 MB/s
 C 2-pass copy prefetched (32 bytes step)             :  11313.6 MB/s
 C 2-pass copy prefetched (64 bytes step)             :  11417.4 MB/s
 C fill                                               :  31260.2 MB/s
 C fill (shuffle within 16 byte blocks)               :  31257.1 MB/s
 C fill (shuffle within 32 byte blocks)               :  31263.1 MB/s
 C fill (shuffle within 64 byte blocks)               :  31260.9 MB/s
 NEON 64x2 COPY                                       :  14464.3 MB/s (0.9%)
 NEON 64x2x4 COPY                                     :  13694.9 MB/s
 NEON 64x1x4_x2 COPY                                  :  12444.6 MB/s
 NEON 64x2 COPY prefetch x2                           :  14886.9 MB/s
 NEON 64x2x4 COPY prefetch x1                         :  14954.4 MB/s
 NEON 64x2 COPY prefetch x1                           :  14892.3 MB/s
 NEON 64x2x4 COPY prefetch x1                         :  14955.5 MB/s
 ---
 standard memcpy                                      :  14141.9 MB/s
 standard memset                                      :  31268.0 MB/s
 ---
 NEON LDP/STP copy                                    :  13775.1 MB/s (0.7%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :  14267.3 MB/s
 NEON LDP/STP copy pldl2strm (64 bytes step)          :  14340.9 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :  14670.0 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :  14644.7 MB/s
 NEON LD1/ST1 copy                                    :  13756.1 MB/s
 NEON STP fill                                        :  31262.2 MB/s
 NEON STNP fill                                       :  31265.7 MB/s
 ARM LDP/STP copy                                     :  14454.0 MB/s (0.6%)
 ARM STP fill                                         :  31265.6 MB/s
 ARM STNP fill                                        :  31266.0 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    0.0 ns          /     0.0 ns 
    131072 :    1.1 ns          /     1.6 ns 
    262144 :    1.7 ns          /     2.0 ns 
    524288 :    1.9 ns          /     2.2 ns 
   1048576 :    2.1 ns          /     2.2 ns 
   2097152 :    3.0 ns          /     3.3 ns 
   4194304 :   22.6 ns          /    33.9 ns 
   8388608 :   33.7 ns          /    44.3 ns 
  16777216 :   39.3 ns          /    48.0 ns 
  33554432 :   42.1 ns          /    49.4 ns 
  67108864 :   49.0 ns          /    60.2 ns 

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    0.0 ns          /     0.0 ns 
    131072 :    1.1 ns          /     1.6 ns 
    262144 :    1.7 ns          /     2.0 ns 
    524288 :    1.9 ns          /     2.2 ns 
   1048576 :    2.1 ns          /     2.2 ns 
   2097152 :    3.0 ns          /     3.3 ns 
   4194304 :   22.6 ns          /    33.9 ns 
   8388608 :   33.7 ns          /    44.3 ns 
  16777216 :   39.3 ns          /    47.9 ns 
  33554432 :   42.1 ns          /    49.4 ns 
  67108864 :   49.9 ns          /    61.9 ns 
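
tinymembench isn't packaged in Ubuntu; building and running it from source (upstream repo shown, any recent GCC works) is roughly:

git clone https://github.com/ssvb/tinymembench.git
cd tinymembench
make
./tinymembench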

sbc-bench results

sbc-bench results: https://0x0.st/X0gc.bin

See: ThomasKaiser/sbc-bench#105
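
For reference, a typical way to fetch and run sbc-bench (a sketch, not necessarily the exact invocation used here; check the project README for current options):

wget https://raw.githubusercontent.com/ThomasKaiser/sbc-bench/master/sbc-bench.sh
chmod +x sbc-bench.sh
sudo ./sbc-bench.sh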

Phoronix Test Suite

Results from pi-general-benchmark.sh:

  • pts/encode-mp3: 11.248 sec
  • pts/x264 4K: 69.49 fps
  • pts/x264 1080p: 160.75 fps
  • pts/phpbench: 567108
  • pts/build-linux-kernel (defconfig): 50.101 sec
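
The same profiles can also be run individually with a stock Phoronix Test Suite install (profile names as listed above; this is the generic PTS invocation, not the wrapper script):

phoronix-test-suite benchmark pts/encode-mp3 pts/x264 pts/phpbench pts/build-linux-kernel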

Additional benchmarks

QEMU Coremark

The Ampere team suggested running this; it boots many Arm VMs and runs CoreMark inside each one, a good proxy for the kind of performance you can expect from VMs/containers on this system: https://github.com/AmpereComputing/qemu-coremark

ubuntu@ubuntu:~/qemu-coremark$ ./run_pts.sh 2
47 instances of pts/coremark running in parallel in arm64 VMs!
Round 1 - Total CoreMark Score is: 4697344
Round 2 - Total CoreMark Score is: 4684524
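
To reproduce (repo linked above; judging by the output, the numeric argument to run_pts.sh appears to be the number of rounds):

git clone https://github.com/AmpereComputing/qemu-coremark.git
cd qemu-coremark
./run_pts.sh 2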

llama.cpp (Ampere-optimized)

See: https://github.com/AmpereComputingAI/llama.cpp (I also have an email from Ampere with some testing notes).

Ollama (generic LLMs)

See: https://github.com/geerlingguy/ollama-benchmark?tab=readme-ov-file#findings

System                                   CPU/GPU   Model           Eval Rate
AmpereOne A192-32X (192 core - 512GB)    CPU       llama3.2:3b     23.52 Tokens/s
AmpereOne A192-32X (192 core - 512GB)    CPU       llama3.1:8b     17.47 Tokens/s
AmpereOne A192-32X (192 core - 512GB)    CPU       llama3.1:70b    3.86 Tokens/s
AmpereOne A192-32X (192 core - 512GB)    CPU       llama3.1:405b   0.90 Tokens/s
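
The eval rates above are the "eval rate" figure Ollama itself reports; with Ollama installed you can read the same number off a verbose run (one model shown; the linked ollama-benchmark repo automates runs like this across models):

ollama pull llama3.1:8b
ollama run llama3.1:8b --verbose "Why is the sky blue?"
# stats printed after the response include "eval rate: ... tokens/s"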

yolo-v5

See: https://github.com/AmpereComputingAI/yolov5-demo (maybe test it on a 4K60 video to see how it fares).
