Description
Hi, I'm playing with RecursiveFactorization and am finding that single-precision factorizations are much faster than the cubic scaling would predict, while Float16 seems slower than the prediction. I was hoping the Apple silicon and Julia support for Float16 in hardware would show up better, but it is still far better than using OpenBLAS LU in Float16.
I'd expect time(double) = 2 × time(single) = 4 × time(half) to hold for larger problems, but I'm not seeing that. Is there an algorithmic reason for this? Should I be setting parameters in the factorization to something other than the default values?
I ran into this while using this package for my own work.
This post is my happy story.
If you have insight or advice, I'd appreciate it. The results come from an M2 Pro Mac and Julia 1.10-beta1.
I ran the script below to factor a random N × N matrix at three precisions and got this:
julia> speed_test(2048)
Time for Float64 = 3.22886e-01
Time for Float32 = 1.70550e-02
Time for Float16 = 5.77817e-02
julia> speed_test(4096)
Time for Float64 = 3.68761e+00
Time for Float32 = 2.47510e-01
Time for Float16 = 4.73640e-01
julia> speed_test(8192)
Time for Float64 = 3.87592e+01
Time for Float32 = 3.12564e+00
Time for Float16 = 6.68709e+00
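Working out the ratios from the timings above makes the gap concrete (a throwaway snippet, not part of the script): the 2×/4× model predicts a ratio of about 2 at each step down in precision, but the measured Float64/Float32 ratios come out well above 2, and the Float32/Float16 ratios come out below 1, i.e. Float16 is slower than Float32.
t64 = [3.22886e-01, 3.68761e+00, 3.87592e+01]  # Float64 times for N = 2048, 4096, 8192
t32 = [1.70550e-02, 2.47510e-01, 3.12564e+00]  # Float32 times
t16 = [5.77817e-02, 4.73640e-01, 6.68709e+00]  # Float16 times
for (N, r1, r2) in zip([2048, 4096, 8192], t64 ./ t32, t32 ./ t16)
    # naive expectation: both ratios should be about 2
    println("N = $N: Float64/Float32 = $(round(r1; digits = 1)), Float32/Float16 = $(round(r2; digits = 2))")
end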
using Random
using BenchmarkTools
using RecursiveFactorization
using LinearAlgebra

# Factor the same random matrix at three precisions and report the time for each.
function speed_test(N = 2048)
    Random.seed!(46071)
    A = rand(N, N)
    for T in [Float64, Float32, Float16]
        AT = T.(A)                 # convert to the target precision
        tlu = @belapsed rlu($AT)
        println("Time for $T = $tlu")
    end
end

# Copy the input and factor the copy in place with RecursiveFactorization,
# passing a preallocated pivot vector (length = number of columns; B is square here).
function rlu(C)
    B = copy(C)
    CF = RecursiveFactorization.lu!(B, Vector{Int}(undef, size(B, 2)))
    return CF
end
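For the OpenBLAS comparison I mentioned up top, here is a rough sketch of how it can be timed, reusing rlu from the script above (compare_lu is just an illustrative name, and as far as I can tell the stock lu falls back to Julia's generic LU for Float16 rather than calling OpenBLAS):
# Run after the script above so that rlu and the packages are in scope.
function compare_lu(N = 2048, T = Float16)
    A = T.(rand(N, N))
    t_stock = @belapsed lu($A)    # LinearAlgebra's default lu
    t_rf    = @belapsed rlu($A)   # RecursiveFactorization via the rlu helper above
    println("N = $N, $T: stock lu = $t_stock, rlu = $t_rf")
end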