Diagonal Matrix Mult is Slower than Dense #1483

@andrew-saydjari

While building a MWE for my core functionality, I noticed that, when compiled with Reactant, a simple matrix-vector multiply using a Diagonal matrix is more than 5x slower than the dense matrix-vector multiply (median 246 ms vs. 44 ms in the benchmarks below). One can of course convert this to a broadcasted vector multiply, whereby I can get Reactant to within 20% of my standard CPU performance, but it would be nice to keep the same syntax and have Reactant handle Diagonal types more seamlessly. My motivation is that my actual production code uses custom types that are low-rank representations of the matrices, with overloaded definitions of matrix multiplication that often involve diagonal components.

using Reactant, Random, LinearAlgebra, PrettyChairmarks

function core_mul_dense(AinvVIinvtX,AinvVIinv,VAinvVIinvtX,Ainv,V,x,out)
    mul!(AinvVIinvtX,AinvVIinv',x)
    mul!(VAinvVIinvtX,V,AinvVIinvtX,-1,0)
    VAinvVIinvtX .+= x
    mul!(out,Ainv,VAinvVIinvtX)
    return
end

rng = Xoshiro(122)
n = 8000
ln = 50
x = randn(rng,n)
AinvVIinv = randn(rng,n,ln)
V = randn(rng,n,ln)
Ax = randn(rng,n)
Ainv = Diagonal(Ax)
AinvDense = Matrix(Ainv)
Ainv_vec = reshape(Ax,:,1)
AinvVIinvtX = zeros(ln)
VAinvVIinvtX = zeros(n)
out = zeros(n);

## Diagonal matrix-vector multiply (plain Julia, no Reactant)
@bs core_mul_dense(AinvVIinvtX,AinvVIinv,VAinvVIinvtX,Ainv,V,x,out) seconds=3

# Chairmarks.Benchmark: 9585 samples with 1 evaluation.
#  Range (min … max):  234.913 μs …   4.769 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     288.765 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   298.449 μs ± 124.596 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

## Fully dense matrix-vector multiply (plain Julia, no Reactant)
@bs core_mul_dense(AinvVIinvtX,AinvVIinv,VAinvVIinvtX,AinvDense,V,x,out) seconds=10

# Chairmarks.Benchmark: 202 samples with 1 evaluation.
#  Range (min … max):  44.806 ms … 56.304 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     46.955 ms              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   47.470 ms ±  1.750 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

Reactant.set_default_backend("cpu")

xR = Reactant.ConcreteRArray(x)
AinvVIinvR = Reactant.ConcreteRArray(AinvVIinv)
VR = Reactant.ConcreteRArray(V)
AinvDenseR = Reactant.ConcreteRArray(AinvDense)
AinvVIinvtXR = Reactant.ConcreteRArray(AinvVIinvtX)
VAinvVIinvtXR = Reactant.ConcreteRArray(VAinvVIinvtX)
outR = Reactant.ConcreteRArray(out);

f = @compile sync=true core_mul_dense(AinvVIinvtXR,AinvVIinvR,VAinvVIinvtXR,AinvDenseR,VR,xR,outR)

## Reactant fully dense matrix-vector multiply
@bs f(AinvVIinvtXR,AinvVIinvR,VAinvVIinvtXR,AinvDenseR,VR,xR,outR) seconds=10

# Chairmarks.Benchmark: 204 samples with 1 evaluation.
#  Range (min … max):  42.615 ms … 60.736 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     44.233 ms              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   46.662 ms ±  4.344 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

## Reactant diagonal matrix-vector multiply
AinvR = Reactant.to_rarray(Ainv);

f = @compile sync=true core_mul_dense(AinvVIinvtXR,AinvVIinvR,VAinvVIinvtXR,AinvR,VR,xR,outR)

@bs f(AinvVIinvtXR,AinvVIinvR,VAinvVIinvtXR,AinvR,VR,xR,outR) seconds=10

# Chairmarks.Benchmark: 37 samples with 1 evaluation.
#  Range (min … max):  239.017 ms … 394.568 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     246.455 ms               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   273.044 ms ±  49.257 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
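
For reference, the broadcasted workaround mentioned above can be sketched roughly as follows. This is a minimal sketch, not the exact code behind the "within 20%" figure: the name `core_mul_bcast` is hypothetical, it assumes the diagonal is passed as a plain vector (`Ax` in the setup above) rather than as a `Diagonal`, and it uses a small problem size so it runs standalone without Reactant.

```julia
using LinearAlgebra, Random

# Hypothetical variant of core_mul_dense: the final mul! with Ainv is replaced
# by an elementwise product with the diagonal entries (assumed to be a Vector).
function core_mul_bcast(AinvVIinvtX, AinvVIinv, VAinvVIinvtX, Ainv_diag, V, x, out)
    mul!(AinvVIinvtX, AinvVIinv', x)            # AinvVIinvtX = AinvVIinv' * x
    mul!(VAinvVIinvtX, V, AinvVIinvtX, -1, 0)   # VAinvVIinvtX = -V * AinvVIinvtX
    VAinvVIinvtX .+= x                          # VAinvVIinvtX = x - V * AinvVIinvtX
    out .= Ainv_diag .* VAinvVIinvtX            # same result as Diagonal(Ainv_diag) * VAinvVIinvtX
    return
end

# Small standalone check against the Diagonal formulation
rng = Xoshiro(122)
n, ln = 8, 3
x = randn(rng, n)
AinvVIinv = randn(rng, n, ln)
V = randn(rng, n, ln)
Ax = randn(rng, n)
out = zeros(n)

core_mul_bcast(zeros(ln), AinvVIinv, zeros(n), Ax, V, x, out)
ref = Diagonal(Ax) * (x - V * (AinvVIinv' * x))
```

Here `out` and `ref` should agree up to floating-point rounding. The point of the issue is that this rewrite should not be necessary: ideally the `Diagonal` call signature would compile to something equally cheap.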
