
Benchmarking and Accuracy Documentation #7


Description

@mdhaber

Accurate evaluation of special functions (e.g. distribution functions, moments, entropy) with finite-precision arithmetic is hard, and expecting results accurate to machine precision for arbitrary machine-representable inputs is impractical. At the time of writing, many SciPy stats and special functions produce inaccurate results without any warning in the documentation or during execution; users typically report accuracy problems individually as "bugs", and occasionally we make improvements. Historically, there has not been a long-term, bird's-eye plan. We want to do better by:

  • providing documentation about the relationship between input values and result accuracy for each method of each distribution,
  • returning error estimates where possible,
  • returning NaNs rather than garbage, and
  • tracking the accuracy over time with benchmarks.

Only results that are inconsistent with the documented accuracy will be "bugs"; the rest are opportunities for improvement.

There are two things that need to happen in SciPy before we can completely satisfy these goals.

  • The distribution infrastructure needs a way of returning error estimate information. For default implementations of methods, this could be as simple as returning the error estimate provided by the numerical quadrature/inversion/minimization routine, assuming that the inaccuracy of the underlying functions is small in comparison. (I don't think it's practical to propagate errors with interval arithmetic or anything.) However, the idea of adding an error attribute to the returned array has not really caught on, so we just don't have a way of doing that; see the sketch after this list for roughly what such a return value could look like.
  • The distribution infrastructure needs to be translated to the array API so we can run mparrays through it. The idea is that we can (at least partially) automate the process of generating reference values by evaluating functions with arbitrary precision arithmetic.
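
As a rough illustration of the first point, here is a minimal sketch of what a quadrature-based default method could return, assuming the error estimate is simply the one reported by `scipy.integrate.quad`. The `CDFResult` and `numerical_cdf` names are hypothetical, not part of any existing or proposed SciPy API, and the reported error covers the quadrature step only (it assumes the PDF itself is accurate).

```python
# Minimal sketch: a default CDF implementation that passes along the absolute
# error estimate reported by scipy.integrate.quad. `CDFResult` and
# `numerical_cdf` are hypothetical names, not part of any SciPy API.
from dataclasses import dataclass

import numpy as np
from scipy import integrate, stats


@dataclass
class CDFResult:
    value: float  # the computed CDF value
    error: float  # absolute error estimate of the quadrature step only


def numerical_cdf(pdf, x, support=(-np.inf, np.inf)):
    value, abserr = integrate.quad(pdf, support[0], x)
    return CDFResult(value=value, error=abserr)


# Example: CDF of the standard normal at x = 1 by integrating its PDF.
res = numerical_cdf(stats.norm.pdf, 1.0)
print(res.value, res.error)  # ~0.841345, with quad's error estimate
```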

In the meantime, though, I think we can plan for what we want to do with those features when they are available. For instance, we can manually add ReferenceDistributions and use those to run benchmarks and document method accuracy.
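
As a very rough sketch (not SciPy's actual reference-distribution interface; all names below are illustrative), a hand-written reference distribution could evaluate its methods with mpmath at elevated precision, and a benchmark could then measure the relative error of the corresponding double-precision result against it:

```python
# Rough sketch of a hand-written reference distribution. The class and method
# names are illustrative; the reference CDF is evaluated with mpmath at 50
# significant digits.
import mpmath
from scipy import stats


class ReferenceNormal:
    """Standard normal distribution with a high-precision CDF."""

    def cdf(self, x):
        with mpmath.workdps(50):
            x = mpmath.mpf(float(x))
            return 0.5 * mpmath.erfc(-x / mpmath.sqrt(2))


def relative_error(observed, reference):
    # Compare a double-precision result against the high-precision reference.
    observed = mpmath.mpf(float(observed))
    return float(abs((observed - reference) / reference))


ref = ReferenceNormal()
for x in (-8.0, -1.0, 0.5, 6.0):
    err = relative_error(stats.norm.cdf(x), ref.cdf(x))
    print(f"x = {x:+.1f}: relative error ~ {err:.1e}")
```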

I don't have a complete picture of what those benchmarks and documentation will look like. I have a vague notion of some aspects.

  • Documentation might contain statements like "Within subdomain $D$ (defined by e.g. a hyperrectangle), the method will produce results within relative error $\epsilon_r$ and absolute error $\epsilon_a$ with probability $p$" (where $p$ is some pre-defined reliability, like 99.99%). Presumably, there will be some small, simple-to-define base region with good accuracy (e.g. " $x \in [10^{-10}, 10^{10}]$, $c \in [1, 10^{10}]$ "), and maybe outer subdomains and hot spots with reduced accuracy.
  • Documentation might contain plots of error over the domain like these, which we could produce during benchmark runs; see the sketch after this list for one way such a plot could be generated. We need some tools to make it easy to generate standard ones.
  • We can make it easy for users to run benchmarks for themselves. It would be super cool to have an interactive, in-browser tool that investigates accuracy within a particular region (e.g. see plots above and imagine generating them interactively). @tupui is that in your wheelhouse?
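
For concreteness, here is one way an error-over-the-domain plot could be generated, using an mpmath reference for `scipy.stats.gamma.cdf` over a log-spaced hyperrectangle in $(x, a)$. The distribution, grid, and color scale are arbitrary illustration choices, not a proposed standard.

```python
# Sketch of an "error over the domain" plot: log10 relative error of
# scipy.stats.gamma.cdf on a log-spaced grid of (x, a), measured against an
# mpmath reference (the regularized lower incomplete gamma function).
import matplotlib.pyplot as plt
import mpmath
import numpy as np
from scipy import stats


def reference_gamma_cdf(x, a, dps=50):
    with mpmath.workdps(dps):
        return mpmath.gammainc(mpmath.mpf(float(a)), 0,
                               mpmath.mpf(float(x)), regularized=True)


xs = np.logspace(-2, 2, 50)       # evaluation points
shapes = np.logspace(-1, 2, 50)   # shape parameter a
log10_err = np.empty((shapes.size, xs.size))

for i, a in enumerate(shapes):
    for j, x in enumerate(xs):
        ref = reference_gamma_cdf(x, a)
        obs = mpmath.mpf(float(stats.gamma.cdf(x, a)))
        err = abs((obs - ref) / ref)
        # Floor exact agreement at 1e-17 so the log is finite for plotting.
        log10_err[i, j] = float(mpmath.log10(err)) if err > 0 else -17.0

plt.pcolormesh(xs, shapes, log10_err, shading="auto")
plt.xscale("log")
plt.yscale("log")
plt.xlabel("x")
plt.ylabel("shape parameter a")
plt.colorbar(label="log10 relative error of gamma.cdf")
plt.show()
```

A statement like the one in the first bullet could then be spot-checked by sampling points uniformly (in log space) from the subdomain $D$ and reporting the fraction that land within the stated tolerances.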

For some inspiration, see how Boost presents accuracy information, e.g. here. (However, I'd like to go quite a bit beyond that.) We should also coordinate with scipy/xsf to avoid duplicating effort, since I think we'll be trying to solve similar problems.

In any case, clearly there is a lot we can do while waiting for features in the infrastructure!
