I think there are two kinds of unit tests that distributions must pass to be included.
- An "identity" test confirms that the distribution is the one it is purported to be. These tests are written by the reviewer based on the primary reference provided by the distribution author.
- A generic suite of "consistency" tests provided by `scikit-stats` confirms that all methods of the distribution are consistent with one another.
For instance:
- An identity test of the normal distribution might compare the output of the `pdf` method against tabulated reference values or a very simple, independently written, and easily verified implementation included in the test suite (a sketch of both kinds of test follows this list).
- If the distribution overrides `_pdf_formula` and `_cdf_formula`, we test that `X.cdf(x, method='formula')` is close to `X.cdf(x, method='quadrature')` (computed from the PDF); if `_logcdf_formula` is also defined, `X.logcdf(x, method='formula')` is tested against `X.cdf(x, method='logexp')`.
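To make this concrete, here is a minimal sketch of what an identity test and a consistency test might look like. It is illustrative only: the `Normal` class, the `mu`/`sigma` parameter names, and the import path are assumptions; only the `pdf`/`cdf(x, method=...)` interface described above is taken from this issue.

```python
import numpy as np
from numpy.testing import assert_allclose

from scikit_stats import Normal  # hypothetical import path


def reference_normal_pdf(x, mu, sigma):
    # Independently written, easily verified implementation that serves
    # as the external reference for the identity test.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


def test_normal_identity():
    # Identity test: distribution-specific, written by the reviewer
    # against the primary reference.
    X = Normal(mu=0.5, sigma=2.0)
    x = np.linspace(-3.0, 3.0, 7)
    assert_allclose(X.pdf(x), reference_normal_pdf(x, 0.5, 2.0), rtol=1e-12)


def test_cdf_consistency():
    # Generic consistency test: the overridden CDF formula must agree
    # with numerical integration of the PDF.
    X = Normal(mu=0.5, sigma=2.0)
    x = np.linspace(-3.0, 3.0, 7)
    assert_allclose(X.cdf(x, method='formula'),
                    X.cdf(x, method='quadrature'), rtol=1e-7)
```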
The idea is that if 1) one of the overridden methods is checked against an external reference and 2) all of the other overridden methods are checked against one another, there is no need to test the other overridden methods against an external reference. In fact, to save developer and reviewer time, only one of the overridden methods should be tested against an external reference in a distribution-specific test, and the rest of the tests should be generic. If the author or reviewer is concerned that this is not thorough enough, the solution is to improve the generic tests, not to add tests against external references.
`scipy/stats/tests/test_continuous.py::TestDistributions` implements this strategy, and here are some things I like about how it turned out:
- The shape parameters are randomly drawn from a domain defined in the code of the distribution, and the arguments are randomly drawn from the support of the distribution. In addition, parameters/arguments are drawn from the boundaries of their domain/support and from outside their domain/support to check edge-case behavior. We get good coverage by drawing values stochastically from an appropriate distribution, not by hand-selecting numbers.
- The generic tests are very compact. For instance, one test, `test_funcs`, is parametrized over all distribution families and most of the methods (from `entropy` to `ilogccdf`), and it compares results with all the values of the `method` keyword argument (e.g. `'formula'`, `'log/exp'`, `'complement'`, `'quadrature'`) against one another in a loop (a rough sketch of this pattern, and of the stochastic argument drawing above, follows this list).
- The generic tests are quite thorough. Sometimes they are redundant because leaving them redundant is simpler than the alternative. For instance, I have no problem with testing both `allclose(X.cdf(x, method='formula'), X.ccdf(x, method='complement'))` *and* `allclose(X.ccdf(x, method='formula'), X.cdf(x, method='complement'))` rather than choosing only one. The tests can take as long as they want to run. The important thing is that we trust them to catch inconsistencies (within the tested domain) if there are any.
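The sketch below illustrates both points in miniature: arguments are drawn stochastically from the support (plus the endpoints and points just outside it), and a single helper compares every `method` against every other in a loop. The helper names, tolerances, and the finite-support assumption are illustrative; this is not the actual `TestDistributions` code.

```python
import itertools

import numpy as np
from numpy.testing import assert_allclose


def draw_arguments(rng, support, n=20):
    # Draw test points stochastically from the support, then add the
    # endpoints and a couple of points just outside it, rather than
    # hand-selecting numbers.  (Finite support assumed for brevity.)
    a, b = support
    inside = rng.uniform(a, b, size=n)
    return np.concatenate([inside, [a, b, a - 1.0, b + 1.0]])


def check_methods_agree(dist, func_name, x, methods, rtol=1e-7):
    # Evaluate one function (e.g. 'cdf') with every supported `method`
    # keyword and require all results to agree pairwise.
    func = getattr(dist, func_name)
    results = {m: func(x, method=m) for m in methods}
    for m1, m2 in itertools.combinations(methods, 2):
        assert_allclose(results[m1], results[m2], rtol=rtol,
                        err_msg=f"{func_name}: {m1} vs {m2}")
```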
Here are some things I don't like about how it turned out:
- The relative frequencies with which we test values within the domain/support, on the edge of the domain/support, outside the domain/support, and NaN need to be considered more carefully. For instance, when a distribution has several shape parameters and each of them has a fixed probability of being outside the domain, the probability that at least one is outside the domain can get quite high (with three parameters each outside 20% of the time, at least one is outside in 1 - 0.8^3 ≈ 49% of cases), and we could wind up testing mostly that NaN = NaN.
- Sometimes generic tests fail through no fault of the distribution implementation. For instance, the generic method for computing the mode is numerical optimization of the PDF (`method='optimization'`). For the `Uniform` distribution, `_mode_formula` returns `(a + b)/2`, but numerical optimization would not be wrong to pick any other point. As another example, the generic method for computing the CDF is numerical integration of the PDF (`method='quadrature'`). If the PDF is tricky to integrate, occasionally `hypothesis` will find a case that fails to meet tolerances. We need to come up with a strategy that makes it easy to allow for occasional failures without every one of them requiring developer attention.
- Parametrization is nice because it makes the tests relatively concise and easy to maintain, but when a failure does occur, it is hard to reproduce even with the `hypothesis` `reproduce_failure` decorator. There should be an easy way of reproducing failures outside of the test suite so we can more easily poke at the situation.
For these reasons, I am leaning toward not using `hypothesis` and instead writing our own lightweight framework for the property-based tests we need.
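As one possible shape for the core loop of such a framework, here is a minimal sketch focused on the "reproducible outside the test suite" property discussed above: every trial gets its own integer seed, and a failure reports that seed so the exact case can be replayed in a plain script. All names here are hypothetical.

```python
import numpy as np


def run_property(check, n_trials=100, seed=None):
    # Run `check(rng)` many times, each with an independently seeded
    # generator.  On failure, report the trial seed so the case can be
    # replayed outside the test suite with
    #     check(np.random.default_rng(trial_seed))
    root = np.random.default_rng(seed)
    for _ in range(n_trials):
        trial_seed = int(root.integers(2**63))
        try:
            check(np.random.default_rng(trial_seed))
        except AssertionError as exc:
            raise AssertionError(
                f"property failed for trial seed {trial_seed}"
            ) from exc
```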
Note that the tests described here are not intended to check the precision of the results or reliability for extreme inputs. Those qualities will be assessed with benchmarks, which do not "pass" or "fail"; instead, the benchmark results are clearly communicated to the user so they know what sort of accuracy to expect within a given domain. I'll write about that in a separate issue.