I think there are two kinds of unit tests that distributions must pass to be included.
- An "identity" test confirms that the distribution is the one it is purported to be. These tests are written by the reviewer based on the primary reference provided by the distribution author.
- A generic suite of "consistency" tests provided by `scikit-stats` confirms that all methods of the distribution are consistent with one another.
For instance:
- An identity test of the normal distribution might compare the output of the `pdf` method against tabulated reference values or a very simple, independently written, and easily verified implementation included in the test suite (a sketch of both kinds of test follows this list).
- If the distribution overrides `_pdf_formula` and `_cdf_formula`, we test that `X.cdf(x, method='formula')` is close to `X.cdf(x, method='quadrature')` (computed from the PDF); if `_logcdf_formula` is also defined, `X.logcdf(x, method='formula')` is tested against `X.cdf(x, method='logexp')`.
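To make this concrete, here is a minimal sketch of what an identity test and a consistency test might look like. It is illustrative only: the `Normal` class, the `mu`/`sigma` parameter names, and the import path are assumptions; only the `pdf`/`cdf(x, method=...)` interface described above is taken from this issue.

```python
import numpy as np
from numpy.testing import assert_allclose

from scikit_stats import Normal  # hypothetical import path


def reference_normal_pdf(x, mu, sigma):
    # Independently written, easily verified implementation that serves
    # as the external reference for the identity test.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


def test_normal_identity():
    # Identity test: distribution-specific, written by the reviewer
    # against the primary reference.
    X = Normal(mu=0.5, sigma=2.0)
    x = np.linspace(-3.0, 3.0, 7)
    assert_allclose(X.pdf(x), reference_normal_pdf(x, 0.5, 2.0), rtol=1e-12)


def test_cdf_consistency():
    # Generic consistency test: the overridden CDF formula must agree
    # with numerical integration of the PDF.
    X = Normal(mu=0.5, sigma=2.0)
    x = np.linspace(-3.0, 3.0, 7)
    assert_allclose(X.cdf(x, method='formula'),
                    X.cdf(x, method='quadrature'), rtol=1e-7)
```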
The idea is that if 1) one of the overridden methods is checked against an external reference and 2) all of the other overridden methods are checked against one another, there is no need to test the other overridden methods against an external reference. In fact, to save developer and reviewer time, only one of the overridden methods should be tested against an external reference in a distribution-specific test, and the rest of the tests should be generic. If the author or reviewer is concerned that this is not thorough enough, the solution is to improve the generic tests, not to add tests against external references.
`scipy/stats/tests/test_continuous.py::TestDistributions` implements this strategy, and here are some things I like about how it turned out:
- The shape parameters are randomly drawn from a domain defined in the code of the distribution, and the arguments are randomly drawn from the support of the distribution. In addition, parameters/arguments are drawn from the boundaries of their domain/support and from outside their domain/support to check edge-case behavior. We get good coverage by drawing values stochastically from an appropriate distribution, not by hand-selecting numbers.
- The generic tests are very compact. For instance, one test, `test_funcs`, is parametrized over all distribution families and most of the methods (from `entropy` to `ilogccdf`), and it compares results with all the values of the `method` keyword argument (e.g. `'formula'`, `'log/exp'`, `'complement'`, `'quadrature'`) against one another in a loop (a rough sketch of this pattern, and of the stochastic argument drawing above, follows this list).
- The generic tests are quite thorough. Sometimes they are redundant because leaving them redundant is simpler than the alternative. For instance, I have no problem with testing both `allclose(X.cdf(x, method='formula'), X.ccdf(x, method='complement'))` *and* `allclose(X.ccdf(x, method='formula'), X.cdf(x, method='complement'))` rather than choosing only one. The tests can take as long as they want to run. The important thing is that we trust them to catch inconsistencies (within the tested domain) if there are any.
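The sketch below illustrates both points in miniature: arguments are drawn stochastically from the support (plus the endpoints and points just outside it), and a single helper compares every `method` against every other in a loop. The helper names, tolerances, and the finite-support assumption are illustrative; this is not the actual `TestDistributions` code.

```python
import itertools

import numpy as np
from numpy.testing import assert_allclose


def draw_arguments(rng, support, n=20):
    # Draw test points stochastically from the support, then add the
    # endpoints and a couple of points just outside it, rather than
    # hand-selecting numbers.  (Finite support assumed for brevity.)
    a, b = support
    inside = rng.uniform(a, b, size=n)
    return np.concatenate([inside, [a, b, a - 1.0, b + 1.0]])


def check_methods_agree(dist, func_name, x, methods, rtol=1e-7):
    # Evaluate one function (e.g. 'cdf') with every supported `method`
    # keyword and require all results to agree pairwise.
    func = getattr(dist, func_name)
    results = {m: func(x, method=m) for m in methods}
    for m1, m2 in itertools.combinations(methods, 2):
        assert_allclose(results[m1], results[m2], rtol=rtol,
                        err_msg=f"{func_name}: {m1} vs {m2}")
```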
Here are some things I don't like about how it turned out:
- The relative frequencies with which we test values within the domain/support, on the edge of the domain/support, outside the domain/support, and NaN need to be considered more carefully. For instance, when a distribution has several shape parameters and each of them has a fixed probability of being outside the domain, the probability that at least one is outside the domain can get quite high (with three parameters each outside 20% of the time, at least one is outside in 1 - 0.8^3 ≈ 49% of cases), and we could wind up testing mostly that NaN = NaN.
- Sometimes generic tests fail through no fault of the distribution implementation. For instance, the generic method for computing the mode is numerical optimization of the PDF (`method='optimization'`). For the `Uniform` distribution, `_mode_formula` returns `(a + b)/2`, but numerical optimization would not be wrong to pick any other point. As another example, the generic method for computing the CDF is numerical integration of the PDF (`method='quadrature'`). If the PDF is tricky to integrate, occasionally `hypothesis` will find a case that fails to meet tolerances. We need to come up with a strategy that makes it easy to allow for occasional failures without every one of them requiring developer attention.
- Parametrization is nice because it makes the tests relatively concise and easy to maintain, but when a failure does occur, it is hard to reproduce even with the `hypothesis` `reproduce_failure` decorator. There should be an easy way of reproducing failures outside of the test suite so we can more easily poke at the situation.
For these reasons, I am leaning toward not using `hypothesis` and instead writing our own lightweight framework for the property-based tests we need.
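As one possible shape for the core loop of such a framework, here is a minimal sketch focused on the "reproducible outside the test suite" property discussed above: every trial gets its own integer seed, and a failure reports that seed so the exact case can be replayed in a plain script. All names here are hypothetical.

```python
import numpy as np


def run_property(check, n_trials=100, seed=None):
    # Run `check(rng)` many times, each with an independently seeded
    # generator.  On failure, report the trial seed so the case can be
    # replayed outside the test suite with
    #     check(np.random.default_rng(trial_seed))
    root = np.random.default_rng(seed)
    for _ in range(n_trials):
        trial_seed = int(root.integers(2**63))
        try:
            check(np.random.default_rng(trial_seed))
        except AssertionError as exc:
            raise AssertionError(
                f"property failed for trial seed {trial_seed}"
            ) from exc
```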
Note that the tests described here are not intended to check the precision of the results or reliability for extreme inputs. Those qualities will be assessed with benchmarks, which do not "pass" or "fail"; instead, the benchmark results are clearly communicated to the user so they know what sort of accuracy to expect within a given domain. I'll write about that in a separate issue.