Better Workflow Surrounding Specification of Y Argument #122

@flyaflya

Background

The "Modeling Heteroscedasticity with BART" example showcases two amazing abilities of PyMC-BART:

  1. The ability to have non-normal error distributions for BART RVs, like the pm.Gamma distribution used in the example, and
  2. The ability to model functions other than the mean, such as a non-constant variance.

While the first ability is easily understood and a small step from existing work, the latter is both novel and amazing. I would love to use this ability in my work, but the implementation feels a little wonky to me, for lack of a better word. Additionally, the PyMC-BART paper [Quiroga et al., 2022] does not completely explain how this works.

The Issue

It is my understanding that when growing trees to estimate the variance, every proposed tree has leaf values drawn from $N(\mu_{pred},\epsilon^2)$, where $\mu_{pred}$ is the mean of the current sum of trees divided by the number of trees $m$. The problem I see is that $\mu_{pred}$ is (at least initially) based on the $Y$ values, which seem inappropriate as starting points: the $Y$ values in the example are initial guesses at "sales" and are likely to be very bad initial guesses for "standard deviation of sales".
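To make the scale mismatch concrete, here is a minimal NumPy sketch of the initialization as described above. It does not call PyMC-BART. The synthetic "sales" data, the value `m = 50`, and the rule that every tree starts as a single leaf holding `mean(Y) / m` are all assumptions made for illustration, not confirmed sampler behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: "sales" centered near 15 with a noise scale near 2.
n = 200
sales = rng.normal(loc=15.0, scale=2.0, size=n)

m = 50  # number of trees (assumed value)

# Assumed initialization: each tree is a single leaf holding mean(Y) / m,
# so the initial sum of trees equals mean(Y) at every observation.
initial_leaf = sales.mean() / m
initial_sum_of_trees = np.full(n, initial_leaf * m)

# Compare the starting value of the "standard deviation" function
# with the actual spread of the data.
print("initial sum of trees:", initial_sum_of_trees[0])
print("sample sd of Y:      ", sales.std(ddof=1))
```

Under these assumptions, the function meant to represent the standard deviation starts on the scale of the mean of sales, several times larger than the spread of the data.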

For the example, these initial guesses of "sales" are much higher than any reasonable initial guess for the standard deviation. Hence, in the plot of the fitted 94% HDI band, it is not surprising that the band is too wide. We should expect around 12 observations to fall outside the 94% HDI band (6% of the data), but in reality only 2 or 3 do. I am guessing the estimate of the standard deviation is systematically too high because of the initial conditions.
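The coverage check behind that claim can be sketched as follows. The helper names `count_outside` and `expected_outside` are hypothetical, and the sample size of 200 is only inferred from the "around 12" figure (12 / 0.06), so treat it as an assumption rather than a property of the example's dataset.

```python
import numpy as np


def count_outside(y, lower, upper):
    """Count observations that fall outside their interval [lower, upper]."""
    y, lower, upper = (np.asarray(a, dtype=float) for a in (y, lower, upper))
    return int(np.sum((y < lower) | (y > upper)))


def expected_outside(n_obs, hdi_prob=0.94):
    """Expected number of observations outside a well-calibrated band."""
    return n_obs * (1.0 - hdi_prob)


# Toy check: the second and third points lie outside their intervals.
toy_count = count_outside([1.0, 5.0, 10.0], [0.0, 6.0, 0.0], [2.0, 8.0, 9.0])

# Assumed sample size of 200, giving roughly 12 expected misses.
expected = expected_outside(200)
```

With real output, `lower` and `upper` would be the per-observation bounds of the posterior predictive HDI. An observed count far below the expected one suggests the band is too wide.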

Thoughts on Implementation

So, with all of the above I am requesting:

  1. Allow for a PyMC RV prior (or multiple priors when size > 1) to be specified on leaf-node values, and/or
  2. Better document how leaf-node values are currently computed, so that at least the mechanism is more transparent.

Thanks for all the hard work. Plugging BART models in as components of larger probabilistic programs is very much a winning idea and should dominate applied workflows where uncertainty quantification is important. I would love to see this cleaned up a little, and I am happy to help with documentation or code changes, but I need more direction on the math/sampling side of things. Thanks again.
