Better Workflow Surrounding Specification of `Y` Argument #122

## Background
The [Modeling Heteroscedasticity with BART example](https://www.pymc.io/projects/examples/en/latest/case_studies/bart_heteroscedasticity.html) showcases two amazing abilities of PyMC-BART:

1) The ability to have non-normal error distributions for BART RVs, like the `pm.Gamma` distribution that is used, and
2) the ability to model functions other than the mean, like the modeling of non-constant variance.

While the first ability is easily understood and a small step from existing work, the latter ability is both novel and amazing. I would love to use it in my work, but the implementation feels a little wonky to me, for lack of a better word. Additionally, the PyMC-BART paper [Quiroga et al., 2022] does not completely explain how this works.

## The Issue
It is my understanding that when growing trees for the estimated variance, every proposed tree has leaf values drawn from $N(\mu_{pred},\epsilon^2)$, where $\mu_{pred}$ is the mean of the current sum of trees divided by the number of trees $m$. The problem I see is that $\mu_{pred}$ is (at least initially) based on $Y$ values that seem inappropriate as initial values: the $Y$ values in the example are initial guesses at "sales" and are likely to be very BAD initial guesses for "standard deviation of sales".
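A minimal sketch of the mechanism as described above may help pin down the concern. It is not PyMC-BART's actual code: the function name `propose_leaf_value`, the value of `eps`, the made-up `y` data, and the assumption that the sum of trees starts at the mean of `Y` are all illustrative.

```python
import random
from statistics import fmean


def propose_leaf_value(sum_of_trees, m, eps, rng):
    """Draw one leaf value from N(mu_pred, eps^2), as described in this issue."""
    # mu_pred: mean of the current sum of trees, divided by the number of trees
    mu_pred = fmean(sum_of_trees) / m
    return rng.gauss(mu_pred, eps)


rng = random.Random(0)
m = 100  # number of trees (illustrative)

# Made-up "sales"-scale responses; NOT the data from the example.
y = [rng.uniform(5.0, 25.0) for _ in range(200)]

# Assumption: the sum of trees for the variance component starts at mean(Y).
initial_sum_of_trees = [fmean(y)] * len(y)

mu_pred = fmean(initial_sum_of_trees) / m
leaf = propose_leaf_value(initial_sum_of_trees, m, eps=0.1, rng=rng)

# m leaves centred on mu_pred add back up to roughly mean(Y): the "sales" scale.
implied_start = m * mu_pred
```

Under these assumptions, `implied_start` lands on the scale of "sales" rather than on the scale of its standard deviation, which is the mismatch this issue is about.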

For the [example](https://www.pymc.io/projects/examples/en/latest/case_studies/bart_heteroscedasticity.html), these "sales" values are much higher than any reasonable initial guess for the standard deviation. Hence, in this picture:

![](https://www.pymc.io/projects/examples/en/latest/_images/363707ca9a601fce00b3f64c82ced18e397727a2b4b3322bb17930494a8b9687.png)

it is not surprising that the 94% HDI is too wide. We should expect around 12 observations to fall outside the 94% HDI band, but in reality only 2 or 3 do. I am guessing the estimate of standard deviation is systematically too high because of the initial conditions.
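The coverage arithmetic can be sketched as follows. The figure of 200 observations is an assumption, chosen because it is what 12 expected misses at 6% implies. The helper names are hypothetical, and the toy band in the usage lines is made up rather than taken from the example.

```python
def expected_outside(n_obs, hdi_prob):
    """Expected number of observations outside a well-calibrated interval."""
    return n_obs * (1.0 - hdi_prob)


def count_outside(y, lower, upper):
    """Count observations that fall below the lower or above the upper bound."""
    return sum(1 for yi, lo, hi in zip(y, lower, upper) if yi < lo or yi > hi)


# Assumed sample size of 200: 6% of 200 is about 12 observations.
n_expected = round(expected_outside(200, 0.94))

# Toy check with a made-up band: 5 is below its lower bound, 10 is above its upper bound.
n_missed = count_outside([1, 5, 10], lower=[0, 6, 0], upper=[2, 8, 9])
```

Running `count_outside` on the observed sales against the posterior predictive HDI bounds would turn the visual impression of "2 or 3" into an exact count.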

## Thoughts on Implementation
So, with all of the above in mind, I am requesting:
1) Allow a PyMC RV prior (or multiple priors when `size > 1`) to be specified on leaf-node values, and/or
2) Better document how leaf-node values are currently computed, so that at least the mechanism is more transparent.

Thanks for all the hard work. Plugging BART models in as components of larger probabilistic programs is very much a winning idea and should dominate applied workflows where uncertainty quantification is important. I would love to see this cleaned up a little bit, and I am happy to help with documentation or code changes, but I need more direction on the math/sampling side of things. Thanks again.
