
BART model save, reload and new predictions #123

@twj8CDC

Description

Hi, I have been trying to save, reload, and generate new predictions with a model that includes a BARTRV.

I can save the trace as a pickle (netCDF works too), instantiate a new model, and get posterior predictions on the training data. However, when I try to swap in new data I get shape errors. These errors are odd because, in the session where the model was trained, I can update the data and get predictions without any issues; it is only in the newly instantiated model that updating the input data fails.

Below is a minimal example:

from pathlib import Path

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import pymc_bart as pmb

import cloudpickle as cpkl
import dill

print(f"Running on PyMC v{pm.__version__}")
print(f"Running on PyMC-BART v{pmb.__version__}")
try:
    bikes = pd.read_csv(Path("..", "data", "bikes.csv"))
except FileNotFoundError:
    bikes = pd.read_csv(pm.get_data("bikes.csv"))

features = ["hour", "temperature", "humidity", "workingday"]

X = bikes[features]
Y = bikes["count"]

xt = X[0:10]  # small held-out slice used later for out-of-sample predictions
yt = Y[0:10]

with pm.Model() as model_bikes:
    xdata = pm.MutableData("xdata", X)
    a = pm.Exponential("a", 1)
    # BART prior over the log of the expected count
    mu_ = pmb.BART("mu_", xdata, np.log(Y), m=20)
    mu = pm.Deterministic("mu", pm.math.exp(mu_))
    # shape is tied to xdata so the likelihood resizes when the data is swapped
    y = pm.NegativeBinomial("y", mu=mu, alpha=a, observed=Y, shape=xdata.shape[0])
    idata_bikes = pm.sample(random_seed=99, draws=100, tune=100, compute_convergence_checks=False)
idata_bikes

Pickle the idata instead of saving to netCDF; this seems to work fine:

# pickle the InferenceData to disk
with open('test4.pkl', mode='wb') as file:
    cpkl.dump(idata_bikes, file)

with open("test4.pkl", mode="rb") as file:
    idata4 = cpkl.load(file)
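For completeness, the netCDF route mentioned above would look roughly like this (a sketch; the "test4.nc" filename is just a placeholder):

# save / reload the same InferenceData via netCDF instead of pickle
idata_bikes.to_netcdf("test4.nc")
idata4_nc = az.from_netcdf("test4.nc")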

Posterior predictions on the updated data work with the original model, using both the original idata and the saved-and-reloaded idata:

with model_bikes:
    pm.set_data({"xdata": xt})
    post1 = pm.sample_posterior_predictive(idata_bikes, var_names=["mu", "y"])

with model_bikes:
    pm.set_data({"xdata": xt})
    post2 = pm.sample_posterior_predictive(idata4, var_names=["mu", "y"])
print(post1.posterior_predictive["mu"].values.mean((0,1)))
print(post2.posterior_predictive["mu"].values.mean((0,1)))

Restart the session to test loading from a clean slate, and reload the data from above:

from pathlib import Path

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import pymc_bart as pmb

import cloudpickle as cpkl
import dill

print(f"Running on PyMC v{pm.__version__}")
print(f"Running on PyMC-BART v{pmb.__version__}")
try:
    bikes = pd.read_csv(Path("..", "data", "bikes.csv"))
except FileNotFoundError:
    bikes = pd.read_csv(pm.get_data("bikes.csv"))

features = ["hour", "temperature", "humidity", "workingday"]

X = bikes[features]
Y = bikes["count"]

xt = X[0:10]
yt = Y[0:10]

Specify the new model; the only intended difference from the original is the Python variable names:

with pm.Model() as model2:
    # same structure, priors, and variable names as model_bikes above;
    # only the Python identifiers differ
    xdata2 = pm.MutableData("xdata", X)
    a2 = pm.Exponential("a", 1)
    mu_2 = pmb.BART("mu_", xdata2, np.log(Y), m=20)
    mu2 = pm.Deterministic("mu", pm.math.exp(mu_2))
    y2 = pm.NegativeBinomial("y", mu=mu2, alpha=a2, observed=Y, shape=xdata2.shape[0])

Load the saved idata:

with open("test4.pkl", mode="rb") as file:
    idata4 = cpkl.load(file)

Get posterior predictions on the training data:

with model2:
    post5 = pm.sample_posterior_predictive(idata4, var_names=["mu", "y"])

This works, apart from a slight difference in the predicted values, possibly due to a difference in random state. post5 compares well to post1 and post2 above.
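To rule out random state as the source of the small discrepancy, a fixed seed can be passed to the posterior-predictive call (a sketch; the seed value 123 is arbitrary):

with model2:
    # pm.sample_posterior_predictive accepts random_seed, so repeated runs are reproducible
    post5_seeded = pm.sample_posterior_predictive(idata4, var_names=["mu", "y"], random_seed=123)
print(post5_seeded.posterior_predictive["mu"].values.mean((0, 1)))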

Get the posterior predictions with new data:

with model2:
    pm.set_data({"xdata": xt})
    post4 = pm.sample_posterior_predictive(idata4, var_names=["mu", "y"])

This fails with the following error:

ValueError: size does not match the broadcast shape of the parameters. (10,), (10,), (348,)
Apply node that caused the error: nbinom_rv{0, (0, 0), int64, True}(RandomGeneratorSharedVariable(<Generator(PCG64) at 0x7F1CCA3F2A40>), MakeVector{dtype='int64'}.0, 4, a, Composite{...}.1)
Toposort index: 5
Inputs types: [RandomGeneratorType, TensorType(int64, shape=(1,)), TensorType(int64, shape=()), TensorType(float64, shape=()), TensorType(float64, shape=(None,))]
Inputs shapes: ['No shapes', (1,), (), (), (348,)]
Inputs strides: ['No strides', (8,), (), (), (8,)]
Inputs values: [Generator(PCG64) at 0x7F1CCA3F2A40, array([10]), array(4), array(1.50162583), 'not shown']
Outputs clients: [['output'], ['output']]

HINT: Re-running with most PyTensor optimizations disabled could provide a back-trace showing when this node was created. This can be done by setting the PyTensor flag 'optimizer=fast_compile'. If that does not work, PyTensor optimizations can be disabled with 'optimizer=None'.
HINT: Use the PyTensor flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.
...

I can't figure out where this shape error arises. The trained model specified at the top allows updating the data without issues, so I am not sure whether there is a general problem with the way the model is specified.
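For what it is worth, checking the shared data after set_data suggests the swap itself does happen (a sketch; indexing the model by variable name and the expected shapes are my assumptions):

with model2:
    pm.set_data({"xdata": xt})
# the shared "xdata" variable should now hold the 10-row slice...
print(model2["xdata"].get_value().shape)  # expected: (10, 4)
# ...yet the error above still reports a (348,)-shaped parameter,
# i.e. the number of rows in the full training data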

Is there a different process for saving and reloading a model with a BARTRV?

I also tried pickling the whole model, but that does not work because of the multiprocessing components in the BART object; I get a socket error when trying to reload the pickled object.
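For reference, the whole-model attempt (in the original training session) looked roughly like this; the bundling into a dict and the "model_and_trace.pkl" filename are incidental:

# attempt to pickle the fitted model object alongside its trace
with open("model_and_trace.pkl", mode="wb") as file:
    cpkl.dump({"model": model_bikes, "idata": idata_bikes}, file)

# reloading this in a fresh session is what raises the socket error,
# presumably from the multiprocessing machinery inside the BART RV
with open("model_and_trace.pkl", mode="rb") as file:
    bundle = cpkl.load(file)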

I have also posted this on Discourse, as I was not sure where it makes the most sense to discuss this issue.

Thanks!
