Hi, I have been trying to save, reload, and generate new predictions with a model that includes a BARTRV.
I can save the trace as a pickle (NetCDF works too), then instantiate a new model and get posterior predictions on the training data. But when I try to swap in new data, I get shape errors. The shape errors are odd, because with the originally trained model I can update the data for predictions without any issue; it is only with the newly instantiated model that I am unable to update the input data.
Below is a minimal example:
```python
from pathlib import Path

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import pymc_bart as pmb
import cloudpickle as cpkl
import dill

print(f"Running on PyMC v{pm.__version__}")
print(f"Running on PyMC-BART v{pmb.__version__}")

try:
    bikes = pd.read_csv(Path("..", "data", "bikes.csv"))
except FileNotFoundError:
    bikes = pd.read_csv(pm.get_data("bikes.csv"))

features = ["hour", "temperature", "humidity", "workingday"]
X = bikes[features]
Y = bikes["count"]

# a small slice to use as "new" data later
xt = X[0:10]
yt = Y[0:10]

with pm.Model() as model_bikes:
    xdata = pm.MutableData("xdata", X)
    a = pm.Exponential("a", 1)
    mu_ = pmb.BART("mu_", xdata, np.log(Y), m=20)
    mu = pm.Deterministic("mu", pm.math.exp(mu_))
    # shape follows xdata so that y can resize when pm.set_data is called
    y = pm.NegativeBinomial("y", mu=mu, alpha=a, observed=Y, shape=xdata.shape[0])
    idata_bikes = pm.sample(random_seed=99, draws=100, tune=100, compute_convergence_checks=False)

idata_bikes
```
I pickle here instead of using NetCDF, but both seem to work fine:

```python
with open("test4.pkl", mode="wb") as file:
    cpkl.dump(idata_bikes, file)

with open("test4.pkl", mode="rb") as file:
    idata4 = cpkl.load(file)
```
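For what it's worth, the round-trip itself appears sound. A minimal stand-in check with plain `pickle` and a dict of arrays (hypothetical values standing in for the real `InferenceData`) round-trips exactly:

```python
import pickle

import numpy as np

# a tiny stand-in for the InferenceData: named arrays keyed by variable
fake_idata = {"mu": np.linspace(0.0, 1.0, 348), "a": np.array(1.5)}

blob = pickle.dumps(fake_idata)          # serialize to bytes
restored = pickle.loads(blob)            # deserialize

print(np.array_equal(restored["mu"], fake_idata["mu"]))  # arrays survive intact
```

So the data itself is preserved; the problem seems to come from how the reloaded trace interacts with the new model.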
Posterior predictions on updated data work with the original model, both with the original idata and with the saved-and-reloaded idata:

```python
with model_bikes:
    pm.set_data({"xdata": xt})
    post1 = pm.sample_posterior_predictive(idata_bikes, var_names=["mu", "y"])

with model_bikes:
    pm.set_data({"xdata": xt})
    post2 = pm.sample_posterior_predictive(idata4, var_names=["mu", "y"])

print(post1.posterior_predictive["mu"].values.mean((0, 1)))
print(post2.posterior_predictive["mu"].values.mean((0, 1)))
```
Now restart the session to test the load from a clean slate, and reload the data from above:
```python
from pathlib import Path

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import pymc_bart as pmb
import cloudpickle as cpkl
import dill

print(f"Running on PyMC v{pm.__version__}")
print(f"Running on PyMC-BART v{pmb.__version__}")

try:
    bikes = pd.read_csv(Path("..", "data", "bikes.csv"))
except FileNotFoundError:
    bikes = pd.read_csv(pm.get_data("bikes.csv"))

features = ["hour", "temperature", "humidity", "workingday"]
X = bikes[features]
Y = bikes["count"]

xt = X[0:10]
yt = Y[0:10]
```
Specify the new model. The only intended difference is the Python variable names (note, though, that `m=50` here vs `m=20` above):

```python
with pm.Model() as model2:
    xdata2 = pm.MutableData("xdata", X)
    a2 = pm.Exponential("a", 1)
    mu_2 = pmb.BART("mu_", xdata2, np.log(Y), m=50)
    mu2 = pm.Deterministic("mu", pm.math.exp(mu_2))
    y2 = pm.NegativeBinomial("y", mu=mu2, alpha=a2, observed=Y, shape=xdata2.shape[0])
```
Load the saved idata:

```python
with open("test4.pkl", mode="rb") as file:
    idata4 = cpkl.load(file)
```
Get posterior predictions on the training data:

```python
with model2:
    post5 = pm.sample_posterior_predictive(idata4, var_names=["mu", "y"])
```
This works, apart from a slight difference in predicted values, possibly due to a difference in random state? post5 compares well to post1 and post2 above.
Get the posterior predictions with new data:

```python
with model2:
    pm.set_data({"xdata": xt})
    post4 = pm.sample_posterior_predictive(idata4, var_names=["mu", "y"])
```
This fails with the following error:

```
ValueError: size does not match the broadcast shape of the parameters. (10,), (10,), (348,)
Apply node that caused the error: nbinom_rv{0, (0, 0), int64, True}(RandomGeneratorSharedVariable(<Generator(PCG64) at 0x7F1CCA3F2A40>), MakeVector{dtype='int64'}.0, 4, a, Composite{...}.1)
Toposort index: 5
Inputs types: [RandomGeneratorType, TensorType(int64, shape=(1,)), TensorType(int64, shape=()), TensorType(float64, shape=()), TensorType(float64, shape=(None,))]
Inputs shapes: ['No shapes', (1,), (), (), (348,)]
Inputs strides: ['No strides', (8,), (), (), (8,)]
Inputs values: [Generator(PCG64) at 0x7F1CCA3F2A40, array([10]), array(4), array(1.50162583), 'not shown']
Outputs clients: [['output'], ['output']]

HINT: Re-running with most PyTensor optimizations disabled could provide a back-trace showing when this node was created. This can be done by setting the PyTensor flag 'optimizer=fast_compile'. If that does not work, PyTensor optimizations can be disabled with 'optimizer=None'.
HINT: Use the PyTensor flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.
...
```
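The mismatch in the error can be reproduced in plain NumPy: requesting `size=10` negative-binomial draws while a parameter array is still sized for the 348 training rows fails the same way. (The `mu` and `alpha` values and the parameterization below are made up purely for illustration, not taken from the model.)

```python
import numpy as np

rng = np.random.default_rng(99)

# mu is still sized for the 348 training rows, but we ask for 10 draws
mu = np.full(348, 5.0)
alpha = 1.5
p = alpha / (alpha + mu)  # one common NB parameterization, for illustration only

try:
    rng.negative_binomial(alpha, p, size=10)
    error = None
except ValueError as err:
    error = err

print(type(error).__name__)
```

This suggests that after reloading, something in the graph is still emitting the training-size (348,) output for `mu` even though the requested output size is 10.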
I can't figure out where this shape error arises. The trained model specified at the top allows updating the data without issues, so I am not sure whether there is a general problem with the way the model is specified.
Is there a different process for saving and reloading a model with a BARTRV?
I also tried pickling the whole model, but that doesn't work because of the multiprocessing components in the BART object; I get a socket error when trying to reload the pickled object.
I have also posted this on Discourse, as I was not sure where it makes the most sense to discuss this issue.
Thanks!