Commit 929453b (parent 135e66d)

Pushing the docs to dev/ for branch: main, commit 9f8668182ab1492923057aca05ce6f2de38af02d
File tree

1,546 files changed (+6489 / -6606 lines)

dev/_downloads/067cd5d39b097d2c49dd98f563dac13a/plot_iterative_imputer_variants_comparison.ipynb

Lines changed: 2 additions & 2 deletions
@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"\n# Imputing missing values with variants of IterativeImputer\n\n.. currentmodule:: sklearn\n\nThe :class:`~impute.IterativeImputer` class is very flexible - it can be\nused with a variety of estimators to do round-robin regression, treating every\nvariable as an output in turn.\n\nIn this example we compare some estimators for the purpose of missing feature\nimputation with :class:`~impute.IterativeImputer`:\n\n* :class:`~linear_model.BayesianRidge`: regularized linear regression\n* :class:`~ensemble.RandomForestRegressor`: Forests of randomized trees regression\n* :func:`~pipeline.make_pipeline` (:class:`~kernel_approximation.Nystroem`,\n :class:`~linear_model.Ridge`): a pipeline with the expansion of a degree 2\n polynomial kernel and regularized linear regression\n* :class:`~neighbors.KNeighborsRegressor`: comparable to other KNN\n imputation approaches\n\nOf particular interest is the ability of\n:class:`~impute.IterativeImputer` to mimic the behavior of missForest, a\npopular imputation package for R.\n\nNote that :class:`~neighbors.KNeighborsRegressor` is different from KNN\nimputation, which learns from samples with missing values by using a distance\nmetric that accounts for missing values, rather than imputing them.\n\nThe goal is to compare different estimators to see which one is best for the\n:class:`~impute.IterativeImputer` when using a\n:class:`~linear_model.BayesianRidge` estimator on the California housing\ndataset with a single value randomly removed from each row.\n\nFor this particular pattern of missing values we see that\n:class:`~linear_model.BayesianRidge` and\n:class:`~ensemble.RandomForestRegressor` give the best results.\n\nIt should be noted that some estimators such as\n:class:`~ensemble.HistGradientBoostingRegressor` can natively deal with\nmissing features and are often recommended over building pipelines with\ncomplex and costly missing values imputation strategies.\n"
+"\n# Imputing missing values with variants of IterativeImputer\n\n.. currentmodule:: sklearn\n\nThe :class:`~impute.IterativeImputer` class is very flexible - it can be\nused with a variety of estimators to do round-robin regression, treating every\nvariable as an output in turn.\n\nIn this example we compare some estimators for the purpose of missing feature\nimputation with :class:`~impute.IterativeImputer`:\n\n* :class:`~linear_model.BayesianRidge`: regularized linear regression\n* :class:`~ensemble.RandomForestRegressor`: forests of randomized trees regression\n* :func:`~pipeline.make_pipeline` (:class:`~kernel_approximation.Nystroem`,\n :class:`~linear_model.Ridge`): a pipeline with the expansion of a degree 2\n polynomial kernel and regularized linear regression\n* :class:`~neighbors.KNeighborsRegressor`: comparable to other KNN\n imputation approaches\n\nOf particular interest is the ability of\n:class:`~impute.IterativeImputer` to mimic the behavior of missForest, a\npopular imputation package for R.\n\nNote that :class:`~neighbors.KNeighborsRegressor` is different from KNN\nimputation, which learns from samples with missing values by using a distance\nmetric that accounts for missing values, rather than imputing them.\n\nThe goal is to compare different estimators to see which one is best for the\n:class:`~impute.IterativeImputer` when using a\n:class:`~linear_model.BayesianRidge` estimator on the California housing\ndataset with a single value randomly removed from each row.\n\nFor this particular pattern of missing values we see that\n:class:`~linear_model.BayesianRidge` and\n:class:`~ensemble.RandomForestRegressor` give the best results.\n\nIt should be noted that some estimators such as\n:class:`~ensemble.HistGradientBoostingRegressor` can natively deal with\nmissing features and are often recommended over building pipelines with\ncomplex and costly missing values imputation strategies.\n"
 ]
 },
 {
@@ -15,7 +15,7 @@
 },
 "outputs": [],
 "source": [
-"# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nfrom sklearn.datasets import fetch_california_housing\nfrom sklearn.ensemble import RandomForestRegressor\n\n# To use this experimental feature, we need to explicitly ask for it:\nfrom sklearn.experimental import enable_iterative_imputer # noqa: F401\nfrom sklearn.impute import IterativeImputer, SimpleImputer\nfrom sklearn.kernel_approximation import Nystroem\nfrom sklearn.linear_model import BayesianRidge, Ridge\nfrom sklearn.model_selection import cross_val_score\nfrom sklearn.neighbors import KNeighborsRegressor\nfrom sklearn.pipeline import make_pipeline\n\nN_SPLITS = 5\n\nrng = np.random.RandomState(0)\n\nX_full, y_full = fetch_california_housing(return_X_y=True)\n# ~2k samples is enough for the purpose of the example.\n# Remove the following two lines for a slower run with different error bars.\nX_full = X_full[::10]\ny_full = y_full[::10]\nn_samples, n_features = X_full.shape\n\n# Estimate the score on the entire dataset, with no missing values\nbr_estimator = BayesianRidge()\nscore_full_data = pd.DataFrame(\n cross_val_score(\n br_estimator, X_full, y_full, scoring=\"neg_mean_squared_error\", cv=N_SPLITS\n ),\n columns=[\"Full Data\"],\n)\n\n# Add a single missing value to each row\nX_missing = X_full.copy()\ny_missing = y_full\nmissing_samples = np.arange(n_samples)\nmissing_features = rng.choice(n_features, n_samples, replace=True)\nX_missing[missing_samples, missing_features] = np.nan\n\n# Estimate the score after imputation (mean and median strategies)\nscore_simple_imputer = pd.DataFrame()\nfor strategy in (\"mean\", \"median\"):\n estimator = make_pipeline(\n SimpleImputer(missing_values=np.nan, strategy=strategy), br_estimator\n )\n score_simple_imputer[strategy] = cross_val_score(\n estimator, X_missing, y_missing, scoring=\"neg_mean_squared_error\", cv=N_SPLITS\n )\n\n# Estimate the score after iterative imputation of the missing values\n# with different estimators\nestimators = [\n BayesianRidge(),\n RandomForestRegressor(\n # We tuned the hyperparameters of the RandomForestRegressor to get a good\n # enough predictive performance for a restricted execution time.\n n_estimators=4,\n max_depth=10,\n bootstrap=True,\n max_samples=0.5,\n n_jobs=2,\n random_state=0,\n ),\n make_pipeline(\n Nystroem(kernel=\"polynomial\", degree=2, random_state=0), Ridge(alpha=1e3)\n ),\n KNeighborsRegressor(n_neighbors=15),\n]\nscore_iterative_imputer = pd.DataFrame()\n# iterative imputer is sensible to the tolerance and\n# dependent on the estimator used internally.\n# we tuned the tolerance to keep this example run with limited computational\n# resources while not changing the results too much compared to keeping the\n# stricter default value for the tolerance parameter.\ntolerances = (1e-3, 1e-1, 1e-1, 1e-2)\nfor impute_estimator, tol in zip(estimators, tolerances):\n estimator = make_pipeline(\n IterativeImputer(\n random_state=0, estimator=impute_estimator, max_iter=25, tol=tol\n ),\n br_estimator,\n )\n score_iterative_imputer[impute_estimator.__class__.__name__] = cross_val_score(\n estimator, X_missing, y_missing, scoring=\"neg_mean_squared_error\", cv=N_SPLITS\n )\n\nscores = pd.concat(\n [score_full_data, score_simple_imputer, score_iterative_imputer],\n keys=[\"Original\", \"SimpleImputer\", \"IterativeImputer\"],\n axis=1,\n)\n\n# plot california housing results\nfig, ax = plt.subplots(figsize=(13, 6))\nmeans = -scores.mean()\nerrors = scores.std()\nmeans.plot.barh(xerr=errors, ax=ax)\nax.set_title(\"California Housing Regression with Different Imputation Methods\")\nax.set_xlabel(\"MSE (smaller is better)\")\nax.set_yticks(np.arange(means.shape[0]))\nax.set_yticklabels([\" w/ \".join(label) for label in means.index.tolist()])\nplt.tight_layout(pad=1)\nplt.show()"
+"# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\n\nfrom sklearn.datasets import fetch_california_housing\nfrom sklearn.ensemble import RandomForestRegressor\n\n# To use this experimental feature, we need to explicitly ask for it:\nfrom sklearn.experimental import enable_iterative_imputer # noqa: F401\nfrom sklearn.impute import IterativeImputer, SimpleImputer\nfrom sklearn.kernel_approximation import Nystroem\nfrom sklearn.linear_model import BayesianRidge, Ridge\nfrom sklearn.model_selection import cross_val_score\nfrom sklearn.neighbors import KNeighborsRegressor\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import RobustScaler\n\nN_SPLITS = 5\n\nX_full, y_full = fetch_california_housing(return_X_y=True)\n# ~2k samples is enough for the purpose of the example.\n# Remove the following two lines for a slower run with different error bars.\nX_full = X_full[::10]\ny_full = y_full[::10]\nn_samples, n_features = X_full.shape\n\n\ndef compute_score_for(X, y, imputer=None):\n # We scale data before imputation and training a target estimator,\n # because our target estimator and some of the imputers assume\n # that the features have similar scales.\n if imputer is None:\n estimator = make_pipeline(RobustScaler(), BayesianRidge())\n else:\n estimator = make_pipeline(RobustScaler(), imputer, BayesianRidge())\n return cross_val_score(\n estimator, X, y, scoring=\"neg_mean_squared_error\", cv=N_SPLITS\n )\n\n\n# Estimate the score on the entire dataset, with no missing values\nscore_full_data = pd.DataFrame(\n compute_score_for(X_full, y_full),\n columns=[\"Full Data\"],\n)\n\n# Add a single missing value to each row\nrng = np.random.RandomState(0)\nX_missing = X_full.copy()\ny_missing = y_full\nmissing_samples = np.arange(n_samples)\nmissing_features = rng.choice(n_features, n_samples, replace=True)\nX_missing[missing_samples, missing_features] = np.nan\n\n# Estimate the score after imputation (mean and median strategies)\nscore_simple_imputer = pd.DataFrame()\nfor strategy in (\"mean\", \"median\"):\n score_simple_imputer[strategy] = compute_score_for(\n X_missing, y_missing, SimpleImputer(strategy=strategy)\n )\n\n# Estimate the score after iterative imputation of the missing values\n# with different estimators\nnamed_estimators = [\n (\"Bayesian Ridge\", BayesianRidge()),\n (\n \"Random Forest\",\n RandomForestRegressor(\n # We tuned the hyperparameters of the RandomForestRegressor to get a good\n # enough predictive performance for a restricted execution time.\n n_estimators=5,\n max_depth=10,\n bootstrap=True,\n max_samples=0.5,\n n_jobs=2,\n random_state=0,\n ),\n ),\n (\n \"Nystroem + Ridge\",\n make_pipeline(\n Nystroem(kernel=\"polynomial\", degree=2, random_state=0), Ridge(alpha=1e4)\n ),\n ),\n (\n \"k-NN\",\n KNeighborsRegressor(n_neighbors=10),\n ),\n]\nscore_iterative_imputer = pd.DataFrame()\n# Iterative imputer is sensitive to the tolerance and\n# dependent on the estimator used internally.\n# We tuned the tolerance to keep this example run with limited computational\n# resources while not changing the results too much compared to keeping the\n# stricter default value for the tolerance parameter.\ntolerances = (1e-3, 1e-1, 1e-1, 1e-2)\nfor (name, impute_estimator), tol in zip(named_estimators, tolerances):\n score_iterative_imputer[name] = compute_score_for(\n X_missing,\n y_missing,\n IterativeImputer(\n random_state=0, estimator=impute_estimator, max_iter=40, tol=tol\n ),\n )\n\nscores = pd.concat(\n [score_full_data, score_simple_imputer, score_iterative_imputer],\n keys=[\"Original\", \"SimpleImputer\", \"IterativeImputer\"],\n axis=1,\n)\n\n# plot california housing results\nfig, ax = plt.subplots(figsize=(13, 6))\nmeans = -scores.mean()\nerrors = scores.std()\nmeans.plot.barh(xerr=errors, ax=ax)\nax.set_title(\"California Housing Regression with Different Imputation Methods\")\nax.set_xlabel(\"MSE (smaller is better)\")\nax.set_yticks(np.arange(means.shape[0]))\nax.set_yticklabels([\" w/ \".join(label) for label in means.index.tolist()])\nplt.tight_layout(pad=1)\nplt.show()"
 ]
 }
 ],
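
The markdown cell above points out that using :class:`~neighbors.KNeighborsRegressor` inside :class:`~impute.IterativeImputer` is not the same as distance-based KNN imputation. A minimal sketch of that distinction, using only the public scikit-learn API and a made-up 4x2 array (not part of this commit), might look like:

import numpy as np

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.neighbors import KNeighborsRegressor

# Tiny illustrative array with one missing entry.
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])

# Round-robin regression: the supplied regressor is fit on the rows where the
# feature is observed and then predicts the missing entry from the other features.
iterative = IterativeImputer(estimator=KNeighborsRegressor(n_neighbors=2), random_state=0)
print(iterative.fit_transform(X))

# Distance-based KNN imputation: the missing entry is averaged over the nearest
# rows, using a distance metric that skips missing coordinates.
print(KNNImputer(n_neighbors=2).fit_transform(X))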

dev/_downloads/54823a4305997fc1281f34ce676fb43e/plot_iterative_imputer_variants_comparison.py

Lines changed: 50 additions & 35 deletions
@@ -13,7 +13,7 @@
 imputation with :class:`~impute.IterativeImputer`:
 
 * :class:`~linear_model.BayesianRidge`: regularized linear regression
-* :class:`~ensemble.RandomForestRegressor`: Forests of randomized trees regression
+* :class:`~ensemble.RandomForestRegressor`: forests of randomized trees regression
 * :func:`~pipeline.make_pipeline` (:class:`~kernel_approximation.Nystroem`,
   :class:`~linear_model.Ridge`): a pipeline with the expansion of a degree 2
   polynomial kernel and regularized linear regression
@@ -62,28 +62,39 @@
 from sklearn.model_selection import cross_val_score
 from sklearn.neighbors import KNeighborsRegressor
 from sklearn.pipeline import make_pipeline
+from sklearn.preprocessing import RobustScaler
 
 N_SPLITS = 5
 
-rng = np.random.RandomState(0)
-
 X_full, y_full = fetch_california_housing(return_X_y=True)
 # ~2k samples is enough for the purpose of the example.
 # Remove the following two lines for a slower run with different error bars.
 X_full = X_full[::10]
 y_full = y_full[::10]
 n_samples, n_features = X_full.shape
 
+
+def compute_score_for(X, y, imputer=None):
+    # We scale data before imputation and training a target estimator,
+    # because our target estimator and some of the imputers assume
+    # that the features have similar scales.
+    if imputer is None:
+        estimator = make_pipeline(RobustScaler(), BayesianRidge())
+    else:
+        estimator = make_pipeline(RobustScaler(), imputer, BayesianRidge())
+    return cross_val_score(
+        estimator, X, y, scoring="neg_mean_squared_error", cv=N_SPLITS
+    )
+
+
 # Estimate the score on the entire dataset, with no missing values
-br_estimator = BayesianRidge()
 score_full_data = pd.DataFrame(
-    cross_val_score(
-        br_estimator, X_full, y_full, scoring="neg_mean_squared_error", cv=N_SPLITS
-    ),
+    compute_score_for(X_full, y_full),
     columns=["Full Data"],
 )
 
 # Add a single missing value to each row
+rng = np.random.RandomState(0)
 X_missing = X_full.copy()
 y_missing = y_full
 missing_samples = np.arange(n_samples)
@@ -93,48 +104,52 @@
 # Estimate the score after imputation (mean and median strategies)
 score_simple_imputer = pd.DataFrame()
 for strategy in ("mean", "median"):
-    estimator = make_pipeline(
-        SimpleImputer(missing_values=np.nan, strategy=strategy), br_estimator
-    )
-    score_simple_imputer[strategy] = cross_val_score(
-        estimator, X_missing, y_missing, scoring="neg_mean_squared_error", cv=N_SPLITS
+    score_simple_imputer[strategy] = compute_score_for(
+        X_missing, y_missing, SimpleImputer(strategy=strategy)
     )
 
 # Estimate the score after iterative imputation of the missing values
 # with different estimators
-estimators = [
-    BayesianRidge(),
-    RandomForestRegressor(
-        # We tuned the hyperparameters of the RandomForestRegressor to get a good
-        # enough predictive performance for a restricted execution time.
-        n_estimators=4,
-        max_depth=10,
-        bootstrap=True,
-        max_samples=0.5,
-        n_jobs=2,
-        random_state=0,
+named_estimators = [
+    ("Bayesian Ridge", BayesianRidge()),
+    (
+        "Random Forest",
+        RandomForestRegressor(
+            # We tuned the hyperparameters of the RandomForestRegressor to get a good
+            # enough predictive performance for a restricted execution time.
+            n_estimators=5,
+            max_depth=10,
+            bootstrap=True,
+            max_samples=0.5,
+            n_jobs=2,
+            random_state=0,
+        ),
     ),
-    make_pipeline(
-        Nystroem(kernel="polynomial", degree=2, random_state=0), Ridge(alpha=1e3)
+    (
+        "Nystroem + Ridge",
+        make_pipeline(
+            Nystroem(kernel="polynomial", degree=2, random_state=0), Ridge(alpha=1e4)
+        ),
+    ),
+    (
+        "k-NN",
+        KNeighborsRegressor(n_neighbors=10),
     ),
-    KNeighborsRegressor(n_neighbors=15),
 ]
 score_iterative_imputer = pd.DataFrame()
-# iterative imputer is sensible to the tolerance and
+# Iterative imputer is sensitive to the tolerance and
 # dependent on the estimator used internally.
-# we tuned the tolerance to keep this example run with limited computational
+# We tuned the tolerance to keep this example run with limited computational
 # resources while not changing the results too much compared to keeping the
 # stricter default value for the tolerance parameter.
 tolerances = (1e-3, 1e-1, 1e-1, 1e-2)
-for impute_estimator, tol in zip(estimators, tolerances):
-    estimator = make_pipeline(
+for (name, impute_estimator), tol in zip(named_estimators, tolerances):
+    score_iterative_imputer[name] = compute_score_for(
+        X_missing,
+        y_missing,
         IterativeImputer(
-            random_state=0, estimator=impute_estimator, max_iter=25, tol=tol
+            random_state=0, estimator=impute_estimator, max_iter=40, tol=tol
        ),
-        br_estimator,
-    )
-    score_iterative_imputer[impute_estimator.__class__.__name__] = cross_val_score(
-        estimator, X_missing, y_missing, scoring="neg_mean_squared_error", cv=N_SPLITS
     )
 
 scores = pd.concat(
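
The refactoring above funnels every evaluation through the new compute_score_for helper (scale, optionally impute, then score a BayesianRidge model). A self-contained sketch of the same pattern on synthetic data (the make_regression dataset and the randomly injected missing values are illustrative assumptions, not part of the committed example) could be:

import numpy as np

from sklearn.datasets import make_regression
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic regression data with one value knocked out per row (illustrative only).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
rng = np.random.RandomState(0)
X[np.arange(X.shape[0]), rng.choice(X.shape[1], X.shape[0])] = np.nan


def compute_score_for(X, y, imputer=None):
    # Scale before imputing and fitting, as in the commit: the downstream model
    # and some imputers expect comparably scaled features, and RobustScaler
    # passes NaNs through untouched.
    steps = [RobustScaler()] + ([imputer] if imputer is not None else []) + [BayesianRidge()]
    return cross_val_score(
        make_pipeline(*steps), X, y, scoring="neg_mean_squared_error", cv=5
    )


print(compute_score_for(X, y, SimpleImputer(strategy="median")).mean())
print(compute_score_for(X, y, IterativeImputer(random_state=0)).mean())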
