Commit ae009d8

Merge branch 'feature-engine:main' into profiling_functionality
2 parents: e98fe3c + feddb06

File tree

24 files changed: +550 −18 lines


.circleci/config.yml

Lines changed: 1 addition & 1 deletion
@@ -151,4 +151,4 @@ workflows:
       filters:
         branches:
           only:
-            - 1.5.X
+            - 1.6.X

docs/whats_new/index.rst

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ Find out what's new in each new version release.
 .. toctree::
    :maxdepth: 2
 
+   v_160
    v_150
    v_140
    v_130

docs/whats_new/v_160.rst

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
+Version 1.6.X
+=============
+
+Version 1.6.0
+-------------
+
+Deployed: 16th March 2023
+
+Contributors
+~~~~~~~~~~~~
+
+- `Gleb Levitski <https://github.com/GLevv>`_
+- `Morgan Sell <https://github.com/Morgan-Sell>`_
+- `Alfonso Tobar <https://github.com/datacubeR>`_
+- `Nodar Okroshiashvili <https://github.com/Okroshiashvili>`_
+- `Luís Seabra <https://github.com/luismavs>`_
+- `Kyle Gilde <https://github.com/kylegilde>`_
+- `Soledad Galli <https://github.com/solegalli>`_
+
+In this release, we make Feature-engine transformers compatible with the `set_output`
+API from Scikit-learn, which was released in version 1.2.0. We also make Feature-engine
+compatible with the newest direction of pandas by removing the `inplace` functionality
+that our transformers used under the hood.
+
+We introduce a major change: most of the **categorical encoders can now encode variables
+even if they have missing data**.
+
+We are also releasing **3 brand new transformers**: one for discretization, one for feature
+selection, and one for operations between datetime variables.
+
+We also made a major improvement in the performance of `DropDuplicateFeatures()`, along
+with some smaller bug fixes here and there.
+
+We'd like to thank all contributors for fixing bugs and expanding the functionality
+and documentation of Feature-engine.
+
+Thank you so much to all contributors and to those of you who created issues flagging bugs or
+requesting new functionality.
+
+New transformers
+~~~~~~~~~~~~~~~~
+
+- **ProbeFeatureSelection**: introduces random features and selects variables whose importance is greater than that of the random ones (`Morgan Sell <https://github.com/Morgan-Sell>`_ and `Soledad Galli <https://github.com/solegalli>`_)
+- **DatetimeSubtraction**: creates new features by subtracting datetime variables (`Kyle Gilde <https://github.com/kylegilde>`_ and `Soledad Galli <https://github.com/solegalli>`_)
+- **GeometricWidthDiscretiser**: sorts continuous variables into intervals determined by geometric progression (`Gleb Levitski <https://github.com/GLevv>`_)
+
+New functionality
+~~~~~~~~~~~~~~~~~
+
+- Allow categorical encoders to encode variables with NaN (`Soledad Galli <https://github.com/solegalli>`_)
+- Make transformers compatible with the new `set_output` functionality from sklearn (`Soledad Galli <https://github.com/solegalli>`_)
+- `ArbitraryDiscretiser()` now includes the lowest limits in the intervals (`Soledad Galli <https://github.com/solegalli>`_)
+
+New modules
+~~~~~~~~~~~
+
+- New **Datasets** module with functions to load specific datasets (`Alfonso Tobar <https://github.com/datacubeR>`_)
+- New **variable_handling** module with functions to automatically select numerical, categorical, or datetime variables (`Soledad Galli <https://github.com/solegalli>`_)
+
+Bug fixes
+~~~~~~~~~
+
+- Fixed a bug in `DropFeatures()` (`Luís Seabra <https://github.com/luismavs>`_)
+- Fixed a bug in `RecursiveFeatureElimination()` caused when only 1 feature remained in the data (`Soledad Galli <https://github.com/solegalli>`_)
+
+Documentation
+~~~~~~~~~~~~~
+
+- Add example code snippets to the selection module API docs (`Alfonso Tobar <https://github.com/datacubeR>`_)
+- Add example code snippets to the outlier module API docs (`Alfonso Tobar <https://github.com/datacubeR>`_)
+- Add example code snippets to the transformation module API docs (`Alfonso Tobar <https://github.com/datacubeR>`_)
+- Add example code snippets to the time series module API docs (`Alfonso Tobar <https://github.com/datacubeR>`_)
+- Add example code snippets to the preprocessing module API docs (`Alfonso Tobar <https://github.com/datacubeR>`_)
+- Add example code snippets to the wrapper module API docs (`Alfonso Tobar <https://github.com/datacubeR>`_)
+- Updated documentation using the new Datasets module (`Alfonso Tobar <https://github.com/datacubeR>`_ and `Soledad Galli <https://github.com/solegalli>`_)
+- Reorganized Readme badges (`Gleb Levitski <https://github.com/GLevv>`_)
+- New Jupyter notebooks for `GeometricWidthDiscretiser` (`Gleb Levitski <https://github.com/GLevv>`_)
+- Fixed typos (`Gleb Levitski <https://github.com/GLevv>`_)
+- Removed examples using the Boston house dataset (`Soledad Galli <https://github.com/solegalli>`_)
+- Updated the sponsor page and contribute page (`Soledad Galli <https://github.com/solegalli>`_)
+
+
+Deprecations
+~~~~~~~~~~~~
+
+- The class `PRatioEncoder` is no longer supported and was removed from the API (`Soledad Galli <https://github.com/solegalli>`_)
+
+Code improvements
+~~~~~~~~~~~~~~~~~
+
+- Massive improvement in the performance (speed) of `DropDuplicateFeatures()` (`Nodar Okroshiashvili <https://github.com/Okroshiashvili>`_)
+- Removed `inplace` and fixed other issues related to pandas' new direction (`Luís Seabra <https://github.com/luismavs>`_)
+- Moved most docstrings to a dedicated docstrings module (`Soledad Galli <https://github.com/solegalli>`_)
+- Unnested the tests for encoders (`Soledad Galli <https://github.com/solegalli>`_)

feature_engine/VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-1.5.2
+1.6.0

feature_engine/datetime/datetime_subtraction.py

Lines changed: 1 addition & 1 deletion
@@ -318,7 +318,7 @@ def _sub(self, dt_df: pd.DataFrame):
         new_df[new_varnames] = (
             dt_df[self.variables_]
             .sub(dt_df[reference], axis=0)
-            .apply(lambda s: s / np.timedelta64(1, self.output_unit))
+            .div(np.timedelta64(1, self.output_unit).astype("timedelta64[ns]"))
         )
 
         if self.new_variables_names is not None:
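The patched line converts timedelta results into a float count of the output unit by dividing by one unit of that resolution. A minimal standalone sketch of the same conversion (hypothetical column names, not from the library):

```python
import numpy as np
import pandas as pd

# Two datetime columns; subtracting them yields timedelta64[ns] values.
df = pd.DataFrame({
    "start": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    "end": pd.to_datetime(["2023-01-03", "2023-01-05"]),
})
delta = df["end"].sub(df["start"], axis=0)

# Dividing by one unit of the target resolution, cast to nanoseconds to
# match pandas' datetime resolution, turns the timedeltas into floats.
# This is what the new `.div` call does column-wise.
days = delta.div(np.timedelta64(1, "D").astype("timedelta64[ns]"))
print(days.tolist())  # [2.0, 3.0]
```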

feature_engine/imputation/drop_missing_data.py

Lines changed: 1 addition & 1 deletion
@@ -205,7 +205,7 @@ def return_na_data(self, X: pd.DataFrame) -> pd.DataFrame:
             idx = pd.isnull(X[self.variables_]).mean(axis=1) >= self.threshold
             idx = idx[idx]
         else:
-            idx = pd.isnull(X[self.variables_]).any(1)
+            idx = pd.isnull(X[self.variables_]).any(axis=1)
             idx = idx[idx]
 
         return X.loc[idx.index, :]
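The fix replaces the positional axis argument with a keyword, since passing `axis` positionally to `DataFrame.any` was deprecated and later removed in pandas. A small sketch of the row-wise missing-data mask the method builds:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

# Row-wise "any value missing?" mask. The keyword form `.any(axis=1)`
# is required in recent pandas; `.any(1)` no longer works.
idx = pd.isnull(X).any(axis=1)
print(idx.tolist())  # [True, True, False]
```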

feature_engine/outliers/artbitrary.py

Lines changed: 21 additions & 0 deletions
@@ -91,6 +91,27 @@ class ArbitraryOutlierCapper(BaseOutlier):
     transform:
         Cap the variables.
 
+    Examples
+    --------
+
+    >>> import pandas as pd
+    >>> from feature_engine.outliers import ArbitraryOutlierCapper
+    >>> X = pd.DataFrame(dict(x1 = [1,2,3,4,5,6,7,8,9,10]))
+    >>> aoc = ArbitraryOutlierCapper(max_capping_dict=dict(x1 = 8),
+    >>>                              min_capping_dict=dict(x1 = 2))
+    >>> aoc.fit(X)
+    >>> aoc.transform(X)
+       x1
+    0   2
+    1   2
+    2   3
+    3   4
+    4   5
+    5   6
+    6   7
+    7   8
+    8   8
+    9   8
     """
 
     def __init__(
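For a single column, the capping in the docstring example above behaves like a pandas `clip`, with the `min_capping_dict`/`max_capping_dict` values acting as the clip limits. A minimal sketch of that equivalence (not the library's internal implementation):

```python
import pandas as pd

X = pd.DataFrame({"x1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

# Capping x1 to the interval [2, 8], mirroring the docstring example.
capped = X["x1"].clip(lower=2, upper=8)
print(capped.tolist())  # [2, 2, 3, 4, 5, 6, 7, 8, 8, 8]
```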

feature_engine/outliers/trimmer.py

Lines changed: 55 additions & 0 deletions
@@ -89,6 +89,61 @@ class OutlierTrimmer(WinsorizerBase):
     transform:
         Remove outliers.
 
+    Examples
+    --------
+
+    >>> import pandas as pd
+    >>> from feature_engine.outliers import OutlierTrimmer
+    >>> X = pd.DataFrame(dict(x = [0.49671,
+    >>>                            -0.1382,
+    >>>                            0.64768,
+    >>>                            1.52302,
+    >>>                            -0.2341,
+    >>>                            -17.2341,
+    >>>                            1.57921,
+    >>>                            0.76743,
+    >>>                            -0.4694,
+    >>>                            0.54256]))
+    >>> ot = OutlierTrimmer(capping_method='gaussian', tail='left', fold=3)
+    >>> ot.fit(X)
+    >>> ot.transform(X)
+               x
+    0    0.49671
+    1   -0.13820
+    2    0.64768
+    3    1.52302
+    4   -0.23410
+    5  -17.23410
+    6    1.57921
+    7    0.76743
+    8   -0.46940
+    9    0.54256
+
+    >>> import pandas as pd
+    >>> from feature_engine.outliers import OutlierTrimmer
+    >>> X = pd.DataFrame(dict(x = [0.49671,
+    >>>                            -0.1382,
+    >>>                            0.64768,
+    >>>                            1.52302,
+    >>>                            -0.2341,
+    >>>                            -17.2341,
+    >>>                            1.57921,
+    >>>                            0.76743,
+    >>>                            -0.4694,
+    >>>                            0.54256]))
+    >>> ot = OutlierTrimmer(capping_method='mad', tail='left', fold=3)
+    >>> ot.fit(X)
+    >>> ot.transform(X)
+              x
+    0   0.49671
+    1  -0.13820
+    2   0.64768
+    3   1.52302
+    4  -0.23410
+    6   1.57921
+    7   0.76743
+    8  -0.46940
+    9   0.54256
     """
 
     def transform(self, X: pd.DataFrame) -> pd.DataFrame:
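The two docstring examples above show that the Gaussian bound keeps the -17.2341 outlier while the MAD-based bound trims it. A standalone sketch of a MAD-based left-tail bound on the same data: the 1.4826 factor rescales the median absolute deviation to estimate the standard deviation under normality, and is an assumption here; feature-engine's `'mad'` method may use a different constant internally.

```python
import pandas as pd

x = pd.Series([0.49671, -0.1382, 0.64768, 1.52302, -0.2341,
               -17.2341, 1.57921, 0.76743, -0.4694, 0.54256])

# Median absolute deviation, rescaled to a normal-consistent estimate.
med = x.median()
mad = 1.4826 * (x - med).abs().median()

# Left-tail cutoff at median - 3 * MAD, analogous to fold=3, tail='left'.
lower = med - 3 * mad
trimmed = x[x >= lower]
print(len(trimmed))  # 9: only the -17.2341 outlier falls below the bound
```

Because the MAD ignores the magnitude of extreme values, the bound stays tight and the outlier is removed, whereas the mean/std used by the Gaussian rule are themselves inflated by the outlier.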

feature_engine/outliers/winsorizer.py

Lines changed: 42 additions & 0 deletions
@@ -97,6 +97,48 @@ class Winsorizer(WinsorizerBase):
     transform:
         Cap the variables.
 
+    Examples
+    --------
+
+    >>> import numpy as np
+    >>> import pandas as pd
+    >>> from feature_engine.outliers import Winsorizer
+    >>> np.random.seed(42)
+    >>> X = pd.DataFrame(dict(x = np.random.normal(size = 10)))
+    >>> wz = Winsorizer(capping_method='mad', tail='both', fold=3)
+    >>> wz.fit(X)
+    >>> wz.transform(X)
+              x
+    0  0.496714
+    1 -0.138264
+    2  0.647689
+    3  1.523030
+    4 -0.234153
+    5 -0.234137
+    6  1.579213
+    7  0.767435
+    8 -0.469474
+    9  0.542560
+
+    >>> import numpy as np
+    >>> import pandas as pd
+    >>> from feature_engine.outliers import Winsorizer
+    >>> np.random.seed(42)
+    >>> X = pd.DataFrame(dict(x = np.random.normal(size = 10)))
+    >>> wz = Winsorizer(capping_method='mad', tail='both', fold=3)
+    >>> wz.fit(X)
+    >>> wz.transform(X)
+              x
+    0  0.496714
+    1 -0.138264
+    2  0.647689
+    3  1.523030
+    4 -0.234153
+    5 -0.234137
+    6  1.579213
+    7  0.767435
+    8 -0.469474
+    9  0.542560
     """
 
     def __init__(

feature_engine/preprocessing/match_categories.py

Lines changed: 21 additions & 0 deletions
@@ -88,6 +88,27 @@ class MatchCategories(
 
     transform:
         Enforce the type of categorical variables as dtype `categorical`.
+
+    Examples
+    --------
+
+    >>> import pandas as pd
+    >>> from feature_engine.preprocessing import MatchCategories
+    >>> X_train = pd.DataFrame(dict(x1 = ["a","b","c"], x2 = [4,5,6]))
+    >>> X_test = pd.DataFrame(dict(x1 = ["c","b","a","d"], x2 = [5,6,4,7]))
+    >>> mc = MatchCategories(missing_values="ignore")
+    >>> mc.fit(X_train)
+    >>> mc.transform(X_train)
+      x1  x2
+    0  a   4
+    1  b   5
+    2  c   6
+    >>> mc.transform(X_test)
+        x1  x2
+    0    c   5
+    1    b   6
+    2    a   4
+    3  NaN   7
     """
 
     def __init__(
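The behavior in the docstring example above can be sketched with pandas' `CategoricalDtype`: casting test data to the categories learned from the train set turns unseen values into NaN. This is only an illustration of the idea, not the transformer's internal code:

```python
import pandas as pd

# Categories "learned" from the train data.
dtype = pd.CategoricalDtype(categories=["a", "b", "c"])

# "d" was not seen at fit time, so it becomes NaN after the cast.
test = pd.Series(["c", "b", "a", "d"]).astype(dtype)
print(test.isna().tolist())  # [False, False, False, True]
```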

feature_engine/preprocessing/match_columns.py

Lines changed: 44 additions & 0 deletions
@@ -100,6 +100,50 @@ class MatchVariables(BaseEstimator, TransformerMixin, GetFeatureNamesOutMixin):
 
     transform:
         Add or delete variables to match those observed in the train set.
+
+    Examples
+    --------
+
+    >>> import pandas as pd
+    >>> from feature_engine.preprocessing import MatchVariables
+    >>> X_train = pd.DataFrame(dict(x1 = ["a","b","c"], x2 = [4,5,6]))
+    >>> X_test = pd.DataFrame(dict(x1 = ["c","b","a","d"],
+    >>>                            x2 = [5,6,4,7],
+    >>>                            x3 = [1,1,1,1]))
+    >>> mv = MatchVariables(missing_values="ignore")
+    >>> mv.fit(X_train)
+    >>> mv.transform(X_train)
+      x1  x2
+    0  a   4
+    1  b   5
+    2  c   6
+    >>> mv.transform(X_test)
+    The following variables are dropped from the DataFrame: ['x3']
+      x1  x2
+    0  c   5
+    1  b   6
+    2  a   4
+    3  d   7
+
+    >>> import pandas as pd
+    >>> from feature_engine.preprocessing import MatchVariables
+    >>> X_train = pd.DataFrame(dict(x1 = ["a","b","c"],
+    >>>                             x2 = [4,5,6], x3 = [1,1,1]))
+    >>> X_test = pd.DataFrame(dict(x1 = ["c","b","a","d"], x2 = [5,6,4,7]))
+    >>> mv = MatchVariables(missing_values="ignore")
+    >>> mv.fit(X_train)
+    >>> mv.transform(X_train)
+      x1  x2  x3
+    0  a   4   1
+    1  b   5   1
+    2  c   6   1
+    >>> mv.transform(X_test)
+    The following variables are added to the DataFrame: ['x3']
+      x1  x2  x3
+    0  c   5 NaN
+    1  b   6 NaN
+    2  a   4 NaN
+    3  d   7 NaN
     """
 
     def __init__(
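The column alignment in the second docstring example above can be sketched with `DataFrame.reindex`: aligning the test set to the train schema adds the missing columns filled with NaN (and would drop extra ones). An illustration of the idea, not the transformer's internal code:

```python
import pandas as pd

X_train = pd.DataFrame({"x1": ["a", "b", "c"], "x2": [4, 5, 6], "x3": [1, 1, 1]})
X_test = pd.DataFrame({"x1": ["c", "b", "a", "d"], "x2": [5, 6, 4, 7]})

# Reindex test columns to the train schema: x3 is added, filled with NaN.
aligned = X_test.reindex(columns=X_train.columns)
print(aligned.columns.tolist())    # ['x1', 'x2', 'x3']
print(aligned["x3"].isna().all())  # True
```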

feature_engine/timeseries/forecasting/expanding_window_features.py

Lines changed: 22 additions & 0 deletions
@@ -117,6 +117,28 @@ class ExpandingWindowFeatures(BaseForecastTransformer):
     pandas.expanding
     pandas.aggregate
     pandas.shift
+
+    Examples
+    --------
+
+    >>> import pandas as pd
+    >>> from feature_engine.timeseries.forecasting import ExpandingWindowFeatures
+    >>> X = pd.DataFrame(dict(date = ["2022-09-18",
+    >>>                               "2022-09-19",
+    >>>                               "2022-09-20",
+    >>>                               "2022-09-21",
+    >>>                               "2022-09-22"],
+    >>>                       x1 = [1,2,3,4,5],
+    >>>                       x2 = [6,7,8,9,10]
+    >>>                       ))
+    >>> ewf = ExpandingWindowFeatures()
+    >>> ewf.fit_transform(X)
+             date  x1  x2  x1_expanding_mean  x2_expanding_mean
+    0  2022-09-18   1   6                NaN                NaN
+    1  2022-09-19   2   7                1.0                6.0
+    2  2022-09-20   3   8                1.5                6.5
+    3  2022-09-21   4   9                2.0                7.0
+    4  2022-09-22   5  10                2.5                7.5
     """
 
     def __init__(
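The expanding-mean columns in the docstring example above can be sketched with plain pandas: an expanding aggregation shifted by one row, so each row only sees strictly past values (this reproduces the example's numbers, but it is a sketch, not the transformer's internal code):

```python
import pandas as pd

X = pd.DataFrame({"x1": [1, 2, 3, 4, 5]})

# Expanding mean of all past rows, shifted one step so the current row
# is excluded; the first row has no history, hence NaN.
feat = X["x1"].expanding().mean().shift(1)
print(feat.tolist()[1:])  # [1.0, 1.5, 2.0, 2.5]
```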
