Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
I wish feature engineering (i.e. creating new columns from old ones) could be more efficient and convenient in pandas.
Common ways of adding features to dataframes in pandas include
- using chained .assign statements (which are hard to debug and contain many hard-to-read lambda expressions), or
- calling df['new_column'] = ... repeatedly in some add_features function. This is better for debugging purposes but still hard to read and inconvenient, as the user always has to type quotes and the word df.
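For concreteness, the two existing patterns above can be sketched as follows. The column names and the add_features helper are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"col_a": [1, 2, 3], "col_b": [10, 20, 30]})

# Pattern 1: chained .assign calls, each needing a lambda to see earlier results
chained = (
    df
    .assign(col_c=lambda d: d["col_a"] * 20)
    .assign(col_d=lambda d: d["col_c"] + d["col_b"])
)


# Pattern 2: repeated item assignment inside a helper function
def add_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()  # avoid mutating the caller's dataframe
    df["col_c"] = df["col_a"] * 20
    df["col_d"] = df["col_c"] + df["col_b"]
    return df


assigned = add_features(df)
```

Both produce the same dataframe; the difference is purely in how readable and debuggable the code is.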
In R's mutate function, the series are accessible directly from the scope, which makes code much more readable (debugging in R is something else to discuss).
Feature Description
We could easily add this functionality by providing a context manager (perhaps pd.mutate, to follow R's naming here) which temporarily moves all columns of a dataframe into the caller's locals, allows the caller to create new pd.Series inside the with block, and then (upon the context manager's exit) assembles all new or modified pd.Series into the dataframe again.
This makes feature engineering much more convenient, efficient and likely also more debuggable than using chained .assign statements (in the debugger, one could directly access all the pd.Series in that scope).
A minimal example implementation could look like the following:
# %%
import inspect

import pandas as pd

df = pd.DataFrame(
    {
        "col_a": [1, 2, 3, 4, 5],
        "col_b": [10, 20, 30, 40, 50],
    }
)
df

# %%
class mutate_df:
    def __init__(self, df: pd.DataFrame):
        self.df = df

    def __enter__(self):
        # NOTE: writing to f_locals only takes effect reliably at module level,
        # where the caller's f_locals is the globals dict.
        frame = inspect.currentframe().f_back
        self.scope_keys = self._extract_locals_keys_from_frame(frame)
        for col in self.df.columns:
            if col in self.scope_keys:
                # Maybe give a warning here?
                pass
            frame.f_locals[col] = self.df[col]

    def _extract_locals_keys_from_frame(self, frame):
        # Public (non-underscore) names currently visible in the frame
        return {
            str(key)
            for key in frame.f_locals.keys()
            if not key.startswith("_")
        }

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type:
            # Returning False lets the original exception propagate unchanged
            return False
        frame = inspect.currentframe().f_back
        current_keys = self._extract_locals_keys_from_frame(frame)
        added_keys = current_keys - self.scope_keys
        added_keys = added_keys | set(self.df.columns)
        for key in added_keys:
            try:
                self.df[key] = frame.f_locals[key]
            except Exception:
                # Skip names that cannot be stored as a column
                pass
        return False


with mutate_df(df):
    # All of df's columns are available in the scope
    # as pd.Series objects

    # Set the entire column to one value
    c = 10

    # Use columns defined previously
    col_c = col_a * 20

    # Create new columns
    rolling_mean = col_b.rolling(2).mean()
    col_b_cumsum = col_b.cumsum()

df
# %%
The drawback of this feature is that we are fiddling with the caller's locals, which is not the most elegant approach.
However, I believe that feature engineering like this is much easier to debug and makes the code more readable than using chained .assign calls or repeatedly writing df['new_feature'] = 2 * df['old_feature'] ** 2.
Therefore I think this feature would make life easier and pandas more useful (and users faster) in data science tasks.
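One concrete consequence of fiddling with the caller's locals, as far as I understand CPython, is that the trick only works at module level. Inside a function, a name injected through f_locals is not a compiled local variable, so a bare reference to it is looked up as a global instead. A minimal sketch (inject_into_caller and inside_function are hypothetical names used only for this illustration):

```python
import inspect


def inject_into_caller(name, value):
    # Same mechanism as the proposal: write into the caller's f_locals
    frame = inspect.currentframe().f_back
    frame.f_locals[name] = value


def inside_function():
    inject_into_caller("injected", 42)
    try:
        # `injected` was never assigned in this function's source, so the
        # compiler treats it as a global name and the lookup fails
        return injected
    except NameError:
        return "not visible"


result = inside_function()
```

Here result is expected to be "not visible", which suggests that a real implementation would need either a different mechanism or a clear restriction to module-level (notebook-style) usage.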
Alternative Solutions
One might want to handle the locals better here to make the usage of this feature less error-prone.
Perhaps one would want to cache the previous locals and then only have the dataframe's columns as the locals in the caller's scope.
This would make debugging even cleaner, because if a user sets a breakpoint inside such a with pd.mutate statement, that user sees all the columns in the scope's locals directly instead of having to inspect the dataframe's column values in the debugger.
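A rough sketch of the caching idea, under some assumptions of mine: the hypothetical mutate_namespace class below works on an explicitly passed dict instead of inspecting frames, caches only the names that the columns would shadow (a milder variant than hiding everything else), and writes back only pd.Series values on exit.

```python
import pandas as pd


class mutate_namespace:
    """Hypothetical variant operating on an explicit namespace dict."""

    def __init__(self, df: pd.DataFrame, namespace: dict):
        self.df = df
        self.ns = namespace

    def __enter__(self):
        self.before = set(self.ns)
        # Cache caller values that the columns are about to shadow
        self.shadowed = {c: self.ns[c] for c in self.df.columns if c in self.ns}
        for col in self.df.columns:
            self.ns[col] = self.df[col]
        return self.ns

    def __exit__(self, exc_type, exc_val, exc_tb):
        columns = list(self.df.columns)
        new_names = [k for k in self.ns if k not in self.before]
        touched = list(dict.fromkeys(columns + new_names))
        if exc_type is None:
            # Assumption: only pd.Series values become columns
            for key in touched:
                if isinstance(self.ns.get(key), pd.Series):
                    self.df[key] = self.ns[key]
        # Remove injected and newly created names, then restore cached values
        for key in touched:
            self.ns.pop(key, None)
        self.ns.update(self.shadowed)
        return False  # never swallow exceptions


df = pd.DataFrame({"col_a": [1, 2, 3]})
ns = {"col_a": "caller value"}  # a name that collides with a column

with mutate_namespace(df, ns) as scope:
    scope["col_c"] = scope["col_a"] * 20
```

After the block, df has the new column col_c and ns again holds only the caller's original col_a value. At module level, passing globals() as the namespace should give the bare-name syntax from the proposal, though I have not explored how that interacts with debuggers.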
Additional Context
No response