Skip to content

ENH: pandas mutate, add R's mutate functionality to enable users to easily create new columns in data frames #56499

Open
@data-stepper

Description

@data-stepper

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish feature engineering (i.e. creating new columns from old ones) could be more efficient and convenient in pandas.
Mainly, common ways of adding features to dataframes in pandas include

  1. using chained .assign statements (which are hard to debug and contain many hard-to-read lambda expressions) or
  2. calling df['new_column'] = ... repeatedly in some add_features function, this is better for debugging purposes but also hard to read and inconvenient as the user always has to type quotes and the word df.

In R's mutate function, the series are accessible directly from the scope which makes code much more readable (debugging in R is something else to discuss).

Feature Description

We could easily add this functionality by providing a context manager (perhaps pd.mutate, to follow R's naming here) which temporarily moves all columns of a dataframe into the caller's locals, allows the caller to create new pd.Series while calling and then (upon the context manager's exit) all those new pd.Series (or the modified old ones) could be formed to a data frame again.

This makes feature engineering much more convenient, efficient and likely also more debuggable that using chained .assign statements (in the debugger, one could directly access all the pd.Series in that scope).
A minimal example implementation could look like the following:

# %%

from multiprocessing import context
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "col_a": [1, 2, 3, 4, 5],
        "col_b": [10, 20, 30, 40, 50],
    }
)

df

# %%

import inspect
from contextlib import contextmanager
from copy import deepcopy

class mutate_df:
    def __init__(self, df: pd.DataFrame):
        self.df = df

    def __enter__(self):
        frame = inspect.currentframe().f_back
        self.scope_keys = deepcopy(self._extract_locals_keys_from_frame(frame))

        for col in self.df.columns:
            if col in self.scope_keys:
                # Maybe give a warning here?
                pass

            frame.f_locals[col] = self.df[col]

    def _extract_locals_keys_from_frame(self, frame):
        s = {
            str(key)
            for key in frame.f_locals.keys()
            if not key.startswith("_")
        }
        return s

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type:
            raise exc_type(exc_val)

        frame = inspect.currentframe().f_back
        current_keys = self._extract_locals_keys_from_frame(frame)

        added_keys = current_keys - self.scope_keys
        added_keys = set.union(added_keys, set(self.df.columns))

        for key in added_keys:
            try:
                val = frame.f_locals[key]
                self.df[key] = val

            except:
                pass

        return True

with mutate_df(df):
    # All of df's columns are available in the scope
    # as pd.Series objects

    # Set the entire column to one value
    c = 10
    # Use columns defined previously
    col_c = col_a * 20

    # Create new columns
    rolling_mean = col_b.rolling(2).mean()
    col_b_cumsum = col_b.cumsum()

df


# %%

The drawback of this feature is that we are fiddling with the caller's locals which is not the most elegant.
However, I believe that feature engineering like this is much better to debug and makes the code more readable (than using chained .assigns or repeatedly calling df['new_feature'] = 2 * df['old_feature'] ** 2).

Therefore I think this feature would make life easier and pandas more useful (and users faster) in data science tasks.

Alternative Solutions

One might want to handle the locals better here to make the usage of this feature less error-prone.
Perhaps one would want to cache previous locals and then only have the dataframe's columns as the locals in the caller's scope.

This would make debugging even more clean, because if a user sets a breakpoint in such a with pd.mutate statement, then that user sees all the columns in the scope's locals clearly instead of having to inspect the dataframe's columns values in the debugger.

Additional Context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    EnhancementNeeds TriageIssue that has not been reviewed by a pandas team member

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions