Description
Some libraries, such as LightGBM, integrate well with pandas categorical types.
I could not find a good implementation that encodes categorical features as pandas
categorical columns while preserving the categories across different datasets. I would like to
propose adding a PandasCategoricalEncoder to the feature_engine library to address this.
Is your feature request related to a problem? Please describe.
Yes. I often run into issues when working with categorical data in pandas. Converting each
dataset to a categorical type separately infers the categories from that dataset alone, so
the encoding is not guaranteed to be consistent across datasets, which can lead to errors.
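The inconsistency can be reproduced with plain pandas. The snippet below is a minimal illustration on made-up data (the column name and values are assumptions for the example, not taken from a real dataset):

```python
import pandas as pd

# Toy data, invented for illustration only.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["green", "yellow"]})

# Converting each dataset independently infers the categories
# from the values present in that dataset only.
train_cat = train["color"].astype("category")
test_cat = test["color"].astype("category")

train_categories = list(train_cat.cat.categories)  # ['blue', 'green', 'red']
test_categories = list(test_cat.cat.categories)    # ['green', 'yellow']

# The same label ends up with a different integer code in each dataset.
green_code_train = int(train_cat.cat.codes[train_cat == "green"].iloc[0])  # 1
green_code_test = int(test_cat.cat.codes[test_cat == "green"].iloc[0])     # 0
```

A model that relies on the underlying codes would therefore see "green" as a different value at prediction time.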
Describe the solution you'd like
I would like to implement a PandasCategoricalEncoder class that transforms
categorical features into pandas categorical types. The encoder will ensure that
categories are encoded consistently between training and testing datasets, and it will
handle unseen categories according to user-specified parameters.
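A rough sketch of what such an encoder could look like is shown below. It is not an existing feature_engine class: the constructor parameters (variables, unseen), the accepted values "ignore" and "raise", and the fitted attribute names are all assumptions made for illustration, and the final API may differ.

```python
import pandas as pd


class PandasCategoricalEncoder:
    """Hypothetical sketch of the proposed encoder (not a real feature_engine API)."""

    def __init__(self, variables, unseen="ignore"):
        # `unseen` is an assumed parameter: "ignore" maps unseen values to NaN,
        # "raise" raises an error when unseen values are found.
        if unseen not in ("ignore", "raise"):
            raise ValueError("unseen must be 'ignore' or 'raise'")
        self.variables = variables
        self.unseen = unseen

    def fit(self, X, y=None):
        # Learn the categories of each variable from the training data only,
        # in order of first appearance, ignoring missing values.
        self.categories_ = {
            var: list(X[var].dropna().unique()) for var in self.variables
        }
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables:
            known = self.categories_[var]
            if self.unseen == "raise":
                new_values = set(X[var].dropna().unique()) - set(known)
                if new_values:
                    raise ValueError(
                        f"Unseen categories in '{var}': {sorted(new_values)}"
                    )
            # Casting to a fixed CategoricalDtype reuses the learned categories;
            # values outside them become NaN.
            X[var] = X[var].astype(pd.CategoricalDtype(categories=known))
        return X


# Usage on toy data (invented for illustration).
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["green", "yellow", None]})

encoder = PandasCategoricalEncoder(variables=["color"]).fit(train)
train_t = encoder.transform(train)
test_t = encoder.transform(test)
```

Because both outputs share the categories learned in fit, a given label maps to the same code in every dataset. In this sketch, the unseen value "yellow" becomes missing under unseen="ignore".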
Describe alternatives you've considered
I have considered existing categorical encoding libraries, but they do not provide
such a feature.
Additional context
The PandasCategoricalEncoder will handle missing values, allow flexible management of
unseen categories, and provide an inverse transformation method to retrieve the original
values. This will make categorical data processing in pandas easier to use and more
reliable.
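As a small illustration of the design questions around missing values and inverse transformation, the snippet below uses only plain pandas on invented data. It shows one possible behaviour, not the one the encoder would necessarily adopt:

```python
import pandas as pd

# Categories assumed to have been learned from a training set.
dtype = pd.CategoricalDtype(categories=["red", "green", "blue"])

# "yellow" is unseen and None is missing.
encoded = pd.Series(["green", "yellow", None]).astype(dtype)

# Both the unseen and the missing value are encoded as code -1 (NaN).
codes = encoded.cat.codes.tolist()  # [1, -1, -1]

# Casting back to object recovers the original labels for known categories.
restored = encoded.astype(object)
```

With this approach the inverse transformation cannot tell an unseen category apart from a missing value, so the encoder would need an explicit policy (for example, a dedicated placeholder category) if that distinction has to survive a round trip.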