Commit bb8b39a

Add/feature extraction (#193)
* init & add feature extraction models
* add `FeatureExtractorTransporter` to `Ensemble` model chain
* add `FeatureHasher` test case
* add `DictVectorizer` test case
* add testcase runner for feature extraction models
* add `util.py` for feature extraction model tests
* implement `FeatureExtractorTransporter`
* fix bug for when class constructor is passed not the concrete value
* `CHANGELOG.md` updated
* `autopep8.sh` applied
* remove trailing whitespaces
* remove trailing whitespaces
* `CHANGELOG.md` updated
* `SUPPORTED_MODELS.md` updated
* update name in `pymilo_param.py`
* integrate FeatureExtractor into serialization & deserialization of ensemble models
* add `TfidfVectorizer` testcase
* `TfidfTransformer` testcase
* `PatchExtractor` testcase
* `HashingVectorizer` testcase
* `CountVectorizer` testcase
* `RandomStateTransporter` added
* `image` and `text` feature extractors added
* enhance tuple serializer & deserializer
* add new testcase for pipeline model with feature extractors inside
* add other testcases to testcase runner
* `CHANGELOG.md` updated
* add PIL to dev-req
* add PIL to dev-req
* remove execution
* update `args` serialized type in `AttributeCallPayload`
* update PIL version to support old python versions
* remove `get_feature_names_out`
* add `set` support
* remove `print`s
* find `csr_matrix` attribute
* add serializing & deserializing `csr_matrix` datatype for older python versions
* refactor
* retrieve csr_matrix from its serialized `__dict__`
* init with shape
* fix issue with key
* enhance import to be broader
* update tests
* remove trailing whitespaces
* `autopep8.sh` applied
* `SUPPORTED_MODELS.md` updated
* `CHANGELOG.md` updated
* remove extra empty lines
* update docstring
* minor refactorings
* update docstring
* remove `.keys()`
* update `Last Update` date
1 parent fab13e1 commit bb8b39a

18 files changed, +534 -13 lines changed

CHANGELOG.md

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,8 +6,25 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
66

77
## [Unreleased]
88
### Added
9+
- `TfidfVectorizer` feature extractor
10+
- `TfidfTransformer` feature extractor
11+
- `HashingVectorizer` feature extractor
12+
- `CountVectorizer` feature extractor
13+
- `PatchExtractor` feature extractor
14+
- `DictVectorizer` feature extractor
15+
- `FeatureHasher` feature extractor
16+
- `FeatureExtractorTransporter` Transporter
17+
- `FeatureExtraction` support added to Ensemble chain
18+
- FeatureExtraction params initialized in `pymilo_param.py`
19+
- Feature Extraction models test runner
920
- Zenodo badge to `README.md`
1021
### Changed
22+
- `get_deserialized_list` in `GeneralDataStructureTransporter`
23+
- `get_deserialized_dict` in `GeneralDataStructureTransporter`
24+
- `serialize` in `GeneralDataStructureTransporter`
25+
- `serialize_tuple` in `GeneralDataStructureTransporter`
26+
- `AttributeCallPayload` in `streaming.communicator.py`
27+
- `get_deserialized_regular_primary_types` in `GeneralDataStructureTransporter`
1128
- Test system modified
1229
## [1.2] - 2025-01-22
1330
### Added

SUPPORTED_MODELS.md

Lines changed: 46 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# Supported Models
22

3-
**Last Update: 2024-10-31**
3+
**Last Update: 2025-02-15**
44

55

66
<h2 id="scikit-learn">Scikit-Learn</h2>
@@ -733,3 +733,48 @@
733733
<td>>=1.1</td>
734734
</tr>
735735
</table>
736+
737+
<h3 id="scikit-learn-feature-extraction">Feature Extraction Modules</h3>
738+
📚 <a href="https://scikit-learn.org/stable/api/sklearn.feature_extraction.html" target="_blank"><b>Models Document</b></a>
739+
<table>
740+
<tr align="center">
741+
<th>ID</th>
742+
<th>Model Name</th>
743+
<th>PyMilo Version</th>
744+
</tr>
745+
<tr align="center">
746+
<td>1</td>
747+
<td><b>DictVectorizer</b></td>
748+
<td>>=1.3</td>
749+
</tr>
750+
<tr align="center">
751+
<td>2</td>
752+
<td><b>FeatureHasher</b></td>
753+
<td>>=1.3</td>
754+
</tr>
755+
<tr align="center">
756+
<td>3</td>
757+
<td><b>PatchExtractor</b></td>
758+
<td>>=1.3</td>
759+
</tr>
760+
<tr align="center">
761+
<td>4</td>
762+
<td><b>CountVectorizer</b></td>
763+
<td>>=1.3</td>
764+
</tr>
765+
<tr align="center">
766+
<td>5</td>
767+
<td><b>HashingVectorizer</b></td>
768+
<td>>=1.3</td>
769+
</tr>
770+
<tr align="center">
771+
<td>6</td>
772+
<td><b>TfidfTransformer</b></td>
773+
<td>>=1.3</td>
774+
</tr>
775+
<tr align="center">
776+
<td>7</td>
777+
<td><b>TfidfVectorizer</b></td>
778+
<td>>=1.3</td>
779+
</tr>
780+
</table>

dev-requirements.txt

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,3 +12,4 @@ bandit>=1.5.1
1212
pydocstyle>=3.0.0
1313
pytest>=4.3.1
1414
pytest-cov>=2.6.1
15+
Pillow>=8.4.0

pymilo/chains/ensemble_chain.py

Lines changed: 15 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@
77
from numpy import ndarray, asarray
88

99
from ..chains.chain import AbstractChain
10+
from ..transporters.feature_extraction_transporter import FeatureExtractorTransporter
1011
from ..transporters.binmapper_transporter import BinMapperTransporter
1112
from ..transporters.bunch_transporter import BunchTransporter
1213
from ..transporters.transporter import Command
@@ -21,6 +22,7 @@
2122
from .util import get_concrete_transporter
2223

2324
ENSEMBLE_CHAIN = {
25+
"FeatureExtractorTransporter": FeatureExtractorTransporter(),
2426
"PreprocessingTransporter": PreprocessingTransporter(),
2527
"GeneralDataStructureTransporter": GeneralDataStructureTransporter(),
2628
"TreePredictorTransporter": TreePredictorTransporter(),
@@ -48,16 +50,19 @@ def serialize(self, ensemble_object):
4850
self._transporters[transporter].transport(
4951
ensemble_object, Command.SERIALIZE)
5052

53+
pt = ENSEMBLE_CHAIN["PreprocessingTransporter"]
54+
fe = ENSEMBLE_CHAIN["FeatureExtractorTransporter"]
5155
for key, value in ensemble_object.__dict__.items():
5256
if isinstance(value, list):
5357
has_inner_tuple_with_ml_model = False
54-
pt = PreprocessingTransporter()
5558
for idx, item in enumerate(value):
5659
if isinstance(item, tuple):
5760
listed_tuple = list(item)
5861
for inner_idx, inner_item in enumerate(listed_tuple):
5962
if pt.is_preprocessing_module(inner_item):
6063
listed_tuple[inner_idx] = pt.serialize_pre_module(inner_item)
64+
elif fe.is_fe_module(inner_item):
65+
listed_tuple[inner_idx] = fe.serialize_fe_module(inner_item)
6166
else:
6267
has_inner_model, result = serialize_possible_ml_model(inner_item)
6368
if has_inner_model:
@@ -117,17 +122,23 @@ def deserialize(self, ensemble, is_inner_model=False):
117122
self._transporters[transporter].transport(
118123
ensemble, Command.DESERIALIZE, is_inner_model)
119124

125+
pt = ENSEMBLE_CHAIN["PreprocessingTransporter"]
126+
fe = ENSEMBLE_CHAIN["FeatureExtractorTransporter"]
120127
for key, value in data.items():
121128
if isinstance(value, dict):
122129
if check_str_in_iterable("pymiloed-data-structure",
123130
value) and value["pymiloed-data-structure"] == "list of (str, estimator) tuples":
124131
listed_tuples = value["pymiloed-data"]
125132
list_of_tuples = []
126-
pt = PreprocessingTransporter()
127133
for listed_tuple in listed_tuples:
128134
name, serialized_model = listed_tuple
129-
retrieved_model = pt.deserialize_pre_module(serialized_model) if pt.is_preprocessing_module(
130-
serialized_model) else deserialize_possible_ml_model(serialized_model)[1]
135+
retrieved_model = None
136+
if pt.is_preprocessing_module(serialized_model):
137+
retrieved_model = pt.deserialize_pre_module(serialized_model)
138+
elif fe.is_fe_module(serialized_model):
139+
retrieved_model = fe.deserialize_fe_module(serialized_model)
140+
else:
141+
retrieved_model = deserialize_possible_ml_model(serialized_model)[1]
131142
list_of_tuples.append(
132143
(name, retrieved_model)
133144
)
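The deserialization hunk above replaces a two-way conditional expression with a three-way dispatch: preprocessing modules first, then feature extractors, then the generic ML-model fallback. A minimal sketch of that ordering follows, using plain dictionaries rather than PyMilo's transporters. `restore_step` and `PRE_TYPE_KEY` are hypothetical names introduced only for illustration; only the feature extraction key is taken from the diff.

```python
FE_TYPE_KEY = "pymilo-feature_extraction-type"    # key that appears in this commit
PRE_TYPE_KEY = "hypothetical-preprocessing-type"  # placeholder, NOT PyMilo's real key


def restore_step(serialized):
    """Classify one serialized pipeline step, checking handlers in a fixed order."""
    if isinstance(serialized, dict) and PRE_TYPE_KEY in serialized:
        return ("preprocessing", serialized[PRE_TYPE_KEY])
    elif isinstance(serialized, dict) and FE_TYPE_KEY in serialized:
        return ("feature_extraction", serialized[FE_TYPE_KEY])
    # Anything else is treated as a possible ML model, as in the final else branch.
    return ("ml_model", serialized)


# JSON has no tuples, so each (name, estimator) pair arrives as a two-item list.
steps = [
    ["scale", {PRE_TYPE_KEY: "StandardScaler"}],
    ["tfidf", {FE_TYPE_KEY: "TfidfVectorizer"}],
    ["clf", {"coef_": [0.1, 0.2]}],
]
restored = [(name, restore_step(payload)) for name, payload in steps]
```

Because the branches are checked in order, a payload carrying both markers would be handled as a preprocessing module. The sketch assumes the two marker keys never occur together.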

pymilo/pymilo_param.py

Lines changed: 20 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -13,7 +13,8 @@
1313
import sklearn.ensemble as ensemble
1414
import sklearn.pipeline as pipeline
1515
import sklearn.preprocessing as preprocessing
16-
from sklearn.cross_decomposition import PLSRegression, PLSCanonical, CCA
16+
import sklearn.cross_decomposition as cross_decomposition
17+
import sklearn.feature_extraction as feature_extraction
1718

1819
quantile_regressor_support = False
1920
try:
@@ -246,10 +247,25 @@
246247
"TargetEncoder": TargetEncoder if target_encoder_support else NOT_SUPPORTED,
247248
}
248249

250+
SKLEARN_FEATURE_EXTRACTION_TABLE = {
251+
# for raw data:
252+
"DictVectorizer": feature_extraction.DictVectorizer,
253+
"FeatureHasher": feature_extraction.FeatureHasher,
254+
255+
# for image data:
256+
"PatchExtractor": feature_extraction.image.PatchExtractor,
257+
258+
# for text data:
259+
"CountVectorizer": feature_extraction.text.CountVectorizer,
260+
"HashingVectorizer": feature_extraction.text.HashingVectorizer,
261+
"TfidfTransformer": feature_extraction.text.TfidfTransformer,
262+
"TfidfVectorizer": feature_extraction.text.TfidfVectorizer,
263+
}
264+
249265
SKLEARN_CROSS_DECOMPOSITION_TABLE = {
250-
"PLSRegression": PLSRegression,
251-
"PLSCanonical": PLSCanonical,
252-
"CCA": CCA,
266+
"PLSRegression": cross_decomposition.PLSRegression,
267+
"PLSCanonical": cross_decomposition.PLSCanonical,
268+
"CCA": cross_decomposition.CCA,
253269
}
254270

255271
KEYS_NEED_PREPROCESSING_BEFORE_DESERIALIZATION = {
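`SKLEARN_FEATURE_EXTRACTION_TABLE` maps class names to classes, which lets one table answer two questions: "is this live object a supported feature extractor?" and "does this serialized dict describe one?". The sketch below illustrates that dual check with stand-in classes so it runs without scikit-learn. `type_name` is an assumed stand-in for `get_sklearn_type`, whose body is not part of this diff.

```python
class DictVectorizer:
    """Stand-in for sklearn.feature_extraction.DictVectorizer."""


class FeatureHasher:
    """Stand-in for sklearn.feature_extraction.FeatureHasher."""


TABLE = {"DictVectorizer": DictVectorizer, "FeatureHasher": FeatureHasher}
TYPE_KEY = "pymilo-feature_extraction-type"


def type_name(obj):
    # Assumption: the real helper may derive the name differently.
    return type(obj).__name__


def is_fe_module(candidate):
    """Return True for a supported object or for a dict that names one."""
    if isinstance(candidate, dict):
        name = candidate.get(TYPE_KEY)
        return isinstance(name, str) and name in TABLE
    return type_name(candidate) in TABLE


live_ok = is_fe_module(DictVectorizer())
dict_ok = is_fe_module({TYPE_KEY: "FeatureHasher"})
plain_dict = is_fe_module({"vocabulary_": {}})
other_obj = is_fe_module(42)
```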

pymilo/streaming/communicator.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -124,7 +124,7 @@ class UploadPayload(StandardPayload):
124124

125125
class AttributeCallPayload(StandardPayload):
126126
attribute: str
127-
args: list
127+
args: dict
128128
kwargs: dict
129129

130130
class AttributeTypePayload(StandardPayload):

pymilo/transporters/feature_extraction_transporter.py

Lines changed: 131 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,131 @@
1+
# -*- coding: utf-8 -*-
2+
"""PyMilo Feature Extraction transporter."""
3+
from scipy.sparse import csr_matrix
4+
5+
from ..pymilo_param import SKLEARN_FEATURE_EXTRACTION_TABLE
6+
from ..utils.util import check_str_in_iterable, get_sklearn_type
7+
from .transporter import AbstractTransporter, Command
8+
from .general_data_structure_transporter import GeneralDataStructureTransporter
9+
from .randomstate_transporter import RandomStateTransporter
10+
11+
FEATURE_EXTRACTION_CHAIN = {
12+
"GeneralDataStructureTransporter": GeneralDataStructureTransporter(),
13+
"RandomStateTransporter": RandomStateTransporter(),
14+
}
15+
16+
17+
class FeatureExtractorTransporter(AbstractTransporter):
18+
"""Feature Extractor object dedicated Transporter."""
19+
20+
def serialize(self, data, key, model_type):
21+
"""
22+
Serialize Feature Extractor object.
23+
24+
Serialize data[key] of the given model, whose type is model_type.
25+
To fully serialize a model, traverse all the keys of its data dictionary and
26+
pass each value through the chain of associated transporters.
27+
28+
:param data: the internal data dictionary of the given model
29+
:type data: dict
30+
:param key: the key of the data param whose value (data[key]) is going to be serialized
31+
:type key: object
32+
:param model_type: the type of the ML model whose data dictionary is given as the data param
33+
:type model_type: str
34+
:return: pymilo serialized output of data[key]
35+
"""
36+
if self.is_fe_module(data[key]):
37+
return self.serialize_fe_module(data[key])
38+
return data[key]
39+
40+
def deserialize(self, data, key, model_type):
41+
"""
42+
Deserialize previously pymilo serialized feature extraction object.
43+
44+
Deserialize data[key] of the given model, whose type is model_type.
45+
To fully deserialize a model, traverse all the keys of its serialized data dictionary and
46+
pass each value through the chain of associated transporters.
47+
48+
:param data: the internal data dictionary of the ML model's json file, previously generated by
49+
pymilo export.
50+
:type data: dict
51+
:param key: the key of the data param whose value (data[key]) is going to be deserialized
52+
:type key: object
53+
:param model_type: the type of the ML model whose internal serialized data dictionary is given as the data param
54+
:type model_type: str
55+
:return: pymilo deserialized output of data[key]
56+
"""
57+
content = data[key]
58+
if self.is_fe_module(content):
59+
return self.deserialize_fe_module(content)
60+
return content
61+
62+
def is_fe_module(self, fe_module):
63+
"""
64+
Check whether the given module is a sklearn Feature Extraction module or not.
65+
66+
:param fe_module: given object
67+
:type fe_module: any
68+
:return: bool
69+
"""
70+
if isinstance(fe_module, dict):
71+
return check_str_in_iterable(
72+
"pymilo-feature_extraction-type",
73+
fe_module) and fe_module["pymilo-feature_extraction-type"] in SKLEARN_FEATURE_EXTRACTION_TABLE
74+
return get_sklearn_type(fe_module) in SKLEARN_FEATURE_EXTRACTION_TABLE
75+
76+
def serialize_fe_module(self, fe_module):
77+
"""
78+
Serialize Feature Extraction object.
79+
80+
:param fe_module: given sklearn feature extraction module
81+
:type fe_module: sklearn.feature_extraction
82+
:return: pymilo serialized fe_module
83+
"""
84+
# add one depth inner feature extraction module population
85+
for key, value in fe_module.__dict__.items():
86+
if self.is_fe_module(value):
87+
fe_module.__dict__[key] = self.serialize_fe_module(value)
88+
elif isinstance(value, csr_matrix):
89+
fe_module.__dict__[key] = {
90+
"pymilo-bypass": True,
91+
"pymilo-csr_matrix": FEATURE_EXTRACTION_CHAIN["GeneralDataStructureTransporter"].serialize_dict(
92+
value.__dict__
93+
)
94+
}
95+
96+
for transporter in FEATURE_EXTRACTION_CHAIN:
97+
FEATURE_EXTRACTION_CHAIN[transporter].transport(
98+
fe_module, Command.SERIALIZE)
99+
return {
100+
"pymilo-bypass": True,
101+
"pymilo-feature_extraction-type": get_sklearn_type(fe_module),
102+
"pymilo-feature_extraction-data": fe_module.__dict__
103+
}
104+
105+
def deserialize_fe_module(self, serialized_fe_module):
106+
"""
107+
Deserialize Feature Extraction object.
108+
109+
:param serialized_fe_module: serialized feature extraction module (by pymilo)
110+
:type serialized_fe_module: dict
111+
:return: retrieved associated sklearn.feature_extraction module
112+
"""
113+
data = serialized_fe_module["pymilo-feature_extraction-data"]
114+
associated_type = SKLEARN_FEATURE_EXTRACTION_TABLE[serialized_fe_module["pymilo-feature_extraction-type"]]
115+
retrieved_fe_module = associated_type()
116+
for key in data:
117+
# add one depth inner feature extraction module population
118+
if self.is_fe_module(data[key]):
119+
data[key] = self.deserialize_fe_module(data[key])
120+
elif check_str_in_iterable("pymilo-csr_matrix", data[key]):
121+
csr_matrix_dict = FEATURE_EXTRACTION_CHAIN["GeneralDataStructureTransporter"].get_deserialized_dict(
122+
data[key]["pymilo-csr_matrix"])
123+
cm = csr_matrix(csr_matrix_dict['_shape'])
124+
for _key in csr_matrix_dict:
125+
setattr(cm, _key, csr_matrix_dict[_key])
126+
data[key] = cm
127+
for transporter in FEATURE_EXTRACTION_CHAIN:
128+
data[key] = FEATURE_EXTRACTION_CHAIN[transporter].deserialize(data, key, "")
129+
for key in data:
130+
setattr(retrieved_fe_module, key, data[key])
131+
return retrieved_fe_module
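`serialize_fe_module` and `deserialize_fe_module` follow a common pattern: record the class name next to the instance `__dict__`, then rebuild by looking the class up, calling its default constructor, and restoring attributes with `setattr`. A minimal round-trip sketch of that pattern follows. `TinyVectorizer`, `dump_module`, and `restore_module` are hypothetical names, and the sketch copies the attribute dict instead of rewriting it in place as the transporter does.

```python
import json


class TinyVectorizer:
    """Hypothetical stand-in for a fitted feature extractor."""

    def __init__(self, lowercase=True):
        self.lowercase = lowercase


TABLE = {"TinyVectorizer": TinyVectorizer}


def dump_module(module):
    """Pair the class name with a copy of the instance attributes."""
    return {
        "pymilo-bypass": True,
        "pymilo-feature_extraction-type": type(module).__name__,
        "pymilo-feature_extraction-data": dict(vars(module)),
    }


def restore_module(payload):
    """Rebuild the object: default-construct, then restore every attribute."""
    cls = TABLE[payload["pymilo-feature_extraction-type"]]
    restored = cls()  # only works when every constructor argument has a default
    for name, value in payload["pymilo-feature_extraction-data"].items():
        setattr(restored, name, value)
    return restored


fitted = TinyVectorizer(lowercase=False)
fitted.vocabulary_ = {"hello": 0, "world": 1}  # learned state, set after construction
clone = restore_module(json.loads(json.dumps(dump_module(fitted))))
```

The default-constructor step is the main constraint of this approach: a class whose `__init__` requires arguments would need extra handling. The attributes here are already JSON-friendly; in the transporter, values such as tuples, sets, and `csr_matrix` objects go through the transporter chain first.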

pymilo/transporters/general_data_structure_transporter.py

Lines changed: 22 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -30,7 +30,9 @@ def serialize_tuple(self, tuple_field):
3030
new_tuple += (self.deep_serialize_ndarray(item),)
3131
else:
3232
new_tuple += (item,)
33-
return new_tuple
33+
return {
34+
"pymilo-tuple": new_tuple,
35+
}
3436

3537
# dict serializer for Logistic regression CV
3638
def serialize_dict(self, dictionary):
@@ -147,6 +149,11 @@ def serialize(self, data, key, model_type):
147149
elif isinstance(data[key], np.ndarray):
148150
data[key] = self.deep_serialize_ndarray(data[key])
149151

152+
elif isinstance(data[key], set):
153+
data[key] = {
154+
"pymilo-set": list(data[key])
155+
}
156+
150157
elif isinstance(data[key], dict):
151158
data[key] = self.serialize_dict(data[key])
152159

@@ -213,6 +220,12 @@ def get_deserialized_dict(self, content):
213220
if not isinstance(content, dict):
214221
return content
215222

223+
if check_str_in_iterable("pymilo-tuple", content):
224+
return tuple(self.get_deserialized_list(content["pymilo-tuple"]))
225+
226+
if check_str_in_iterable("pymilo-set", content):
227+
return set(self.get_deserialized_list(content["pymilo-set"]))
228+
216229
if self.is_deserialized_ndarray(content):
217230
return self.deep_deserialize_ndarray(content)
218231

@@ -261,7 +274,9 @@ def get_deserialized_list(self, content):
261274
"""
262275
new_list = []
263276
for item in content:
264-
if self.is_deserialized_ndarray(item):
277+
if check_str_in_iterable("pymilo-tuple", item):
278+
new_list.append(tuple(self.get_deserialized_list(item["pymilo-tuple"])))
279+
elif self.is_deserialized_ndarray(item):
265280
new_list.append(self.deep_deserialize_ndarray(item))
266281
else:
267282
new_list.append(self.deserialize_primitive_type(item))
@@ -281,7 +296,11 @@ def get_deserialized_regular_primary_types(self, content):
281296
"""
282297
if "np-type" in content:
283298
if content["np-type"] == "numpy.dtype":
284-
return NUMPY_TYPE_DICT[content["np-type"]](NUMPY_TYPE_DICT[content['value']])
299+
if isinstance(content["value"], str):
300+
# when the value is the associated type name like numpy.float64
301+
return NUMPY_TYPE_DICT[content["value"]]
302+
else:
303+
return NUMPY_TYPE_DICT[content["np-type"]](NUMPY_TYPE_DICT[content['value']])
285304
if content["np-type"] == "numpy.nan":
286305
return NUMPY_TYPE_DICT[content["np-type"]]
287306
return NUMPY_TYPE_DICT[content["np-type"]](content['value'])
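The tuple and set changes in this file address a JSON limitation: JSON has no tuple or set type, so an untagged tuple comes back as a list and a set cannot be dumped at all. Wrapping each in a single-key dict (`"pymilo-tuple"`, `"pymilo-set"`) keeps the type recoverable. The sketch below shows the idea in isolation. `tag` and `untag` are hypothetical helpers, not PyMilo functions, and the sketch assumes no ordinary dict uses those two keys.

```python
import json


def tag(value):
    """Recursively wrap tuples and sets so they survive a JSON round trip."""
    if isinstance(value, tuple):
        return {"pymilo-tuple": [tag(v) for v in value]}
    if isinstance(value, set):
        return {"pymilo-set": [tag(v) for v in value]}
    if isinstance(value, dict):
        return {k: tag(v) for k, v in value.items()}
    if isinstance(value, list):
        return [tag(v) for v in value]
    return value


def untag(value):
    """Reverse tag(): turn tagged dicts back into tuples and sets."""
    if isinstance(value, dict):
        if "pymilo-tuple" in value:
            return tuple(untag(v) for v in value["pymilo-tuple"])
        if "pymilo-set" in value:
            return set(untag(v) for v in value["pymilo-set"])
        return {k: untag(v) for k, v in value.items()}
    if isinstance(value, list):
        return [untag(v) for v in value]
    return value


original = {
    "ngram_range": (1, 2),
    "stop_words_": {"a", "the"},
    "steps": [("tfidf", 3)],
}
restored = untag(json.loads(json.dumps(tag(original))))
```

Checking the tuple tag inside list items, as `get_deserialized_list` does, is what allows nested values such as `[("tfidf", 3)]` to come back as a list of tuples rather than a list of lists.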
