Commit 2c06f09

TypingKoala (Johnny Bui) authored
Add Filtering and Aggregation Processors (#14)
* Add initial implementations of filter and aggregate
* Add tests for filtering and aggregation
* Add new processors to docs
* Add support for FilterByIndex on single-indices
* Add additional test case for FilterByIndex for single-indices
* Add improved examples for aggregators
* Increment version number and document changes
* Improve docstrings of filter and aggregate functions
* Add test to check that aggregate excludes non-numeric metrics

Co-authored-by: Johnny Bui <[email protected]>
1 parent f1df36d · commit 2c06f09

File tree: 6 files changed, +269 −15 lines changed


README.md

Lines changed: 5 additions & 0 deletions
@@ -13,6 +13,7 @@ Take a look at the notebooks below to demonstrate the functionality of FTPVL.
 1. [Using `HydraFetcher` and Processors](https://colab.research.google.com/drive/1BIQ-iulDFpzcve7lGJPwLePJ5ETBJ6Ut?usp=sharing)
 2. [Styling tables with `SingleTableVisualizer`](https://colab.research.google.com/drive/1u3EnmIYnTBk-LXZhqNHt_h4aMuq-_cWq?usp=sharing)
 3. [Comparing two different Evaluations](https://colab.research.google.com/drive/1I7InmA6210vIIwdQ7TGHE6aF_WwIm1dM?usp=sharing)
+4. [Filtering and Aggregating an Evaluation](https://colab.research.google.com/drive/1DDwlQFS81RGLL-q8DsgICF-HOC5ir6oS?usp=sharing)
 
 ## Documentation
 Extensive documentation, including a *Getting Started* guide, is available on
@@ -33,6 +34,7 @@ make html
 * `pandas`: for data management and processing ([website](https://pandas.pydata.org/))
 * `seaborn`: for colormap generation ([website](https://seaborn.pydata.org/))
 * `jinja2`: for visualization generation ([website](https://jinja.palletsprojects.com/))
+* `scipy`: for support of built-in aggregators ([website](https://www.scipy.org/))
 
 ### Development Dependencies
 * `requests-mock`: for mocking request object for testing fetchers ([website](https://requests-mock.readthedocs.io/en/latest/))
@@ -44,6 +46,9 @@ make html
 * `sphinx-rtd-theme`: for documentation generation (theme) ([website](https://github.com/readthedocs/sphinx_rtd_theme))
 
 ## Changes
+### 0.1.6
+* Added support for filter and aggregator processors, fixes [#9](https://github.com/SymbiFlow/FPGA-Tool-Performance-Visualization-Library/issues/9)
+
 ### 0.1.5
 * Added support for custom projects and jobsets in HydraFetcher.

docs/topics/api.rst

Lines changed: 9 additions & 0 deletions
@@ -67,6 +67,15 @@ Processors API
 .. autoclass:: ftpvl.processors.RelativeDiff
    :members:
 
+.. autoclass:: ftpvl.processors.FilterByIndex
+   :members:
+
+.. autoclass:: ftpvl.processors.Aggregate
+   :members:
+
+.. autoclass:: ftpvl.processors.GeomeanAggregate
+   :members:
+
 .. _topics-api-styles:
 
 Styles API

ftpvl/processors.py

Lines changed: 107 additions & 2 deletions
@@ -1,9 +1,11 @@
 """ Processors transform Evaluations to be more useful when visualized. """
+import math
+from typing import Any, Callable, Dict, List, Union
 
-from typing import List, Dict
-import pandas as pd
 import numpy as np
+import pandas as pd
 from ftpvl.evaluation import Evaluation
+from scipy import stats
 
 
 class Processor:
@@ -438,3 +440,106 @@ def process(self, b: Evaluation) -> Evaluation:
         difference_eval = Evaluation(diff)
 
         return difference_eval
+
+class FilterByIndex(Processor):
+    """
+    Processor that filters an Evaluation by matching a specified index value
+    after indexing.
+
+    This is best used in a processing pipeline after the Reindex processor.
+    For filtering an evaluation based on metric values (which are not
+    indices), use the FilterByMetric processor.
+
+    Parameters
+    ----------
+    index_name : str
+        the name of the index to use when filtering
+    index_value : Any
+        the value to compare with
+
+    Examples
+    --------
+    >>> a = Evaluation(pd.DataFrame(
+    ...     data=[
+    ...         {"x": 1, "y": 5},
+    ...         {"x": 4, "y": 10}
+    ...     ],
+    ...     index=pd.Index(["a", "b"], name="key")))
+    >>> a.process([FilterByIndex("key", "a")]).get_df()
+         x  y
+    key
+    a    1  5
+    """
+    def __init__(self, index_name: str, index_value: Any):
+        self.index_name = index_name
+        self.index_value = index_value
+
+    def process(self, input_eval: Evaluation):
+        old_df = input_eval.get_df()
+        if isinstance(old_df.index, pd.MultiIndex):
+            new_df = old_df.xs(self.index_value, level=self.index_name)
+        elif isinstance(old_df.index, pd.Index):
+            # slicing instead of indexing to maintain shape
+            new_df = old_df.loc[self.index_value:self.index_value]
+        else:
+            raise ValueError("Incompatible dataframe index.")
+        return Evaluation(new_df, input_eval.get_eval_id())
+
+class Aggregate(Processor):
+    """
+    Processor that aggregates all the numeric fields of an Evaluation
+    using a specified function.
+
+    This acts as a superclass for specific aggregator implementations, such
+    as GeomeanAggregate. It can also be used for custom aggregations, by
+    supplying an aggregator function to the constructor.
+
+    Parameters
+    ----------
+    func : Callable[[pd.Series], Union[int, float]]
+        a function that takes a Pandas Series and aggregates it into a
+        single number, possibly a NaN value
+
+    Examples
+    --------
+    >>> a = Evaluation(pd.DataFrame(
+    ...     data=[
+    ...         {"x": 1, "y": 5},
+    ...         {"x": 4, "y": 10}
+    ...     ]))
+    >>> a.process([Aggregate(lambda x: x.sum())]).get_df()
+       x   y
+    0  5  15
+    """
+    def __init__(self, func: Callable[[pd.Series], Union[int, float]]):
+        self.func = func
+
+    def process(self, input_eval: Evaluation):
+        old_df = input_eval.get_df()
+        numeric_columns = old_df.select_dtypes(include=['number']).dropna(axis=1).columns
+        new_df = pd.DataFrame([old_df[numeric_columns].agg(self.func)])
+        return Evaluation(new_df, input_eval.get_eval_id())
+
+class GeomeanAggregate(Aggregate):
+    """
+    Processor that aggregates an entire Evaluation by finding the geometric
+    mean of each numeric metric.
+
+    Subclass of the Aggregate class.
+
+    Examples
+    --------
+    >>> a = Evaluation(pd.DataFrame(
+    ...     data=[
+    ...         {"x": 1, "y": 8},
+    ...         {"x": 4, "y": 8}
+    ...     ]))
+    >>> a.process([GeomeanAggregate()]).get_df()
+         x    y
+    0  2.0  8.0
+    """
+    def __init__(self):
+        def geomean(x):
+            x = x.dropna()
+            return stats.gmean(x) if not x.empty else math.nan
+        super().__init__(geomean)
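The single-index branch of `FilterByIndex.process` slices with `old_df.loc[value:value]` rather than indexing with `old_df.loc[value]`, as its inline comment notes, so the result stays a two-dimensional DataFrame. A minimal standalone sketch of that distinction (the frame below mirrors the single-index test fixture in this commit, but is otherwise illustrative only):

```python
import pandas as pd

# A single-index frame with duplicate labels, as in
# test_filterbyindex_singleindex below.
df = pd.DataFrame(
    {"value": [10, 5, 3, 100, 31]},
    index=pd.Index(["a", "a", "a", "b", "b"], name="key"),
)

# Label *slicing* keeps all matching rows and returns a DataFrame...
sliced = df.loc["a":"a"]
print(sliced.shape)  # (3, 1)

# ...whereas plain .loc indexing on a unique label would collapse the
# result to a Series, changing the shape downstream processors expect.
```

Note that label slicing on a non-unique index only works because the index here is sorted; on an unsorted, non-unique index pandas raises an error.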

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@ pandas
 requests-mock
 seaborn
 jinja2
+scipy
 
 pylint
 pytest

setup.py

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@
 AUTHOR = 'Johnny Bui'
 REQUIRES_PYTHON = '>=3.6.0'
-VERSION = '0.1.5'
+VERSION = '0.1.6'
 
 # What packages are required for this module to be executed?
 REQUIRED = [

tests/test_processor.py

Lines changed: 146 additions & 12 deletions
@@ -5,18 +5,7 @@
     assert_frame_equal, assert_series_equal, assert_index_equal
 )
 
-from ftpvl.processors import (
-    AddNormalizedColumn,
-    CleanDuplicates,
-    MinusOne,
-    StandardizeTypes,
-    ExpandColumn,
-    Reindex,
-    SortIndex,
-    NormalizeAround,
-    Normalize,
-    RelativeDiff
-)
+from ftpvl.processors import *
 
 from ftpvl.evaluation import Evaluation
 
@@ -34,6 +23,9 @@ class TestProcessor:
     SortIndex()
     NormalizeAround()
     Normalize()
+    FilterByIndex()
+    Aggregate()
+    GeomeanAggregate()
     """
 
     def test_minusone(self):
@@ -505,3 +497,145 @@ def test_relativediff(self):
         )
 
         assert_frame_equal(expected, result)
+
+    def test_filterbyindex_multindex(self):
+        """ tests if filtering by index works for multi-index dataframe """
+        # test dataframe:
+        # {"group": "a", "key": "a", "value": 10},
+        # {"group": "a", "key": "b", "value": 5},
+        # {"group": "a", "key": "c", "value": 3},
+        # {"group": "b", "key": "d", "value": 100},
+        # {"group": "b", "key": "e", "value": 31}
+
+        idx_arrays = [["a", "a", "a", "b", "b"], ["a", "b", "c", "d", "e"]]
+        index = pd.MultiIndex.from_arrays(idx_arrays, names=("group", "key"))
+        df = pd.DataFrame({"value": [10, 5, 3, 100, 31]}, index=index)
+        eval1 = Evaluation(df, eval_id=10)
+
+        # filter by first index
+        pipeline = [FilterByIndex("group", "a")]
+        result = eval1.process(pipeline)
+
+        expected_index = pd.Index(["a", "b", "c"], name="key")
+        expected_df = pd.DataFrame({"value": [10, 5, 3]}, index=expected_index)
+
+        assert_frame_equal(result.get_df(), expected_df)
+        assert result.get_eval_id() == 10
+
+        # filter by second index
+        pipeline = [FilterByIndex("key", "a")]
+        result = eval1.process(pipeline)
+
+        expected_index = pd.Index(["a"], name="group")
+        expected_df = pd.DataFrame({"value": [10]}, index=expected_index)
+
+        assert_frame_equal(result.get_df(), expected_df)
+        assert result.get_eval_id() == 10
+
+    def test_filterbyindex_singleindex(self):
+        """ tests if filtering by index works for single-index dataframe """
+        # test dataframe:
+        # {"key": "a", "value": 10},
+        # {"key": "a", "value": 5},
+        # {"key": "a", "value": 3},
+        # {"key": "b", "value": 100},
+        # {"key": "b", "value": 31}
+
+        idx_array = ["a", "a", "a", "b", "b"]
+        index = pd.Index(idx_array, name="key")
+        df = pd.DataFrame({"value": [10, 5, 3, 100, 31]}, index=index)
+        eval1 = Evaluation(df, eval_id=10)
+
+        # filter by index
+        pipeline = [FilterByIndex("key", "a")]
+        result = eval1.process(pipeline)
+
+        expected_index = pd.Index(["a", "a", "a"], name="key")
+        expected_df = pd.DataFrame({"value": [10, 5, 3]}, index=expected_index)
+
+        assert_frame_equal(result.get_df(), expected_df)
+        assert result.get_eval_id() == 10
+
+    def test_aggregate(self):
+        """ Test aggregate processor with custom aggregator functions """
+        df = pd.DataFrame(
+            [
+                {"a": 1, "b": 1, "c": 5},
+                {"a": 1, "b": 2, "c": 4},
+                {"a": 3, "b": 3, "c": 3},
+                {"a": 4, "b": 4, "c": 2},
+                {"a": 5, "b": 5, "c": 1},
+            ]
+        )
+        eval1 = Evaluation(df, eval_id=20)
+
+        pipeline = [Aggregate(lambda x: x.sum())]
+        result = eval1.process(pipeline)
+
+        expected_df = pd.DataFrame(
+            [
+                {"a": 14, "b": 15, "c": 15}
+            ]
+        )
+        assert_frame_equal(result.get_df(), expected_df)
+        assert eval1.get_eval_id() == 20
+
+        pipeline2 = [Aggregate(lambda x: x.product())]
+        result2 = eval1.process(pipeline2)
+
+        expected_df2 = pd.DataFrame(
+            [
+                {"a": 60, "b": 120, "c": 120}
+            ]
+        )
+        assert_frame_equal(result2.get_df(), expected_df2)
+        assert result2.get_eval_id() == 20
+
+    def test_aggregate_exclude_nonnumeric(self):
+        """ Check if aggregate processor excludes fields that are non-numeric """
+        df = pd.DataFrame(
+            [
+                {"a": 1, "b": 1, "c": "a"},
+                {"a": 1, "b": 2, "c": "b"},
+                {"a": 3, "b": 3, "c": "c"},
+                {"a": 4, "b": 4, "c": "d"},
+                {"a": 5, "b": 5, "c": "e"},
+            ]
+        )
+        eval1 = Evaluation(df, eval_id=20)
+
+        pipeline = [Aggregate(lambda x: x.sum())]
+        result = eval1.process(pipeline)
+
+        expected_df = pd.DataFrame(
+            [
+                {"a": 14, "b": 15}
+            ]
+        )
+        assert_frame_equal(result.get_df(), expected_df)
+        assert eval1.get_eval_id() == 20
+
+    def test_geomean_aggregate(self):
+        """ Test built-in geomean aggregator """
+        df = pd.DataFrame(
+            [
+                {"a": 1, "b": 1, "c": 5},
+                {"a": 1, "b": 2, "c": 4},
+                {"a": 3, "b": 3, "c": 3},
+                {"a": 4, "b": 4, "c": 2},
+                {"a": 5, "b": 5, "c": 1},
+            ]
+        )
+        eval1 = Evaluation(df, eval_id=20)
+
+        pipeline = [GeomeanAggregate()]
+        eval1 = eval1.process(pipeline)
+
+        expected_a = (1 * 1 * 3 * 4 * 5) ** (1/5)
+        expected_b = expected_c = (1 * 2 * 3 * 4 * 5) ** (1/5)
+        expected_df = pd.DataFrame(
+            [
+                {"a": expected_a, "b": expected_b, "c": expected_c}
+            ]
+        )
+        assert_frame_equal(eval1.get_df(), expected_df)
+        assert eval1.get_eval_id() == 20
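The expected values in `test_geomean_aggregate` come straight from the definition of the geometric mean, which `scipy.stats.gmean` computes as exp(mean(log x)). A dependency-free sketch (the `geomean` helper below is a hypothetical stand-in for `stats.gmean`, valid for positive inputs only) reproducing those expectations:

```python
import math

def geomean(values):
    """Geometric mean via exp(mean(log x)); NaN for an empty input.

    Mirrors the NaN-dropping behavior of GeomeanAggregate's inner
    helper, but is a sketch, not the library code itself.
    """
    vals = [v for v in values if not math.isnan(v)]
    if not vals:
        return math.nan
    # Equivalent to the n-th root of the product, but numerically stabler.
    return math.exp(sum(math.log(v) for v in vals) / len(vals))

# The expectations from test_geomean_aggregate:
expected_a = (1 * 1 * 3 * 4 * 5) ** (1 / 5)  # column "a"
expected_b = (1 * 2 * 3 * 4 * 5) ** (1 / 5)  # columns "b" and "c"
print(math.isclose(geomean([1, 1, 3, 4, 5]), expected_a))  # True
print(math.isclose(geomean([1, 2, 3, 4, 5]), expected_b))  # True
```

The log-sum form is why `GeomeanAggregate` drops NaN values first: a single NaN would otherwise poison the whole mean.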
