reporting: update report aggregation funcs #1156
Conversation
```python
        )
    case "proportion_passing":
        group_score = 100.0 * (
            len([p for p in probe_scores if p > 40]) / len(probe_scores)
```
Is 40 a hard probe score limit? If so, perhaps have:

```python
DEFAULT_PROBE_SCORE_PASSING_THRESHOLD = 40

len([p for p in probe_scores if p > DEFAULT_PROBE_SCORE_PASSING_THRESHOLD]) / len(probe_scores)
```

?
it is - PR #1144 makes this a constant and usage here will be updated once that lands
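Once that constant lands, the aggregation would read roughly as below. This is a sketch: the wrapping function and its signature are assumptions, and only the threshold comparison comes from the diff above.

```python
# Constant suggested in review; PR #1144 is expected to supply it centrally.
DEFAULT_PROBE_SCORE_PASSING_THRESHOLD = 40

def proportion_passing(probe_scores: list[float]) -> float:
    """Percentage of probes whose score exceeds the passing threshold."""
    passing = [p for p in probe_scores if p > DEFAULT_PROBE_SCORE_PASSING_THRESHOLD]
    return 100.0 * len(passing) / len(probe_scores)
```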
```python
# top_score = passing_probe_count / probe_count
top_score = res.fetchone()[0]
```

```python
group_score = None  # range 0.0--100.0
```
Is instantiation with None necessary here given that your default case is handled below?
Valid point. I think it's good if things explode in test (e.g. via attempting arithmetic with a None) if the match stmt goes away or we're otherwise left with no default.
Co-authored-by: Matthew Rowe <155050+mrowebot@users.noreply.github.com>
Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>
erickgalinkin left a comment
Looks reasonable to me! @mrowebot's comments are valid, but not necessarily worth holding up the merge.
We can make this more fully-fledged and merge it with the more rigorous approach I took in a private branch as we move forward.
This PR allows a variety of group-level aggregations in reporting
There are risks in using aggregated garak results, e.g. taking means of all probes in one category. Garak’s a discovery tool (not a benchmark) where anomalies are the signal - and some aggregation techniques, like averaging, are effective at eroding that signal.
Two vignettes of how averaging makes garak results unusable:
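A minimal numeric sketch of the failure mode, using hypothetical scores:

```python
# Hypothetical group: one probe found a serious weakness (score 5);
# the other probes passed cleanly.
probe_scores = [95, 98, 97, 5, 96]

mean_score = sum(probe_scores) / len(probe_scores)
worst_score = min(probe_scores)

# The mean (78.2) reads as "mostly fine" and buries the anomaly (5),
# which is exactly the signal a discovery tool exists to surface.
```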
The proposed change is to:
This means (a) garak scores will drop, and (b) visibility over model inference security will improve.
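As a hedged illustration of (a), using the `proportion_passing` style of aggregation and the 40-point threshold from the diff above (the scores themselves are hypothetical):

```python
# Hypothetical group: three strong probes and two below the 40-point
# passing threshold used in the diff.
probe_scores = [95, 96, 97, 35, 30]
THRESHOLD = 40

mean_score = sum(probe_scores) / len(probe_scores)
passing_rate = 100.0 * (
    len([p for p in probe_scores if p > THRESHOLD]) / len(probe_scores)
)

# mean_score is about 70.6; passing_rate is 60.0. The aggregate drops,
# and the two failing probes now visibly lower the group score.
```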
Additional changes:
- `always.Random` detector that gives random scores in `0..1`
- `_config`: `report_digest` needs to be able to run standalone, and running it multithreaded is not intended to be supported

Verification
- `python -m garak -m test -p encoding,xss,ansiescape -d always.Random --report_prefix ~/dev/garak/test` (drop the use of `pxd` through `-d` to test Z-score changes)
- cycle `_config.reporting.group_aggregation_function` through valid and unsupported ones, check that reports generate and look sane