
Add CJE integration for calibrated evaluation#2370

Open
elandesberg wants to merge 12 commits into expectedparrot:main from elandesberg:feature/cje-integration

Conversation

@elandesberg

Summary

Adds integration with CJE (Causal Judge Evaluation) to enable calibrated evaluation of AI survey responses against human ground truth.

Key benefits:

  • Calibrate with 5-10% human labels to get valid estimates for the full dataset
  • Proper uncertainty quantification with confidence intervals
  • Statistical comparison between models/policies
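For intuition, the "calibrate with a small labeled subset" idea can be sketched as a simple mean-correction estimator: debias the full-sample judge mean with the average (human − judge) residual on the labeled rows. This is an illustrative simplification, not CJE's actual estimator, and `calibrated_estimate` is a hypothetical helper name:

```python
import math

def calibrated_estimate(judge_scores, human_labels, labeled_idx, z=1.96):
    """Debias the mean judge score using a small human-labeled subset.

    Adds the mean (human - judge) residual on the labeled rows to the
    full-sample judge mean, and returns a normal-approximation CI.
    Illustrative only -- CJE's actual estimators are more sophisticated.
    """
    n = len(judge_scores)
    judge_mean = sum(judge_scores) / n

    # Residuals on the labeled subset (e.g. ~10% of rows)
    residuals = [human_labels[i] - judge_scores[i] for i in labeled_idx]
    m = len(residuals)
    correction = sum(residuals) / m

    estimate = judge_mean + correction

    # Variance of the estimate: judge-mean term plus residual-mean term
    var_judge = sum((s - judge_mean) ** 2 for s in judge_scores) / (n * (n - 1))
    var_resid = sum((r - correction) ** 2 for r in residuals) / (m * (m - 1))
    half_width = z * math.sqrt(var_judge + var_resid)

    return estimate, (estimate - half_width, estimate + half_width)
```

Because only the residual mean needs human labels, the labeled fraction controls the CI width rather than the validity of the point estimate.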

Usage

# Run survey with multiple models
results = survey.by([Model("gpt-4o"), Model("claude-3-5-sonnet")]).run()

# Calibrate the judge's sentiment_score against human labels for
# ~10% of responses, passed directly as a list aligned with the rows
cal = results.calibrate("sentiment_score", human_labels)

print(cal.estimates)
# {'gpt-4o': 0.72, 'claude-3-5-sonnet': 0.68}

print(cal.confidence_intervals)
# {'gpt-4o': (0.68, 0.76), 'claude-3-5-sonnet': (0.64, 0.72)}

# Statistical comparison
cal.compare("gpt-4o", "claude-3-5-sonnet")
# ComparisonResult(diff=0.04, p=0.046, significant=True)

Install: pip install "edsl[cje]" (quotes keep shells like zsh from expanding the brackets)
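A minimal sketch of what a comparison like `compare` might compute: a two-sided z-test on the difference of two calibrated estimates. This assumes independent estimates with known standard errors; `compare_policies` is a hypothetical helper, not the actual CJE implementation, and a paired estimator (as in the example output above) can yield a smaller p-value than this independence-based sketch:

```python
import math

def compare_policies(est_a, se_a, est_b, se_b):
    """Two-sided z-test for the difference of two calibrated estimates.

    Treats the estimates as independent normals with the given standard
    errors (a simplification; paired estimators share variance and are
    more powerful). Returns (difference, p_value).
    """
    diff = est_a - est_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    z = diff / se_diff
    # Two-sided p-value from the standard normal CDF
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return diff, p_value
```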

Changes

  • New edsl/cje_integration/ package (calibrator, data adapters, result types)
  • Added Results.calibrate() convenience method
  • Added cje-eval as optional dependency with [cje] extra
  • Demo notebook: docs/notebooks/cje_calibration_demo.ipynb
  • 11 passing tests

Test plan

  • All 11 unit tests pass (pytest tests/cje_integration/ -v)
  • End-to-end test with real CJE passes
  • Demo notebook runs correctly end-to-end

@elandesberg
Author

Any feedback is appreciated! Just a first pass - hope it makes sense.

@elandesberg
Author

Actually, I just ran into a bug in the notebook, so please hold off on review.

@elandesberg elandesberg marked this pull request as draft January 12, 2026 18:47
EDSL Results doesn't have add_column(), so changed the API to accept
oracle labels directly as a list parameter instead of a column name.

Changes:
- data_adapters.py: oracle_column -> oracle_labels (list)
- calibrator.py: Updated CJECalibrator and calibrate() signatures
- results.py: Updated Results.calibrate() method
- notebook: Updated demo to pass human_labels list directly
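As a sketch of the list-based interface described above (hypothetical helper name; the real signatures live in `calibrator.py` and `data_adapters.py`), aligning an oracle-labels list with results rows might look like:

```python
from typing import Optional, Sequence

def align_oracle_labels(
    judge_scores: Sequence[float],
    oracle_labels: Sequence[Optional[float]],
):
    """Pair judge scores with an oracle-labels list of equal length.

    None marks unlabeled rows; returns the indices of labeled rows so
    a calibrator can fit on just the labeled subset. Hypothetical
    sketch of the list-based API described in the commit above.
    """
    if len(oracle_labels) != len(judge_scores):
        raise ValueError("oracle_labels must align one-to-one with judge_scores")
    labeled_idx = [i for i, y in enumerate(oracle_labels) if y is not None]
    if not labeled_idx:
        raise ValueError("need at least one human label to calibrate")
    return labeled_idx
```

Passing the labels as a positional list avoids requiring an `add_column()` step that EDSL Results does not support.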
@elandesberg elandesberg marked this pull request as ready for review January 12, 2026 20:23
@elandesberg
Author

Ok, this is ready for a review. No rush. Thank you.

