
Add CJE integration for calibrated evaluation#2370

Open
elandesberg wants to merge 12 commits into expectedparrot:main from elandesberg:feature/cje-integration

Conversation

@elandesberg

Summary

Adds integration with CJE (Causal Judge Evaluation) to enable calibrated evaluation of AI survey responses against human ground truth.

Key benefits:

  • Calibrate with 5-10% human labels to get valid estimates for the full dataset
  • Proper uncertainty quantification with confidence intervals
  • Statistical comparison between models/policies
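For intuition, the "calibrate with a small labeled subset" idea can be sketched as a simple mean-correction estimator: debias the full-sample judge mean with the average (human − judge) residual on the labeled rows. This is an illustrative simplification, not CJE's actual estimator, and `calibrated_estimate` is a hypothetical helper name:

```python
import math

def calibrated_estimate(judge_scores, human_labels, labeled_idx, z=1.96):
    """Debias the mean judge score using a small human-labeled subset.

    Adds the mean (human - judge) residual on the labeled rows to the
    full-sample judge mean, and returns a normal-approximation CI.
    Illustrative only -- CJE's actual estimators are more sophisticated.
    """
    n = len(judge_scores)
    judge_mean = sum(judge_scores) / n

    # Residuals on the labeled subset (e.g. ~10% of rows)
    residuals = [human_labels[i] - judge_scores[i] for i in labeled_idx]
    m = len(residuals)
    correction = sum(residuals) / m

    estimate = judge_mean + correction

    # Variance of the estimate: judge-mean term plus residual-mean term
    var_judge = sum((s - judge_mean) ** 2 for s in judge_scores) / (n * (n - 1))
    var_resid = sum((r - correction) ** 2 for r in residuals) / (m * (m - 1))
    half_width = z * math.sqrt(var_judge + var_resid)

    return estimate, (estimate - half_width, estimate + half_width)
```

Because only the residual mean needs human labels, the labeled fraction controls the CI width rather than the validity of the point estimate.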

Usage

# Run survey with multiple models
results = survey.by([Model("gpt-4o"), Model("claude-3-5-sonnet")]).run()

# Calibrate the judge's sentiment_score against human labels for
# ~10% of responses, passed directly as a list aligned with the rows
cal = results.calibrate("sentiment_score", human_labels)

print(cal.estimates)
# {'gpt-4o': 0.72, 'claude-3-5-sonnet': 0.68}

print(cal.confidence_intervals)
# {'gpt-4o': (0.68, 0.76), 'claude-3-5-sonnet': (0.64, 0.72)}

# Statistical comparison
cal.compare("gpt-4o", "claude-3-5-sonnet")
# ComparisonResult(diff=0.04, p=0.046, significant=True)

Install: pip install "edsl[cje]" (quotes keep shells like zsh from expanding the brackets)
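A minimal sketch of what a comparison like `compare` might compute: a two-sided z-test on the difference of two calibrated estimates. This assumes independent estimates with known standard errors; `compare_policies` is a hypothetical helper, not the actual CJE implementation, and a paired estimator (as in the example output above) can yield a smaller p-value than this independence-based sketch:

```python
import math

def compare_policies(est_a, se_a, est_b, se_b):
    """Two-sided z-test for the difference of two calibrated estimates.

    Treats the estimates as independent normals with the given standard
    errors (a simplification; paired estimators share variance and are
    more powerful). Returns (difference, p_value).
    """
    diff = est_a - est_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    z = diff / se_diff
    # Two-sided p-value from the standard normal CDF
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return diff, p_value
```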

Changes

  • New edsl/cje_integration/ package (calibrator, data adapters, result types)
  • Added Results.calibrate() convenience method
  • Added cje-eval as optional dependency with [cje] extra
  • Demo notebook: docs/notebooks/cje_calibration_demo.ipynb
  • 11 passing tests

Test plan

  • All 11 unit tests pass (pytest tests/cje_integration/ -v)
  • End-to-end test with real CJE passes
  • Demo notebook runs correctly end-to-end

@elandesberg
Author

Any feedback is appreciated! Just a first pass - hope it makes sense.

@elandesberg
Author

Actually, I just ran into a bug in the notebook, so please hold off on review.

@elandesberg elandesberg marked this pull request as draft January 12, 2026 18:47
EDSL Results doesn't have add_column(), so changed the API to accept
oracle labels directly as a list parameter instead of a column name.

Changes:
- data_adapters.py: oracle_column -> oracle_labels (list)
- calibrator.py: Updated CJECalibrator and calibrate() signatures
- results.py: Updated Results.calibrate() method
- notebook: Updated demo to pass human_labels list directly
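As a sketch of the list-based interface described above (hypothetical helper name; the real signatures live in `calibrator.py` and `data_adapters.py`), aligning an oracle-labels list with results rows might look like:

```python
from typing import Optional, Sequence

def align_oracle_labels(
    judge_scores: Sequence[float],
    oracle_labels: Sequence[Optional[float]],
):
    """Pair judge scores with an oracle-labels list of equal length.

    None marks unlabeled rows; returns the indices of labeled rows so
    a calibrator can fit on just the labeled subset. Hypothetical
    sketch of the list-based API described in the commit above.
    """
    if len(oracle_labels) != len(judge_scores):
        raise ValueError("oracle_labels must align one-to-one with judge_scores")
    labeled_idx = [i for i, y in enumerate(oracle_labels) if y is not None]
    if not labeled_idx:
        raise ValueError("need at least one human label to calibrate")
    return labeled_idx
```

Passing the labels as a positional list avoids requiring an `add_column()` step that EDSL Results does not support.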
@elandesberg elandesberg marked this pull request as ready for review January 12, 2026 20:23
@elandesberg
Author

Ok, this is ready for a review. No rush. Thank you.

