@vb-dbrks vb-dbrks commented Jan 8, 2026

Changes

This PR adds ML-based anomaly detection to DQX so that users can detect unusual patterns in their data that traditional rule-based checks miss.

Key features:

  • Auto-discovery: Automatically selects relevant columns and creates segmented models when needed
  • Isolation Forest: Uses scikit-learn's Isolation Forest algorithm for fast, scalable anomaly detection
  • Explainability: SHAP-based feature contributions show why records were flagged
  • Unity Catalog integration: Models stored in UC with full lineage and versioning
  • New check function: has_no_anomalies() works like other DQX checks
  • Production defaults: ensemble models (2x), 0.60 threshold, and feature contributions enabled by default
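
To illustrate the scoring idea behind the Isolation Forest and threshold items above, here is a minimal pure-Python sketch. It is not the DQX implementation (which, per this description, uses scikit-learn). The names `fit_forest`, `anomaly_score`, and `ensemble_score` are hypothetical, and reading "Ensemble models (2x)" as averaging two independently seeded forests is an assumption, as is flagging rows whose score is at or above the threshold. Only the 0.60 value comes from this PR.

```python
import math
import random

THRESHOLD = 0.60  # default threshold quoted in the PR description


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random value."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # simplification: stop if the chosen feature is constant
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(tree, row, depth=0):
    """Depth at which row lands in a leaf, adjusted for unsplit leaf size."""
    if "feature" not in tree:
        return depth + c_factor(tree["size"])
    branch = "left" if row[tree["feature"]] < tree["split"] else "right"
    return path_length(tree[branch], row, depth + 1)


def fit_forest(rows, n_trees=100, sample_size=256, seed=0):
    """Build n_trees isolation trees, each on a random subsample of rows."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(max(sample_size, 2)))
    trees = [
        build_tree(rng.sample(rows, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    return trees, sample_size


def anomaly_score(forest, row):
    """Score in (0, 1]: short average paths (easy to isolate) give high scores."""
    trees, sample_size = forest
    mean_path = sum(path_length(t, row) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / c_factor(sample_size))


def ensemble_score(forests, row):
    """Assumed ensemble rule: average the scores of independently seeded forests."""
    return sum(anomaly_score(f, row) for f in forests) / len(forests)


# Synthetic data: 200 points around the origin plus one obvious outlier.
data_rng = random.Random(42)
rows = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
rows.append((8.0, 8.0))

forests = [fit_forest(rows, seed=s) for s in (0, 1)]  # "2x" ensemble
for point in [(0.0, 0.0), (8.0, 8.0)]:
    score = ensemble_score(forests, point)
    print(point, round(score, 3), "anomaly" if score >= THRESHOLD else "ok")
```

The far-away point is isolated after very few random splits, so its score lands well above the threshold, while a point near the center needs many splits and scores lower.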

What's included:

  • New AnomalyEngine for training models
  • Feature engineering for numeric, categorical, datetime, and boolean columns
  • Model registry with drift detection
  • Demo 101
  • Documentation updates
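
The feature engineering item above covers four column types. As a rough illustration of what turning mixed-type rows into model inputs can look like, the sketch below encodes each type into numbers. The helpers `fit_frequency_maps` and `encode_row`, the sample columns, and the specific encodings (frequency encoding for categoricals, hour and weekday for datetimes) are assumptions for illustration. This description does not say which encodings DQX uses.

```python
from collections import Counter
from datetime import datetime


def fit_frequency_maps(rows, categorical_cols):
    """Map each category to its relative frequency in the training rows."""
    maps = {}
    for col in categorical_cols:
        counts = Counter(row[col] for row in rows)
        maps[col] = {value: n / len(rows) for value, n in counts.items()}
    return maps


def encode_row(row, freq_maps):
    """Turn one mixed-type row (a dict) into a flat list of floats."""
    features = []
    for col in sorted(row):  # fixed column order keeps vectors comparable
        value = row[col]
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            features.append(1.0 if value else 0.0)
        elif isinstance(value, (int, float)):
            features.append(float(value))
        elif isinstance(value, datetime):
            features.extend([float(value.hour), float(value.weekday())])
        else:  # treat everything else as categorical; unseen values map to 0.0
            features.append(freq_maps.get(col, {}).get(value, 0.0))
    return features


# Hypothetical sample rows with one column of each type.
rows = [
    {"amount": 12.5, "country": "DE", "is_refund": False, "ts": datetime(2026, 1, 5, 9, 30)},
    {"amount": 990.0, "country": "DE", "is_refund": True, "ts": datetime(2026, 1, 6, 3, 0)},
    {"amount": 15.0, "country": "FR", "is_refund": False, "ts": datetime(2026, 1, 7, 14, 0)},
]
freq_maps = fit_frequency_maps(rows, ["country"])
encoded = [encode_row(row, freq_maps) for row in rows]
print(encoded[0])
```

Each row becomes a five-element vector (amount, country frequency, refund flag, hour, weekday), which is the kind of purely numeric input a tree-based detector expects.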

Resolves #957

Tests

  • manually tested (ran all demos on Databricks)
  • added unit tests (124 tests across 7 test files)
  • added integration tests (200+ tests covering training, scoring, ensemble, drift, etc.)
  • added end-to-end tests
  • added performance tests

…ion formatting and complexity handling; update MLflow tracking URI setup in trainer
… by consolidating conditional logic for single and multiple features
Labels

  • documentation: Improvements or additions to documentation
  • enhancement: New feature or request

Development

Successfully merging this pull request may close these issues.

[FEATURE]: ML-based Anomaly Detection for row-level (has_no_anomalies)

3 participants