supply_chain_resilience: dual-mode GNN delay predictor (train from scratch or load pre-created) #92

Open
cafzal wants to merge 6 commits into main from feat/supply-chain-predictive-gnn

cafzal commented Jun 29, 2026

Collaborator

What

Adds a predictive stage to the supply_chain_resilience template so the supplier delay risk it feeds into the Rules → Prescriptive chain comes from an actual GNN, not a hand-authored delay_prediction.csv. Dual-mode via a TRAIN_GNN toggle:

  • TRAIN_GNN unset (default) — downstream stages use the bundled, GNN-produced data/delay_prediction.csv. Fast, no GPU, no Snowflake experiment schema.
  • TRAIN_GNN=true — train the GNN from scratch on the bundled corpus and regenerate the predictions.
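For reviewers, a minimal sketch of how such a toggle could be read. Only `TRAIN_GNN` and `data/delay_prediction.csv` come from this PR; the helper names `train_gnn_enabled` and `load_bundled_predictions`, and the set of accepted truthy spellings, are assumptions for illustration and not the template's actual code.

```python
import csv
import os


def train_gnn_enabled(env=None):
    """Return True only when TRAIN_GNN is set to a truthy value.

    Accepting "1"/"yes" in addition to "true" is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    return env.get("TRAIN_GNN", "").strip().lower() in {"1", "true", "yes"}


def load_bundled_predictions(path="data/delay_prediction.csv"):
    """Default path: read the bundled predictions as a list of row dicts."""
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh))
```

With the variable unset the function returns `False`, so the default path never touches the GPU or the experiment schema.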

Why a GNN

Delay risk propagates through the supply graph: a shipper with high own reliability is still risky when its upstream supplier is unreliable (e.g. B004 ← B003). A per-supplier or flat tabular model misses that; message-passing over Shipment → Supplier → upstream-Supplier edges recovers it.
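A toy illustration of the one-hop effect, not the GNN itself: a hand-written rule in which a supplier inherits a damped share of its worst upstream supplier's unreliability. The unreliability values and the `DAMPING` factor are hypothetical; only the B004 ← B003 edge is taken from this PR.

```python
# Hypothetical per-supplier unreliability (illustrative values only).
OWN_UNRELIABILITY = {"B003": 0.6, "B004": 0.1}

# supplier -> upstream suppliers it depends on (B004 <- B003 is from the PR).
UPSTREAM = {"B004": ["B003"]}

DAMPING = 0.5  # assumed attenuation across one hop


def effective_unreliability(supplier):
    """Own unreliability, or the damped worst upstream value if that is larger."""
    own = OWN_UNRELIABILITY[supplier]
    worst_up = max(
        (OWN_UNRELIABILITY[u] for u in UPSTREAM.get(supplier, [])),
        default=0.0,
    )
    return max(own, DAMPING * worst_up)
```

A flat per-supplier model only sees the `own` term, which is why B004 looks safe to it; the upstream term is the part message-passing is meant to recover.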

Changes

  • data/generate_corpus.py + the multi-year labelled corpus it produces (shipment_corpus.csv + temporal shipment_{train,val,test}.csv). The bundled 262 shipments are far too sparse to train a GNN; the corpus combines supplier reliability, a recurring seasonal surge, and one-hop upstream propagation. Baseline check: per-shipment test roc_auc ≈ 0.63, B004 elevated to 0.40 despite 0.90 own-reliability.
  • supply_chain_resilience_predictive.py — the dual-mode GNN stage (leakage-controlled features; per-shipment lateness aggregated to a per-supplier delay probability).
  • runbook.md — step 5b for the optional from-scratch training.
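A minimal sketch of the shipment-to-supplier aggregation step. It assumes a simple mean over each supplier's per-shipment lateness probabilities; the PR does not state the aggregation function, and `aggregate_to_suppliers` is a hypothetical name.

```python
from collections import defaultdict


def aggregate_to_suppliers(shipment_probs):
    """Average per-shipment lateness probabilities per supplier.

    shipment_probs: iterable of (supplier_id, p_late) pairs.
    Using the mean is an assumption of this sketch.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for supplier_id, p_late in shipment_probs:
        totals[supplier_id] += p_late
        counts[supplier_id] += 1
    return {s: totals[s] / counts[s] for s in totals}
```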

Draft / pending verification

The bundled delay_prediction.csv is still the prior table. The pending step is a real TRAIN_GNN=true GPU run to (1) confirm that the GNN trains to a non-degenerate model and recovers the graph-aware ranking, and (2) regenerate the shipped delay_prediction.csv from it. I will mark this ready once that run is green.

…ratch or load pre-created)

The template's supplier delay risk now comes from an actual GNN, not a hand-authored table, with a TRAIN_GNN toggle: train from scratch (GPU + Snowflake experiment schema) or run on the bundled, GNN-produced predictions (default; fast, no GPU).

- data/generate_corpus.py + the multi-year labelled corpus it produces (shipment_corpus.csv + temporal train/val/test splits): the bundled 262 shipments are too sparse to train a GNN. Delay risk combines supplier reliability, a recurring seasonal surge, and one-hop upstream propagation through the supply graph: a high-own-reliability shipper like B004 is risky via its upstream B003, a signal that a per-supplier model misses and message-passing recovers.
- supply_chain_resilience_predictive.py: the dual-mode GNN stage (Shipment->Supplier->upstream-Supplier graph, leakage-controlled features), aggregating per-shipment lateness to a per-supplier delay probability written to data/delay_prediction.csv.
- runbook step 5b documenting the optional from-scratch training.

Draft: the bundled delay_prediction.csv is still the prior table; the from-scratch GNN run that regenerates it is the pending verification step.
…+ regenerated predictions

Reworked the predictive stage to the proven CSV-backed (smoker_local) shape: homogeneous Shipment nodes + a relatedness edge list (co-supplier + upstream links). The earlier heterogeneous graph hit EMPTY_TABLE on local model.data secondary node tables. Supplier reliability is denormalized as a feature; the graph corrects it via label propagation.

Verified end-to-end (TRAIN_GNN=true, seed 42): non-degenerate, differentiated 0.19-0.52, B003 #1 with B014/B017 on top. B004 (own-reliability 0.90) lifts to the HIGH tier (0.315) above its reliability peers (~0.19-0.25) purely via the upstream-propagation edges (B004<-B003) — the graph tell. data/delay_prediction.csv regenerated from this run (model gnn_v3.0); generator now emits supplier_reliability + shipment_edges.csv.

cafzal commented Jun 29, 2026

Collaborator Author

Verified the from-scratch path end-to-end (TRAIN_GNN=true, seed 42, CSV-backed, GPU): the GNN trains non-degenerate and writes data/delay_prediction.csv (model gnn_v3.0). Per-supplier ranking is differentiated (0.19–0.52) with B003 #1 and B014/B017 on top. The graph tell holds: B004 (own-reliability 0.90) lifts to the HIGH tier (0.315), above its reliability peers (~0.19–0.25), purely via the upstream-propagation edges (B004←B003). Default load-pre-created path unchanged. Ready for review.

cafzal marked this pull request as ready for review June 29, 2026 18:02

Hoist the collections import to the top and split semicolon-joined statements (ruff E402/I001/E702 — the CI lint failure).

Sort the upstream-set iteration before the seeded random.choice: set iteration order is hash-randomized per process, so shipment_edges.csv was non-deterministic across runs despite the fixed SEED. Now generate_corpus.py reproduces a byte-identical corpus + edges every run (corpus/labels were already deterministic — worst_up uses max()).
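The fix described above can be sketched as follows. String hashes are randomized per process unless PYTHONHASHSEED is pinned, so the iteration order of a set of supplier IDs can differ between runs; sorting first gives the seeded RNG a stable sequence to index into. `pick_upstream` and the `SEED` value are illustrative, not the code in generate_corpus.py.

```python
import random

SEED = 42  # illustrative value; generate_corpus.py fixes its own SEED


def pick_upstream(upstream_ids, rng):
    """Choose one upstream supplier reproducibly.

    sorted() removes the dependence on set iteration order, so the same
    seed always selects the same element.
    """
    return rng.choice(sorted(upstream_ids))


first = pick_upstream({"B003", "B014", "B017"}, random.Random(SEED))
second = pick_upstream({"B017", "B003", "B014"}, random.Random(SEED))
```

Here `first` and `second` are equal regardless of how the set was built or which process ran it.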
…easoner chain

Makes the GNN delay predictor a full stage in the multi-reasoner chain instead of a separate dual-mode script plus a data load. The chain is now Stage 0 Reachability -> 1 Graph -> 2 Predictive -> 3 Rules -> 4 Prescriptive; the GNN's per-supplier delay risk (DelayPrediction.predicted_delay_prob) is the hand-off the rules stage consumes.

Combined script: fold the standalone GNN trainer in as Stage 2 (dual-mode -- default loads the bundled predictions, TRAIN_GNN=true retrains on the multi-year corpus in its own Model); renumber Rules->3 and Prescriptive->4; delete supply_chain_resilience_predictive.py. The predictive import is deferred into the train branch so the default chain runs on base relationalai with no GPU stack (matching datacenter_compute_allocation); pyproject stays relationalai==1.11.0.
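A sketch of the deferred-import pattern described above, under the assumption that the GPU-side dependency lives in a separately importable module. The module name `hypothetical_gnn_stack` and the function `load_predictive_backend` are placeholders, not real package or template names.

```python
import importlib


def load_predictive_backend(train_gnn, module_name="hypothetical_gnn_stack"):
    """Import the heavy predictive dependency only on the training path.

    On the default path nothing is imported, so a missing GPU stack
    cannot break the chain.
    """
    if not train_gnn:
        return None
    return importlib.import_module(module_name)
```

Calling it with `train_gnn=False` succeeds even when the placeholder module is not installed; the import error can only surface on the TRAIN_GNN=true branch.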

Runbook + README: add Predictive to the chain ASCII, reasoner overview, reasoning_types, and How-it-works; renumber; reword the single-model claim since the GNN trains on a separate corpus Model. Static-verified (py_compile, ruff); the TRAIN_GNN=true run and end-to-end chain numbers need a live RAI engine to confirm.
…ve reasoning type

Auto-generated gallery indexes (generate_version_indexes.py) reflecting the new reasoning_types + description; the verify CI check enforces they stay in sync.
… the GNN discriminates and the demo holds

Running the chain on the engine surfaced that the bundled GNN predictions clustered near the base rate (all 23 suppliers above the old 0.15 threshold), which flagged 15 watch suppliers and broke the documented chain ($8,545 baseline, Watch->Avoid +3609%).

Sharpened the corpus risk model (raise the eff_unrel cap/amplification, trim the seasonal/lead noise that blurred per-supplier rates) so genuinely-risky suppliers separate (B003 0.76, B014 0.68, B017 0.63; reliable ~0.10); retrained the GNN (val roc_auc ~0.68; predictions now span 0.84 down to <0.13) and set DELAY_PROB_THRESHOLD=0.50. The chain now reproduces the documented $1,865 baseline / S004-offline +88.5% / Watch->Avoid +0.0%, verified end-to-end on the engine (logic + GPU predictive + prescriptive).
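To illustrate why clustered predictions defeat a low threshold, a small sketch with made-up numbers. The supplier IDs and probabilities are hypothetical, the `>=` comparison is an assumption, and the real rules stage may combine the threshold with other conditions; only the 0.15 and 0.50 threshold values come from this PR.

```python
DELAY_PROB_THRESHOLD = 0.50  # value set in this PR
OLD_THRESHOLD = 0.15         # previous value mentioned in this PR


def above_threshold(predictions, threshold=DELAY_PROB_THRESHOLD):
    """Return the sorted supplier IDs whose predicted delay probability meets the threshold."""
    return sorted(s for s, p in predictions.items() if p >= threshold)


# Hypothetical predictions clustered near a base rate: every supplier is flagged.
clustered = {"S_a": 0.24, "S_b": 0.20, "S_c": 0.17}
# Hypothetical well-separated predictions: only the risky supplier is flagged.
spread = {"S_a": 0.84, "S_b": 0.31, "S_c": 0.12}

flagged_before = above_threshold(clustered, OLD_THRESHOLD)
flagged_after = above_threshold(spread)
```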

Also fixed the from-scratch path: gnn.fit() raised [Ambiguous model] because Stage 2 created a second Model; it now trains on the single main model (the portfolio pattern). Reconciled runbook + README to the actual run (rules counts 2->5, threshold, GNN predicted spread).