This folder provides four release-style scripts on GSM8K:
- run_qwen2-0.5b-dual-kl-gsm8k.sh
- run_qwen2-0.5b-dual-kl-dapo-gsm8k.sh
- run_qwen2-0.5b-jackpot-gsm8k.sh
- run_qwen3-0.6b-base-jackpot-dapo-gsm8k.sh
The first two are dual-model joint training.
The last two are single-model Jackpot baselines.
Reference: docs/examples/gsm8k_example.rst
From repository root:
cd examples/data_preprocess
python3 gsm8k.py --local_save_dir ~/data/gsm8k
Expected files:
- ~/data/gsm8k/train.parquet
- ~/data/gsm8k/test.parquet
Each script checks these files and exits with a hint if missing.
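A minimal sketch of what such a pre-flight check could look like. The function name check_gsm8k_files and the wording of the hint are assumptions made for illustration; the released scripts may implement the check differently.

```shell
# Hypothetical pre-flight check (not copied from the release scripts).
# Returns 0 if both parquet files exist under the given directory, 1 otherwise.
check_gsm8k_files() {
    local data_dir="$1"
    local missing=0
    local name
    for name in train.parquet test.parquet; do
        if [ ! -f "${data_dir}/${name}" ]; then
            echo "Missing ${data_dir}/${name}" >&2
            missing=1
        fi
    done
    if [ "${missing}" -ne 0 ]; then
        # Point the user at the preprocessing step described above.
        echo "Hint: run 'python3 gsm8k.py --local_save_dir ${data_dir}' from examples/data_preprocess" >&2
        return 1
    fi
    return 0
}
```

A launcher would call it as check_gsm8k_files "$HOME/data/gsm8k" || exit 1 before starting training.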
From repository root:
bash examples/jackpot_gsm8k_release/run_qwen2-0.5b-dual-kl-gsm8k.sh
bash examples/jackpot_gsm8k_release/run_qwen2-0.5b-dual-kl-dapo-gsm8k.sh
bash examples/jackpot_gsm8k_release/run_qwen2-0.5b-jackpot-gsm8k.sh
bash examples/jackpot_gsm8k_release/run_qwen3-0.6b-base-jackpot-dapo-gsm8k.sh
All scripts pass extra CLI overrides through "$@", so you can do:
bash examples/jackpot_gsm8k_release/run_qwen2-0.5b-jackpot-gsm8k.sh \
trainer.logger=[console] \
trainer.total_epochs=2 \
actor_rollout_ref.actor.optim.lr=5e-7
These are the key Jackpot-related fields used in all scripts:
- actor_rollout_ref.actor.use_jackpot
  Turns Jackpot correction on or off.
- actor_rollout_ref.actor.jackpot_use_latest_logits
  Uses current policy logits when computing Jackpot overlap/weights. True usually gives tighter alignment to the model actually being updated.
- actor_rollout_ref.actor.jackpot_log_probs_to_keep
  Top-k width used by Jackpot overlap approximation. Larger k gives better overlap approximation but higher memory/compute.
- actor_rollout_ref.actor.jackpot_lambda
  Acceptance-ratio scaling factor in Jackpot correction. Increasing it makes correction stricter (fewer accepted/carrying tokens).
- actor_rollout_ref.actor.jackpot_clip_ratio
  Upper cap on Jackpot importance weights for stability. Lower cap is more conservative; higher cap is less biased but can be noisier.
- actor_rollout_ref.actor.jackpot_use_topk_renorm
  Renormalizes overlap mass in top-k space. Keep this True in most runs unless you intentionally study this ablation.
Jackpot also depends on rollout-side log-prob collection:
- actor_rollout_ref.rollout.calculate_log_probs=True
- actor_rollout_ref.rollout.log_probs_to_keep=<same top-k as actor>
If these rollout settings are disabled or mismatched, Jackpot correction is not properly supported.
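One way to keep the actor and rollout top-k widths from drifting apart is to derive both overrides from a single value. The sketch below assumes a hypothetical helper named jackpot_overrides; the width 20 in the usage comment is a placeholder, not a recommended setting.

```shell
# Hypothetical helper: prints one override per line, using the same top-k
# width for the actor-side and rollout-side fields.
jackpot_overrides() {
    local topk="$1"
    printf '%s\n' \
        "actor_rollout_ref.actor.use_jackpot=True" \
        "actor_rollout_ref.actor.jackpot_log_probs_to_keep=${topk}" \
        "actor_rollout_ref.rollout.calculate_log_probs=True" \
        "actor_rollout_ref.rollout.log_probs_to_keep=${topk}"
}

# Example usage (commented out so that sourcing this file launches nothing).
# The overrides contain no spaces, so unquoted expansion splits them safely:
#   bash examples/jackpot_gsm8k_release/run_qwen2-0.5b-jackpot-gsm8k.sh $(jackpot_overrides 20)
```

Because the scripts forward "$@" to the trainer, the printed overrides reach it in the same way as the manual overrides shown earlier.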
- run_qwen2-0.5b-dual-kl-gsm8k.sh
  GRPO trainer, dual namespace (large+small), pairwise KL coupling.
- run_qwen2-0.5b-dual-kl-dapo-gsm8k.sh
  DAPO trainer variant of the same dual-namespace joint setup.
- run_qwen2-0.5b-jackpot-gsm8k.sh
  Single-model GRPO baseline with Jackpot only.
- run_qwen3-0.6b-base-jackpot-dapo-gsm8k.sh
  Single-model DAPO baseline with Jackpot only.
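The four scripts form a two-by-two grid: setup (dual or single) by trainer (GRPO or DAPO). As a rough sketch, a small selector can make that mapping explicit. The function name pick_script and the keys dual, single, grpo, and dapo are hypothetical labels chosen for this example; they are not options of the scripts themselves.

```shell
# Hypothetical selector: maps (setup, trainer) to one of the four script names.
pick_script() {
    case "$1-$2" in
        dual-grpo)   echo "run_qwen2-0.5b-dual-kl-gsm8k.sh" ;;
        dual-dapo)   echo "run_qwen2-0.5b-dual-kl-dapo-gsm8k.sh" ;;
        single-grpo) echo "run_qwen2-0.5b-jackpot-gsm8k.sh" ;;
        single-dapo) echo "run_qwen3-0.6b-base-jackpot-dapo-gsm8k.sh" ;;
        *)
            # Unknown combination: report on stderr and signal failure.
            echo "unknown combination: $1 $2" >&2
            return 1
            ;;
    esac
}
```

From the repository root, a launch could then look like bash "examples/jackpot_gsm8k_release/$(pick_script single dapo)".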