Jackpot GSM8K Release Examples

This folder provides four release-style scripts on GSM8K:

  1. run_qwen2-0.5b-dual-kl-gsm8k.sh
  2. run_qwen2-0.5b-dual-kl-dapo-gsm8k.sh
  3. run_qwen2-0.5b-jackpot-gsm8k.sh
  4. run_qwen3-0.6b-base-jackpot-dapo-gsm8k.sh

The first two scripts run dual-model joint training.
The last two are single-model Jackpot baselines.

1) Prepare GSM8K data

Reference: docs/examples/gsm8k_example.rst

From repository root:

cd examples/data_preprocess
python3 gsm8k.py --local_save_dir ~/data/gsm8k

Expected files:

  • ~/data/gsm8k/train.parquet
  • ~/data/gsm8k/test.parquet

Each script checks for these files and exits with a hint if either one is missing.
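
The check can be pictured roughly as follows. This is a minimal sketch, not the code from the release scripts: the function name check_gsm8k_data and the wording of the hint are assumptions made for illustration.

```shell
# Hypothetical helper: the release scripts may implement this check differently.
check_gsm8k_data() {
  local data_dir="${1:-$HOME/data/gsm8k}"
  local missing=0
  local f
  for f in train.parquet test.parquet; do
    if [ ! -f "$data_dir/$f" ]; then
      echo "Missing $data_dir/$f" >&2
      missing=1
    fi
  done
  if [ "$missing" -ne 0 ]; then
    # Point the user back to the preprocessing step from section 1.
    echo "Hint: cd examples/data_preprocess && python3 gsm8k.py --local_save_dir $data_dir" >&2
    return 1
  fi
}
```

A script could call it as check_gsm8k_data ~/data/gsm8k || exit 1 before launching training.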

2) Run examples

From repository root:

bash examples/jackpot_gsm8k_release/run_qwen2-0.5b-dual-kl-gsm8k.sh
bash examples/jackpot_gsm8k_release/run_qwen2-0.5b-dual-kl-dapo-gsm8k.sh
bash examples/jackpot_gsm8k_release/run_qwen2-0.5b-jackpot-gsm8k.sh
bash examples/jackpot_gsm8k_release/run_qwen3-0.6b-base-jackpot-dapo-gsm8k.sh

All scripts pass extra CLI overrides through "$@", so you can do:

bash examples/jackpot_gsm8k_release/run_qwen2-0.5b-jackpot-gsm8k.sh \
  trainer.logger='[console]' \
  trainer.total_epochs=2 \
  actor_rollout_ref.actor.optim.lr=5e-7
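
To illustrate what "$@" forwarding does, here is a small stand-in that prints the arguments a trainer would receive instead of launching one. The function name forward_overrides and the default trainer.total_epochs=1 are assumptions for illustration and are not taken from the release scripts.

```shell
# Hypothetical stand-in for a release script: it prints the final argument
# list instead of launching a trainer.
forward_overrides() {
  local arg
  # Illustrative default first, then the caller's overrides appended via "$@".
  set -- trainer.total_epochs=1 "$@"
  # Quoted "$@" keeps every argument as one word, with no splitting or globbing.
  for arg in "$@"; do
    printf '%s\n' "$arg"
  done
}
```

Calling forward_overrides 'trainer.logger=[console]' prints the default followed by the override, each on its own line. Quoting values that contain brackets is a good habit, since some shells (zsh, for example) treat an unquoted [console] as a glob pattern.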

3) What Jackpot arguments mean

These are the key Jackpot-related fields used in all scripts:

  • actor_rollout_ref.actor.use_jackpot: Turns Jackpot correction on or off.

  • actor_rollout_ref.actor.jackpot_use_latest_logits: Uses the current policy's logits when computing Jackpot overlap and weights. Setting it to True usually aligns the correction more closely with the model actually being updated.

  • actor_rollout_ref.actor.jackpot_log_probs_to_keep: Top-k width used by the Jackpot overlap approximation.
    A larger k approximates the overlap better but costs more memory and compute.

  • actor_rollout_ref.actor.jackpot_lambda: Acceptance-ratio scaling factor in the Jackpot correction.
    Increasing it makes the correction stricter (fewer accepted/carrying tokens).

  • actor_rollout_ref.actor.jackpot_clip_ratio: Upper cap on Jackpot importance weights, for stability.
    A lower cap is more conservative; a higher cap is less biased but can be noisier.

  • actor_rollout_ref.actor.jackpot_use_topk_renorm: Renormalizes overlap mass in top-k space.
    Keep this True in most runs unless you are deliberately running this ablation.
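
For the top-k renormalization ablation, the CLI override mechanism from section 2 can be combined with a small dry-run helper. The sketch below only prints the two commands. The function name renorm_ablation_cmds is hypothetical, and it assumes the field accepts the literals True and False as overrides.

```shell
# Hypothetical dry-run helper: prints one command per ablation arm.
# Assumes jackpot_use_topk_renorm accepts True/False as a CLI override.
renorm_ablation_cmds() {
  local script="examples/jackpot_gsm8k_release/run_qwen2-0.5b-jackpot-gsm8k.sh"
  local value
  for value in True False; do
    echo "bash $script actor_rollout_ref.actor.jackpot_use_topk_renorm=$value"
  done
}
```

Review the printed commands, then run them from the repository root one at a time.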

Jackpot also depends on rollout-side log-prob collection:

  • actor_rollout_ref.rollout.calculate_log_probs=True
  • actor_rollout_ref.rollout.log_probs_to_keep=<same top-k as actor>

If rollout log-prob collection is disabled, or the rollout top-k does not match the actor's, Jackpot correction is not properly supported.
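
One way to keep the actor and rollout top-k values from drifting apart is to derive both overrides from a single number. The helper below is a sketch under that assumption. The name jackpot_topk_overrides is hypothetical, and the value passed to it is up to you.

```shell
# Hypothetical helper: emit matching actor and rollout overrides from one top-k value.
jackpot_topk_overrides() {
  local topk="$1"
  # Reject empty or non-numeric input so a typo cannot produce mismatched settings.
  case "$topk" in
    ''|*[!0-9]*) echo "top-k must be a positive integer" >&2; return 1 ;;
  esac
  printf '%s\n' \
    "actor_rollout_ref.actor.use_jackpot=True" \
    "actor_rollout_ref.actor.jackpot_log_probs_to_keep=${topk}" \
    "actor_rollout_ref.rollout.calculate_log_probs=True" \
    "actor_rollout_ref.rollout.log_probs_to_keep=${topk}"
}
```

The output can be appended to any of the release scripts, for example bash examples/jackpot_gsm8k_release/run_qwen2-0.5b-jackpot-gsm8k.sh $(jackpot_topk_overrides 20), where 20 is an illustrative value rather than a recommended one.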

4) Script roles

  • run_qwen2-0.5b-dual-kl-gsm8k.sh: GRPO trainer, dual namespace (large + small), pairwise KL coupling.

  • run_qwen2-0.5b-dual-kl-dapo-gsm8k.sh: DAPO trainer variant of the same dual-namespace joint setup.

  • run_qwen2-0.5b-jackpot-gsm8k.sh: Single-model GRPO baseline with Jackpot only.

  • run_qwen3-0.6b-base-jackpot-dapo-gsm8k.sh: Single-model DAPO baseline with Jackpot only.