Jackpot GSM8K Release Examples

This folder provides four release-style scripts on GSM8K:

run_qwen2-0.5b-dual-kl-gsm8k.sh
run_qwen2-0.5b-dual-kl-dapo-gsm8k.sh
run_qwen2-0.5b-jackpot-gsm8k.sh
run_qwen3-0.6b-base-jackpot-dapo-gsm8k.sh

The first two scripts run dual-model joint training.
The last two are single-model Jackpot baselines.

1) Prepare GSM8K data

Reference: docs/examples/gsm8k_example.rst

From repository root:

cd examples/data_preprocess
python3 gsm8k.py --local_save_dir ~/data/gsm8k

Expected files:

~/data/gsm8k/train.parquet
~/data/gsm8k/test.parquet

Each script checks for these files and exits with a hint if either is missing.
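
To confirm the data is in place before launching a long run, a check along the following lines may help. This is only a sketch: the function name check_gsm8k_data and its messages are hypothetical, and the check built into the release scripts may differ.

```shell
# Hypothetical helper (not part of the release scripts): verify that both
# GSM8K parquet files exist in the given directory (default: ~/data/gsm8k).
check_gsm8k_data() {
  gsm8k_dir="${1:-$HOME/data/gsm8k}"
  gsm8k_missing=0
  for gsm8k_file in train.parquet test.parquet; do
    if [ ! -f "$gsm8k_dir/$gsm8k_file" ]; then
      echo "Missing: $gsm8k_dir/$gsm8k_file" >&2
      gsm8k_missing=1
    fi
  done
  if [ "$gsm8k_missing" -ne 0 ]; then
    echo "Hint: run gsm8k.py in examples/data_preprocess with --local_save_dir $gsm8k_dir" >&2
  fi
  # The function's exit status is 0 only when both files were found.
  [ "$gsm8k_missing" -eq 0 ]
}
```

Calling `check_gsm8k_data ~/data/gsm8k || exit 1` at the top of a wrapper script stops early when either file is absent.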

2) Run examples

From repository root:

bash examples/jackpot_gsm8k_release/run_qwen2-0.5b-dual-kl-gsm8k.sh
bash examples/jackpot_gsm8k_release/run_qwen2-0.5b-dual-kl-dapo-gsm8k.sh
bash examples/jackpot_gsm8k_release/run_qwen2-0.5b-jackpot-gsm8k.sh
bash examples/jackpot_gsm8k_release/run_qwen3-0.6b-base-jackpot-dapo-gsm8k.sh

All scripts pass extra CLI overrides through "$@", so you can do:

bash examples/jackpot_gsm8k_release/run_qwen2-0.5b-jackpot-gsm8k.sh \
  trainer.logger=[console] \
  trainer.total_epochs=2 \
  actor_rollout_ref.actor.optim.lr=5e-7
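
The pass-through can be illustrated with a small stand-in. The function forward_overrides below is hypothetical and only prints what it receives; the real scripts hand "$@" to the trainer instead. It shows that each override arrives as one intact argument. Quoting values that contain brackets, such as trainer.logger=[console], is a safe habit because some shells (zsh, for example) treat unquoted brackets as a glob pattern.

```shell
# Hypothetical stand-in for a release script: instead of launching the
# trainer, print every forwarded override on its own line.
forward_overrides() {
  # "$@" expands to the original arguments, one word per argument,
  # so values with brackets or spaces are not split or re-globbed.
  printf '%s\n' "$@"
}

forward_overrides \
  'trainer.logger=[console]' \
  trainer.total_epochs=2 \
  actor_rollout_ref.actor.optim.lr=5e-7
```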

3) What Jackpot arguments mean

These are the key Jackpot-related fields used in all scripts:

actor_rollout_ref.actor.use_jackpot
  Turns Jackpot correction on or off.

actor_rollout_ref.actor.jackpot_use_latest_logits
  Uses current policy logits when computing Jackpot overlap/weights.
  True usually gives tighter alignment to the model actually being updated.

actor_rollout_ref.actor.jackpot_log_probs_to_keep
  Top-k width used by Jackpot overlap approximation.
  Larger k gives better overlap approximation but higher memory/compute.

actor_rollout_ref.actor.jackpot_lambda
  Acceptance-ratio scaling factor in Jackpot correction.
  Increasing it makes correction stricter (fewer accepted/carrying tokens).

actor_rollout_ref.actor.jackpot_clip_ratio
  Upper cap on Jackpot importance weights for stability.
  Lower cap is more conservative; higher cap is less biased but can be noisier.

actor_rollout_ref.actor.jackpot_use_topk_renorm
  Renormalizes overlap mass in top-k space.
  Keep this True in most runs unless you intentionally study this ablation.

Jackpot also depends on rollout-side log-prob collection:

actor_rollout_ref.rollout.calculate_log_probs=True
actor_rollout_ref.rollout.log_probs_to_keep=<same top-k as actor>

If these rollout settings are disabled, or if the rollout top-k does not match the actor's jackpot_log_probs_to_keep, Jackpot correction is not properly supported.
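
One way to keep the actor and rollout top-k values in sync is to derive both overrides from a single shell variable. The sketch below assumes the fields accept the values shown; TOPK=20 is an illustrative number, not a recommended or default setting, and the release scripts may already set these fields themselves.

```shell
# Illustrative value only; choose k to fit your memory/compute budget.
TOPK=20

# Build the override list once so the actor and rollout top-k cannot drift apart.
set -- \
  actor_rollout_ref.actor.use_jackpot=True \
  actor_rollout_ref.actor.jackpot_log_probs_to_keep="$TOPK" \
  actor_rollout_ref.rollout.calculate_log_probs=True \
  actor_rollout_ref.rollout.log_probs_to_keep="$TOPK"

# Show the overrides that would be forwarded, one per line.
printf '%s\n' "$@"

# To launch with them, from repository root:
# bash examples/jackpot_gsm8k_release/run_qwen2-0.5b-jackpot-gsm8k.sh "$@"
```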

4) Script roles

run_qwen2-0.5b-dual-kl-gsm8k.sh
  GRPO trainer, dual namespace (large + small), pairwise KL coupling.

run_qwen2-0.5b-dual-kl-dapo-gsm8k.sh
  DAPO trainer variant of the same dual-namespace joint setup.

run_qwen2-0.5b-jackpot-gsm8k.sh
  Single-model GRPO baseline with Jackpot only.

run_qwen3-0.6b-base-jackpot-dapo-gsm8k.sh
  Single-model DAPO baseline with Jackpot only.
