feat(policies): add autoregressive VLAs with tokenization PiFast #2734
Conversation
Pull request overview
This PR introduces autoregressive Vision-Language-Action (VLA) models to LeRobot, implementing PiFast alongside the existing flow-matching policies. Unlike flow matching, which predicts actions in parallel over a horizon, this implementation models actions sequentially as discrete tokens using the FAST (Frequency-space Action Sequence Tokenization) tokenizer. The PR provides a complete reference implementation, including the model architecture, training scripts, and processor pipelines.
Key Changes:
- Implements PI0Fast policy with autoregressive action token prediction using cross-entropy loss
- Adds FAST tokenizer integration for converting continuous actions to discrete tokens via DCT coefficients and BPE
- Introduces custom attention masking patterns supporting bidirectional attention for images/language and causal attention for action tokens
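As a rough sketch of the FAST-style pipeline mentioned above (DCT over an action chunk, quantization of the coefficients, then BPE over the resulting symbols), here is an illustrative implementation of the first two stages. The function names and the quantization scale are invented for this example and the BPE stage is elided; this is not LeRobot's actual code.

```python
import numpy as np

def dct_ii(x: np.ndarray) -> np.ndarray:
    """Orthonormal DCT-II along the time axis (first axis)."""
    n = x.shape[0]
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    basis = np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    scale = np.full((n, 1), np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return (scale * basis) @ x

def tokenize_chunk(actions: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """Map a (horizon, action_dim) chunk to integer symbols.

    The DCT compacts a smooth chunk's energy into a few low
    frequencies; rounding small coefficients to zero produces long
    zero runs that the BPE stage (not shown) compresses well.
    """
    coeffs = dct_ii(actions)  # frequency-domain coefficients
    return np.round(coeffs * scale).astype(np.int64).ravel()

# Smooth 50-step, 7-dof action chunk as a toy input.
chunk = np.linspace(0.0, 1.0, 50 * 7).reshape(50, 7)
symbols = tokenize_chunk(chunk)
sparsity = float((symbols == 0).mean())  # fraction of zero symbols
```

Because the DCT concentrates a smooth chunk's energy in a few low frequencies, most quantized symbols come out zero, which is what makes the subsequent BPE compression effective.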
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| src/lerobot/utils/constants.py | Adds constants for action tokens and token masks |
| src/lerobot/processor/tokenizer_processor.py | Implements ActionTokenizerProcessorStep for tokenizing actions using FAST with PaliGemma token space conversion |
| src/lerobot/processor/__init__.py | Exports ActionTokenizerProcessorStep for use in pipelines |
| src/lerobot/policies/pi0_fast/train_fast_tokenizer.py | Provides a training script for the FAST tokenizer with delta transforms, normalization, and compression statistics |
| src/lerobot/policies/pi0_fast/processor_pi0_fast.py | Creates pre/post-processor pipelines, including state discretization and language tokenization |
| src/lerobot/policies/pi0_fast/modeling_pi0_fast.py | Implements the core PI0FastPytorch model with the PaliGemma+Gemma expert architecture and autoregressive decoding |
| src/lerobot/policies/pi0_fast/configuration_pi0_fast.py | Defines PI0FastConfig with model hyperparameters and training settings |
| src/lerobot/policies/pi0_fast/__init__.py | Exports PI0Fast components for module access |
| src/lerobot/policies/factory.py | Registers PI0FastPolicy in the policy factory |
| src/lerobot/policies/__init__.py | Exports PI0FastConfig at the package level |
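The custom masking described in the overview (bidirectional attention over image/language tokens, causal attention over action tokens) can be sketched as a boolean mask builder. This is an illustrative reconstruction with invented names, not the code from modeling_pi0_fast.py:

```python
import numpy as np

def build_prefix_causal_mask(prefix_len: int, action_len: int) -> np.ndarray:
    """Boolean (L, L) mask: entry [i, j] is True if position i may attend to j.

    Image and language tokens form a fully bidirectional prefix; action
    tokens attend to the whole prefix plus earlier action tokens only,
    so action decoding stays autoregressive. Prefix positions never see
    action positions (those entries are simply left False).
    """
    total = prefix_len + action_len
    mask = np.zeros((total, total), dtype=bool)
    mask[:, :prefix_len] = True  # every position sees the full prefix
    rows = np.arange(total)[:, None]
    cols = np.arange(total)[None, :]
    mask |= (cols >= prefix_len) & (cols <= rows)  # causal over the suffix
    return mask
```

With 3 prefix tokens and 4 action tokens, the top-left 3x3 block is all True (bidirectional), the top-right 3x4 block is all False, and the bottom-right 4x4 block is lower-triangular (causal).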
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
```python
        padding="max_length",
    ),
    ActionTokenizerProcessorStep(
        tokenizer_name="/fsx/jade_choghari/outputs/fast_tokenizer",  # TODO: jade put the PI
```
**Copilot AI** (Dec 30, 2025):
This line hardcodes what appears to be a personal development path. It should be replaced before merging with a configurable parameter, for example a tokenizer path passed in through the policy config.
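One way to address this, sketched below with a hypothetical config field (not LeRobot's actual API), is to thread the tokenizer path through a dataclass config and hand it to the processor step as a constructor argument. The default shown assumes the public `physical-intelligence/fast` Hugging Face repo is an acceptable fallback:

```python
from dataclasses import dataclass

@dataclass
class PI0FastProcessorConfig:
    # Hypothetical field; in the PR this would live on PI0FastConfig.
    tokenizer_name: str = "physical-intelligence/fast"

def make_action_tokenizer_step(config: PI0FastProcessorConfig) -> dict:
    """Build the tokenizer step from config instead of a hardcoded path."""
    # Dict stands in for ActionTokenizerProcessorStep(tokenizer_name=...).
    return {"step": "ActionTokenizerProcessorStep",
            "tokenizer_name": config.tokenizer_name}

default_step = make_action_tokenizer_step(PI0FastProcessorConfig())
custom_step = make_action_tokenizer_step(
    PI0FastProcessorConfig(tokenizer_name="/tmp/my_tokenizer")
)
```

This keeps personal paths out of the source tree while letting local experiments override the tokenizer location without code changes.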
```python
        padding="max_length",
    ),
    ActionTokenizerProcessorStep(
        tokenizer_name="/fsx/jade_choghari/outputs/fast_tokenizer",  # TODO: jade put the PI
```
**Copilot AI** (Dec 30, 2025):
The TODO carrying the author's name indicates that the tokenizer path configuration is incomplete. The hardcoded path should be replaced with a proper configuration parameter passed to ActionTokenizerProcessorStep.
```python
# # Optionally visualize the attention mask
# self.visualize_attention_mask(
#     att_mask_segments=att_mask_segments,
#     att_2d_masks=att_masks,
#     save_path="/admin/home/jade_choghari/lerobot/src/lerobot/policies/pi05/attention_mask_visualization.png",
#     batch_idx=0,
#     max_display_tokens=512  # Limit display for very long sequences
# )
```
**Copilot AI** (Dec 30, 2025):
There's commented-out visualization code that should be removed or properly implemented. If attention mask visualization is needed for debugging, it should be controlled by a configuration parameter rather than left as commented code.
Suggested change: delete the commented-out visualization block.
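If the visualization is worth keeping, a common pattern is to gate it behind a config flag instead of leaving it commented out. The sketch below uses invented names (`DebugConfig`, `visualize_attention`, `forward_step`) that are not from the PR:

```python
from dataclasses import dataclass

@dataclass
class DebugConfig:
    visualize_attention_mask: bool = False  # hypothetical flag, off by default
    attention_viz_path: str = "attention_mask.png"

calls = []  # records visualizer invocations for this demo

def visualize_attention(mask, save_path):
    calls.append(save_path)  # stand-in for the real plotting code

def forward_step(mask, config: DebugConfig):
    # Debug-only work runs solely when explicitly enabled in config.
    if config.visualize_attention_mask:
        visualize_attention(mask, config.attention_viz_path)
    return mask

forward_step([[True]], DebugConfig())                               # no viz
forward_step([[True]], DebugConfig(visualize_attention_mask=True))  # viz once
```

The debugging capability survives in version control, but the default training path pays no cost and the source stays clean.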
```python
)
# Detokenize action tokens to continuous actions
action_horizon = self.config.n_action_steps
action_dim = 7
```
**Copilot AI** (Dec 30, 2025):
The hardcoded action dimension value of 7 should be made configurable. This magic number limits the flexibility of the model and should be replaced with a configuration parameter, possibly using self.config.max_action_dim or a similar configurable value.
Suggested change:
```diff
-action_dim = 7
+action_dim = getattr(self.config, "max_action_dim", 7)
```
| """ | ||
| Inefficient but safe autoregressive decoding for FAST tokens. | ||
| Matches the pattern of _generate_subtask_tokens. | ||
| TODO: jadechoghari, should we move this logic to PI0FastPolicy class? |
**Copilot AI** (Dec 30, 2025):
The TODO indicates an unresolved design question: whether this decoding logic belongs in the PI0FastPolicy class. This should be settled before merging: either move the logic or confirm the current location is correct and remove the TODO.
Suggested change:
```diff
-TODO: jadechoghari, should we move this logic to PI0FastPolicy class?
```
```python
tokenizer_max_length: int = 200  # see openpi `__post_init__`
```
**Copilot AI** (Dec 30, 2025):
The configuration defines the field 'tokenizer_max_length' twice (at line 65 and line 95) with the same default value. The duplicate definition should be removed.
Suggested change:
```diff
-tokenizer_max_length: int = 200  # see openpi `__post_init__`
```
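Duplicate dataclass fields are easy to miss because Python accepts them silently: the later annotation simply replaces the earlier one, and the later default wins. A small demonstration (the differing defaults are contrived here to make the override visible; in the PR both defaults happen to be 200, which hides the bug):

```python
from dataclasses import dataclass, fields

@dataclass
class Config:
    tokenizer_max_length: int = 200  # first definition
    other: str = "x"
    tokenizer_max_length: int = 100  # duplicate: silently replaces the first

cfg = Config()
n_fields = len(fields(Config))  # the duplicate collapses into one field
```

Since the dataclass only ever sees one field, a future edit to the first definition would be a no-op, which is exactly why the duplicate should go.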
```python
# # Apply dtype conversion to FAST layers to match model precision
# if config.dtype == "bfloat16":
#     self.fast_action_embedding = self.fast_action_embedding.to(dtype=torch.bfloat16)
#     self.fast_action_lm_head = self.fast_action_lm_head.to(dtype=torch.bfloat16)
# elif config.dtype == "float32":
#     self.fast_action_embedding = self.fast_action_embedding.to(dtype=torch.float32)
#     self.fast_action_lm_head = self.fast_action_lm_head.to(dtype=torch.float32)
```
**Copilot AI** (Dec 30, 2025):
There's commented-out code that should either be removed or properly implemented before merging. This appears to be related to FAST layer dtype conversion. If this functionality is not needed, it should be removed to keep the codebase clean.
Suggested change: delete the commented-out dtype-conversion block.
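If the conversion turns out to be needed, the if/elif chain can collapse into a string-to-dtype lookup. The sketch below uses NumPy dtypes to stay self-contained; in the PR the same idea would use torch dtypes (e.g. `getattr(torch, config.dtype)`) and `.to(dtype=...)` on the modules. Names here are illustrative:

```python
import numpy as np

# Hypothetical mapping from config strings to dtypes.
DTYPES = {"float32": np.float32, "float16": np.float16, "float64": np.float64}

def convert_layers(layers: dict, dtype_name: str) -> dict:
    """Cast every layer's weights to the configured precision."""
    dtype = DTYPES[dtype_name]  # raises KeyError on unsupported values
    return {name: w.astype(dtype) for name, w in layers.items()}

layers = {
    "fast_action_embedding": np.zeros((4, 4)),
    "fast_action_lm_head": np.zeros((4, 4)),
}
converted = convert_layers(layers, "float16")
```

A lookup keeps the supported precisions in one place and fails loudly on a typo, unlike an if/elif chain that silently does nothing for an unknown string.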
```python
# from transformers import AutoTokenizer
# self._paligemma_tokenizer = AutoTokenizer.from_pretrained(
#     "google/paligemma-3b-pt-224",
#     trust_remote_code=True,
#     add_eos_token=True,
#     add_bos_token=False
# )
# # remove
# decoded_tokens = [
#     self._paligemma_tokenizer.convert_ids_to_tokens(seq.tolist())
#     for seq in fast_targets
# ]
# corrected_tokens = [
#     self._paligemma_tokenizer.convert_ids_to_tokens(seq.tolist())
#     for seq in fast_logits_for_pred.argmax(dim=-1)
# ]
# breakpoint()
```
**Copilot AI** (Dec 30, 2025):
There's a large block of commented-out debugging code that should be removed before merging. Commented code like this makes the codebase harder to maintain and should be deleted or moved to proper debug utilities if needed.
Suggested change: delete the commented-out debugging block, including the `breakpoint()` call.
```python
# Get optional parameters
temperature = kwargs.get("temperature", 0.0)
max_decoding_steps = 256
```
**Copilot AI** (Dec 30, 2025):
The hardcoded value of 256 for max_decoding_steps should be made configurable or derived from the configuration. This should use self.config.max_action_tokens or a similar configuration parameter instead of a magic number.
Suggested change:
```diff
-max_decoding_steps = 256
+max_decoding_steps = getattr(self.config, "max_action_tokens", 256) or 256
```
```python
if tasks is None:
    raise ValueError("No task found in complementary data")

# TODO: check if this necessary
```
**Copilot AI** (Dec 30, 2025):
The comment has a typo: 'check if this necessary' is missing the word 'is' and should read 'check if this is necessary'.
Suggested change:
```diff
-# TODO: check if this necessary
+# TODO: check if this is necessary
```
Pull request description
This PR brings autoregressive Vision-Language-Action (VLA) models back to LeRobot, alongside the existing flow-matching–based policies.
Unlike flow matching, which predicts actions in parallel over a horizon, autoregressive VLAs model actions sequentially as discrete tokens.
As a first step toward supporting multiple action tokenizers, this PR introduces PiFast together with a training script for FAST tokenization, providing a concrete reference implementation for autoregressive action modeling in LeRobot.
Future work will extend this framework to additional tokenizers and autoregressive variants.
TODO:
1. Support KV-caching for faster inference (a must for this PR): https://mett29.github.io/posts/kv-cache/
2. Provide PiFast pretrained checkpoints and unveil HF LeRobot's new AR VLA work.
3. Add testing and docs.
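On the KV-caching TODO: the core idea is that at decoding step t, the keys and values for the earlier tokens were already computed, so only the new token's key/value pair needs to be appended before attending with the newest query. A minimal single-head NumPy sketch (illustrative only, not the PR's planned implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention; q has shape (1, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 8
wk, wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))
tokens = rng.normal(size=(5, d))  # embeddings of 5 decoded tokens

# Incremental decoding: append each new token's K/V to the cache and
# attend with only the newest query. Per-step cost is O(t*d) instead of
# recomputing all keys and values from scratch at every step.
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
cached_outputs = []
for t in range(5):
    x = tokens[t:t + 1]
    k_cache = np.vstack([k_cache, x @ wk])
    v_cache = np.vstack([v_cache, x @ wv])
    cached_outputs.append(attention(x, k_cache, v_cache))

# Reference: recompute all K/V from scratch for the final step.
full = attention(tokens[4:5], tokens @ wk, tokens @ wv)
```

The cached result for the last step matches the full recomputation exactly, which is why KV caching is a pure speed win for autoregressive decoding.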
DONE:
1. Trained and evaluated successfully on LIBERO; we will share the checkpoints along with the results.