Model Training
1. ACT
Below are two training examples for Task 1 (Part Sorting) using the ACT policy: a minimal command, followed by a detailed command with full hyperparameters.
# Minimal training command
/isaac-sim/python.sh src/lerobot/scripts/lerobot_train.py \
--dataset.repo_id=your_org/your_dataset \
--dataset.video_backend=pyav \
--policy.type=act \
--output_dir=challenge2026_baseline/Part_Sorting/act \
--dataset.root=datasets/Part_Sorting/ \
--job_name=part_sorting_act \
--policy.device=cuda \
--wandb.enable=false \
--policy.repo_id=none \
--policy.push_to_hub=false
# Detailed training command (task1)
/isaac-sim/python.sh src/lerobot/scripts/lerobot_train.py \
--dataset.repo_id=your_org/your_dataset \
--dataset.root=datasets/Part_Sorting/ \
--dataset.video_backend=pyav \
--policy.type=act \
--policy.n_obs_steps=1 \
--policy.chunk_size=50 \
--policy.n_action_steps=50 \
--policy.vision_backbone=resnet18 \
--policy.pretrained_backbone_weights=ResNet18_Weights.IMAGENET1K_V1 \
--policy.dim_model=256 \
--policy.n_heads=4 \
--policy.dim_feedforward=1024 \
--policy.n_encoder_layers=4 \
--policy.n_decoder_layers=1 \
--policy.use_vae=true \
--policy.latent_dim=32 \
--policy.n_vae_encoder_layers=4 \
--policy.dropout=0.1 \
--policy.kl_weight=10.0 \
--policy.optimizer_lr=1e-5 \
--policy.optimizer_weight_decay=1e-4 \
--policy.optimizer_lr_backbone=1e-5 \
--policy.device=cuda \
--policy.use_amp=true \
--policy.push_to_hub=false \
--output_dir=challenge2026_baseline/Part_Sorting/act \
--job_name=part_sorting_act \
--resume=false \
--seed=1000 \
--num_workers=8 \
--batch_size=8 \
--steps=100000 \
--eval_freq=0 \
--log_freq=200 \
--save_checkpoint=true \
--save_freq=5000 \
--wandb.entity=your_wandb_entity
Replace `your_org/your_dataset` with your own dataset repo ID, replace `challenge2026_baseline/Part_Sorting/act` with your own output path, and replace `your_wandb_entity` with your WandB username or team name. If you don't use WandB, you can remove the `--wandb.entity` argument. Training ACT will download `resnet18-f37072fd.pth` (the pretrained ResNet-18 backbone weights).
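
Long flag lists like the ones above are easier to keep consistent when they are generated from a dictionary. The following is a minimal sketch, not part of the baseline repository: `build_train_command` is a hypothetical helper name, and it only assembles the argument list shown in the minimal ACT command without launching training.

```python
import shlex


def build_train_command(overrides,
                        python="/isaac-sim/python.sh",
                        script="src/lerobot/scripts/lerobot_train.py"):
    """Return the training command as an argument list (hypothetical helper)."""
    args = [python, script]
    for key, value in overrides.items():
        if isinstance(value, bool):
            # The commands in this document spell booleans in lowercase.
            value = "true" if value else "false"
        args.append(f"--{key}={value}")
    return args


act_minimal = {
    "dataset.repo_id": "your_org/your_dataset",
    "dataset.root": "datasets/Part_Sorting/",
    "dataset.video_backend": "pyav",
    "policy.type": "act",
    "policy.device": "cuda",
    "policy.push_to_hub": False,
    "output_dir": "challenge2026_baseline/Part_Sorting/act",
    "job_name": "part_sorting_act",
    "wandb.enable": False,
}

command = build_train_command(act_minimal)
# Print a copy-pasteable, shell-quoted command line.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` if you prefer to launch training from Python.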
Dataset & output arguments
| Argument | Description | Default / Notes |
|---|---|---|
| dataset.repo_id | Dataset ID (Hugging Face or local org name) | Required |
| dataset.root | Local dataset root path | Required |
| output_dir | Directory to save checkpoints and logs | Required |
| job_name | Run identifier (shown in logs / WandB) | Optional |
| resume | Resume training from the last checkpoint | false |
| seed | Global random seed | 1000 |
Training loop arguments
| Argument | Description | Default / Notes |
|---|---|---|
| steps | Total training steps | 100000 |
| batch_size | Batch size | 8 |
| num_workers | DataLoader worker processes | 8 |
| eval_freq | Evaluation interval (0 disables) | 0 |
| log_freq | Log print interval | 200 |
| save_checkpoint | Whether to save checkpoints | true |
| save_freq | Checkpoint save interval (steps) | 5000 |
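
To get a feel for what the training-loop values imply, the sketch below derives a few totals from them. It assumes one checkpoint is written every `save_freq` steps and none at step 0; the function name `training_budget` is illustrative only.

```python
def training_budget(steps, batch_size, save_freq, log_freq):
    """Derive simple run totals from the training-loop arguments."""
    return {
        # Number of training samples drawn over the whole run.
        "samples_seen": steps * batch_size,
        # Assumption: one checkpoint every save_freq steps, none at step 0.
        "checkpoints": steps // save_freq if save_freq > 0 else 0,
        # Assumption: one log line every log_freq steps.
        "log_lines": steps // log_freq if log_freq > 0 else 0,
    }


budget = training_budget(steps=100_000, batch_size=8, save_freq=5_000, log_freq=200)
print(budget)
```

With the values in the table this gives 800,000 samples, 20 checkpoints, and 500 log lines, which is useful when estimating disk usage for `output_dir`.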
ACT policy arguments
| Argument | Description | Default / Notes |
|---|---|---|
| policy.type | Policy algorithm type | act / pi0 |
| policy.device | Device | cuda / cpu |
| policy.use_amp | Enable mixed-precision training | true |
| policy.n_obs_steps | Number of observation steps | 1 |
| policy.chunk_size | Action chunk length | 50 |
| policy.n_action_steps | Action steps executed per inference | 50 |
| policy.vision_backbone | Vision encoder backbone | resnet18 |
| policy.pretrained_backbone_weights | Backbone pretrained weights | ResNet18_Weights.IMAGENET1K_V1 |
| policy.dim_model | Transformer model dimension | 256 |
| policy.n_heads | Number of attention heads | 4 |
| policy.dim_feedforward | Feed-forward dimension | 1024 |
| policy.n_encoder_layers | Encoder layers | 4 |
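
Two relationships between these ACT values are worth checking before a long run: the model dimension has to split evenly across attention heads, and the number of executed action steps cannot exceed the predicted chunk. The sketch below is an illustrative sanity check, not code from the repository; `episode_len=400` is a made-up episode length, and the call count assumes actions are executed open loop without temporal ensembling.

```python
import math


def act_shape_summary(dim_model, n_heads, chunk_size, n_action_steps, episode_len):
    """Check basic ACT shape constraints and summarize their consequences."""
    if dim_model % n_heads != 0:
        raise ValueError("dim_model must be divisible by n_heads")
    if n_action_steps > chunk_size:
        raise ValueError("n_action_steps cannot exceed chunk_size")
    return {
        # Per-head width of the multi-head attention layers.
        "head_dim": dim_model // n_heads,
        # Forward passes needed if each call yields n_action_steps actions.
        "policy_calls": math.ceil(episode_len / n_action_steps),
    }


summary = act_shape_summary(dim_model=256, n_heads=4, chunk_size=50,
                            n_action_steps=50, episode_len=400)
print(summary)
```

Lowering `n_action_steps` below `chunk_size` makes the policy replan more often, at the cost of more forward passes per episode.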
2. Diffusion Policy (DP)
Below is a training example using Diffusion Policy, with full hyperparameters:
/isaac-sim/python.sh src/lerobot/scripts/lerobot_train.py \
--dataset.repo_id=your_org/your_dataset \
--dataset.root=datasets/Part_Sorting \
--dataset.video_backend=pyav \
--output_dir=challenge2026_baseline/Part_Sorting/diffusion \
--policy.repo_id=none \
--policy.type=diffusion \
--policy.n_obs_steps=2 \
--policy.horizon=16 \
--policy.n_action_steps=8 \
--policy.vision_backbone=resnet18 \
--policy.pretrained_backbone_weights=null \
--policy.resize_shape=null \
--policy.crop_ratio=1.0 \
--policy.crop_shape=null \
--policy.crop_is_random=true \
--policy.use_group_norm=true \
--policy.spatial_softmax_num_keypoints=32 \
--policy.use_separate_rgb_encoder_per_camera=false \
--policy.down_dims='[512,1024,2048]' \
--policy.kernel_size=5 \
--policy.n_groups=8 \
--policy.diffusion_step_embed_dim=128 \
--policy.use_film_scale_modulation=true \
--policy.noise_scheduler_type=DDPM \
--policy.num_train_timesteps=100 \
--policy.beta_schedule=squaredcos_cap_v2 \
--policy.beta_start=0.0001 \
--policy.beta_end=0.02 \
--policy.prediction_type=epsilon \
--policy.clip_sample=true \
--policy.clip_sample_range=1.0 \
--policy.num_inference_steps=null \
--policy.compile_model=false \
--policy.compile_mode=reduce-overhead \
--policy.do_mask_loss_for_padding=false \
--policy.optimizer_lr=1e-4 \
--policy.optimizer_betas='[0.95,0.999]' \
--policy.optimizer_eps=1e-8 \
--policy.optimizer_weight_decay=1e-6 \
--policy.scheduler_name=cosine \
--policy.scheduler_warmup_steps=500 \
--job_name=part_sorting_diffusion \
--resume=false \
--seed=1000 \
--num_workers=8 \
--batch_size=32 \
--steps=100000 \
--eval_freq=0 \
--log_freq=200 \
--save_checkpoint=true \
--save_freq=5000
Replace `your_org/your_dataset` with your own dataset repo ID, and replace `challenge2026_baseline/Part_Sorting/diffusion` with your own output path.
Diffusion Policy arguments - input/output structure
| Argument | Description | Default / Notes |
|---|---|---|
| policy.type | Policy algorithm type | diffusion |
| policy.n_obs_steps | Number of observation steps | 2 |
| policy.horizon | Action prediction horizon | 16 |
| policy.n_action_steps | Action steps executed per inference | 8 |
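
The three step counts above are coupled. The sketch below encodes two constraints that, to my understanding, the upstream LeRobot Diffusion Policy configuration enforces: `n_action_steps <= horizon - n_obs_steps + 1`, and `horizon` divisible by the UNet downsampling factor `2 ** len(down_dims)`. Treat both as assumptions and verify them against the version shipped in this repository; the function name is hypothetical.

```python
def check_diffusion_horizon(n_obs_steps, horizon, n_action_steps, down_dims):
    """Validate Diffusion Policy step counts (assumed upstream constraints)."""
    # Actions that overlap past observations are not executed.
    max_action_steps = horizon - n_obs_steps + 1
    if n_action_steps > max_action_steps:
        raise ValueError(
            f"n_action_steps={n_action_steps} exceeds the limit {max_action_steps}"
        )
    # Each UNet stage is assumed to halve the temporal length.
    downsampling_factor = 2 ** len(down_dims)
    if horizon % downsampling_factor != 0:
        raise ValueError(
            f"horizon={horizon} is not a multiple of {downsampling_factor}"
        )
    return max_action_steps, downsampling_factor


limits = check_diffusion_horizon(n_obs_steps=2, horizon=16, n_action_steps=8,
                                 down_dims=[512, 1024, 2048])
print(limits)
```

For the command above this yields a limit of 15 executable steps and a downsampling factor of 8, so the chosen values pass both checks.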
Diffusion Policy arguments - vision backbone
| Argument | Description | Default / Notes |
|---|---|---|
| policy.vision_backbone | Vision encoder backbone | resnet18 |
| policy.pretrained_backbone_weights | Backbone pretrained weights | null |
| policy.resize_shape | Resize shape (H, W) | null |
| policy.crop_ratio | Crop ratio (0, 1] | 1.0 |
| policy.crop_shape | Crop shape (H, W) | null |
| policy.crop_is_random | Random crop (during training) | true |
| policy.use_group_norm | Use GroupNorm instead of BN | true |
| policy.spatial_softmax_num_keypoints | Number of SpatialSoftmax keypoints | 32 |
| policy.use_separate_rgb_encoder_per_camera | Separate encoder per camera | false |
Diffusion Policy arguments - UNet architecture
| Argument | Description | Default / Notes |
|---|---|---|
| policy.down_dims | UNet downsampling dims | [512,1024,2048] |
| policy.kernel_size | Convolution kernel size | 5 |
| policy.n_groups | GroupNorm groups | 8 |
| policy.diffusion_step_embed_dim | Diffusion step embedding dim | 128 |
| policy.use_film_scale_modulation | Use FiLM scale modulation | true |
Diffusion Policy arguments - noise scheduler
| Argument | Description | Default / Notes |
|---|---|---|
| policy.noise_scheduler_type | Scheduler type | DDPM / DDIM |
| policy.num_train_timesteps | Diffusion steps (train) | 100 |
| policy.beta_schedule | Beta schedule | squaredcos_cap_v2 |
| policy.beta_start | Beta start | 0.0001 |
| policy.beta_end | Beta end | 0.02 |
| policy.prediction_type | Prediction type | epsilon / sample |
| policy.clip_sample | Clip samples | true |
| policy.clip_sample_range | Clip range | 1.0 |
| policy.num_inference_steps | Inference steps | null (same as train steps) |
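
The `squaredcos_cap_v2` schedule derives each beta from a squared-cosine cumulative-alpha curve and caps it at 0.999. The sketch below reproduces that computation in plain Python as I understand it from the Hugging Face diffusers implementation; it is for intuition only and is not the code the training script calls. With this schedule, `beta_start` and `beta_end` appear to be unused, which is worth confirming in your installed version.

```python
import math


def squaredcos_cap_v2_betas(num_train_timesteps, max_beta=0.999):
    """Cosine (squaredcos_cap_v2) beta schedule, written out for illustration."""

    def alpha_bar(t):
        # Cumulative product of (1 - beta) as a function of t in [0, 1].
        return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2

    betas = []
    for i in range(num_train_timesteps):
        t1 = i / num_train_timesteps
        t2 = (i + 1) / num_train_timesteps
        # beta_i = 1 - alpha_bar(t2) / alpha_bar(t1), capped for stability.
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas


betas = squaredcos_cap_v2_betas(100)
print(f"first beta: {betas[0]:.6f}, last beta: {betas[-1]:.3f}")
```

The early betas are small, so little noise is added at first, and the final beta hits the cap.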
Diffusion Policy arguments - optimizer & scheduler
| Argument | Description | Default / Notes |
|---|---|---|
| policy.optimizer_lr | Learning rate | 1e-4 |
| policy.optimizer_betas | Adam betas | [0.95,0.999] |
| policy.optimizer_eps | Adam eps | 1e-8 |
| policy.optimizer_weight_decay | Weight decay | 1e-6 |
| policy.scheduler_name | LR scheduler | cosine |
| policy.scheduler_warmup_steps | Warmup steps | 500 |
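
The combination of `scheduler_name=cosine` and `scheduler_warmup_steps=500` can be pictured as a linear ramp followed by a half-cosine decay to zero. The sketch below assumes the standard "cosine with warmup" shape and that the decay spans the full `--steps=100000`; the function is illustrative and not the scheduler object used by the training script.

```python
import math


def cosine_with_warmup_lr(step, base_lr=1e-4, warmup_steps=500, total_steps=100_000):
    """Learning rate at a given step under linear warmup + cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for s in (0, 250, 500, 50_000, 100_000):
    print(s, cosine_with_warmup_lr(s))
```

If you shorten `--steps`, the decay compresses accordingly under this assumption, so the peak learning rate is held for fewer updates.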
Diffusion Policy arguments - other
| Argument | Description | Default / Notes |
|---|---|---|
| policy.compile_model | Compile model | false |
| policy.compile_mode | Compile mode | reduce-overhead |
| policy.do_mask_loss_for_padding | Mask padding loss | false |
3. π₀ (PI0)
Download pretrained weights:
# Download pretrained weights
hf download \
lerobot/pi0_base \
--local-dir pretrained/pi0_base
hf download \
lerobot/pi05_base \
--local-dir pretrained/pi05_base
hf download google/paligemma-3b-pt-224 \
--local-dir pretrained/paligemma-3b-pt-224
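In /workspace/GlobalHumanoidRobotChallenge_2026_Baseline/src/lerobot/processor/tokenizer_processor.py, find the block that initializes the tokenizer and replace it with the following, so that the PaliGemma tokenizer is loaded from local files instead of being fetched online: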
if self.tokenizer is not None:
    # Use the provided tokenizer object directly
    self.input_tokenizer = self.tokenizer
elif self.tokenizer_name is not None:
    if AutoTokenizer is None:
        raise ImportError("AutoTokenizer is not available")
    # A tokenizer_name containing "paligemma" indicates a pi0 model; force local offline loading
    if "paligemma" in self.tokenizer_name.lower():
        self.input_tokenizer = AutoTokenizer.from_pretrained(
            "/root/.cache/huggingface/hub/models--google--paligemma-3b-pt-224/snapshots/35e4f46485b4d07967e7e9935bc3786aad50687c",
            local_files_only=True,
        )
    else:
        # Otherwise (e.g., act/smolvla), load normally using the provided tokenizer_name
        self.input_tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_name)
else:
    raise ValueError(
        "Either 'tokenizer' or 'tokenizer_name' must be provided. "
        "Pass a tokenizer object directly or a tokenizer name to auto-load."
    )
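
The snapshot hash in the hard-coded path above may differ on your machine. The sketch below lists the snapshot directories the Hugging Face cache holds for a repo, relying on the standard cache layout `models--<org>--<name>/snapshots/<revision>`. Both function names are hypothetical helpers, and the default cache root is an assumption that you should adjust if `HF_HOME` is set.

```python
from pathlib import Path


def repo_cache_dirname(repo_id):
    """Map 'org/name' to the folder name used by the Hugging Face cache."""
    return "models--" + repo_id.replace("/", "--")


def find_snapshot_dirs(repo_id, cache_root="~/.cache/huggingface/hub"):
    """Return snapshot directories for repo_id, or an empty list if none exist."""
    snapshots = Path(cache_root).expanduser() / repo_cache_dirname(repo_id) / "snapshots"
    if not snapshots.is_dir():
        return []
    return sorted(p for p in snapshots.iterdir() if p.is_dir())


for path in find_snapshot_dirs("google/paligemma-3b-pt-224"):
    print(path)
```

Whichever directory it prints is the path to pass to `AutoTokenizer.from_pretrained(..., local_files_only=True)`.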
Below is a training example using π₀ (PI0), with full hyperparameters:
# Minimal training command
/isaac-sim/python.sh src/lerobot/scripts/lerobot_train.py \
--policy.path=lerobot/pi0_base \
--dataset.repo_id=your_org/your_dataset \
--batch_size=64 \
--steps=20000 \
--output_dir=challenge2026_baseline/Part_Sorting/pi0 \
--job_name=part_sorting_pi0 \
--policy.device=cuda \
--wandb.enable=true
# Detailed training command
/isaac-sim/python.sh src/lerobot/scripts/lerobot_train.py \
--dataset.repo_id=your_org/your_dataset \
--dataset.root=datasets/Part_Sorting \
--policy.type=pi0 \
--policy.paligemma_variant=gemma_2b \
--policy.action_expert_variant=gemma_300m \
--policy.dtype=float32 \
--policy.n_obs_steps=1 \
--policy.chunk_size=50 \
--policy.n_action_steps=50 \
--policy.max_state_dim=32 \
--policy.max_action_dim=32 \
--policy.num_inference_steps=10 \
--policy.time_sampling_beta_alpha=1.5 \
--policy.time_sampling_beta_beta=1.0 \
--policy.time_sampling_scale=0.999 \
--policy.time_sampling_offset=0.001 \
--policy.min_period=0.004 \
--policy.max_period=4.0 \
--policy.image_resolution='[224,224]' \
--policy.empty_cameras=0 \
--policy.gradient_checkpointing=false \
--policy.compile_model=false \
--policy.compile_mode=max-autotune \
--policy.freeze_vision_encoder=false \
--policy.train_expert_only=false \
--policy.optimizer_lr=2.5e-5 \
--policy.optimizer_betas='[0.9,0.95]' \
--policy.optimizer_eps=1e-8 \
--policy.optimizer_weight_decay=0.01 \
--policy.optimizer_grad_clip_norm=1.0 \
--policy.scheduler_warmup_steps=1000 \
--policy.scheduler_decay_steps=30000 \
--policy.scheduler_decay_lr=2.5e-6 \
--policy.tokenizer_max_length=48 \
--output_dir=challenge2026_baseline/Part_Sorting/pi0 \
--job_name=part_sorting_pi0 \
--resume=false \
--seed=1000 \
--num_workers=8 \
--batch_size=8 \
--steps=100000 \
--eval_freq=0 \
--log_freq=200 \
--save_checkpoint=true \
--save_freq=5000 \
--wandb.entity=your_wandb_entity
Replace `your_org/your_dataset` with your own dataset repo ID, replace `challenge2026_baseline/Part_Sorting/pi0` with your own output path, and replace `your_wandb_entity` with your WandB username or team name. If you don't use WandB, you can remove the `--wandb.entity` argument.
π₀ policy arguments - model architecture
| Argument | Description | Default / Notes |
|---|---|---|
| policy.type | Policy algorithm type | pi0 |
| policy.paligemma_variant | PaliGemma variant | gemma_2b |
| policy.action_expert_variant | Action Expert variant | gemma_300m |
| policy.dtype | Data type | float32 |
π₀ policy arguments - input/output structure
| Argument | Description | Default / Notes |
|---|---|---|
| policy.n_obs_steps | Number of observation steps | 1 |
| policy.chunk_size | Action chunk size | 50 |
| policy.n_action_steps | Action steps executed | 50 |
| policy.max_state_dim | Max state dim (padded to) | 32 |
| policy.max_action_dim | Max action dim (padded to) | 32 |
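
"Padded to" means that state and action vectors shorter than `max_state_dim` / `max_action_dim` are extended to a fixed width before they reach the model. The sketch below assumes zero padding on the right; the 14-dimensional state is a made-up example, and `pad_vector` here is an illustrative stand-in rather than the repository's implementation.

```python
def pad_vector(values, target_dim):
    """Right-pad a vector with zeros up to target_dim (assumed padding scheme)."""
    if len(values) > target_dim:
        raise ValueError(
            f"vector of length {len(values)} exceeds target_dim={target_dim}"
        )
    return list(values) + [0.0] * (target_dim - len(values))


state = [0.1] * 14  # hypothetical 14-DoF robot state
padded = pad_vector(state, target_dim=32)
print(len(padded), padded[-3:])
```

A robot whose state or action dimension exceeds 32 would need larger `max_state_dim` / `max_action_dim` values.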
π₀ policy arguments - flow matching
| Argument | Description | Default / Notes |
|---|---|---|
| policy.num_inference_steps | Denoising steps (inference) | 10 |
| policy.time_sampling_beta_alpha | Time-sampling beta α | 1.5 |
| policy.time_sampling_beta_beta | Time-sampling beta β | 1.0 |
| policy.time_sampling_scale | Time-sampling scale | 0.999 |
| policy.time_sampling_offset | Time-sampling offset | 0.001 |
| policy.min_period | Minimum period | 0.004 |
| policy.max_period | Maximum period | 4.0 |
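
The four `time_sampling_*` values describe how the flow-matching time is drawn during training. One plausible reading, which I believe matches the upstream implementation but which you should treat as an assumption, is `t = Beta(α, β) * scale + offset`. The sketch below draws such samples with the standard library; the function name and the seed are illustrative.

```python
import random


def sample_flow_time(rng, alpha=1.5, beta=1.0, scale=0.999, offset=0.001):
    """Draw one flow-matching time as Beta(alpha, beta) * scale + offset."""
    return rng.betavariate(alpha, beta) * scale + offset


rng = random.Random(1000)
times = [sample_flow_time(rng) for _ in range(10_000)]
mean_time = sum(times) / len(times)
print(f"min={min(times):.4f} max={max(times):.4f} mean={mean_time:.3f}")
```

The scale and offset keep `t` away from exactly 0, and with α = 1.5 and β = 1.0 the samples lean toward larger values of `t`.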
π₀ policy arguments - images & cameras
| Argument | Description | Default / Notes |
|---|---|---|
| policy.image_resolution | Image resolution (H, W) | [224,224] |
| policy.empty_cameras | Number of empty cameras | 0 |
π₀ policy arguments - training settings
| Argument | Description | Default / Notes |
|---|---|---|
| policy.gradient_checkpointing | Enable gradient checkpointing | false |
| policy.compile_model | Compile model | false |
| policy.compile_mode | Compile mode | max-autotune |
π₀ policy arguments - fine-tuning
| Argument | Description | Default / Notes |
|---|---|---|
| policy.freeze_vision_encoder | Freeze vision encoder | false |
| policy.train_expert_only | Train Action Expert only | false |
π₀ policy arguments - optimizer
| Argument | Description | Default / Notes |
|---|---|---|
| policy.optimizer_lr | Learning rate | 2.5e-5 |
| policy.optimizer_betas | AdamW betas | [0.9,0.95] |
| policy.optimizer_eps | AdamW eps | 1e-8 |
| policy.optimizer_weight_decay | Weight decay | 0.01 |
| policy.optimizer_grad_clip_norm | Gradient clip norm | 1.0 |
π₀ policy arguments - LR scheduler
| Argument | Description | Default / Notes |
|---|---|---|
| policy.scheduler_warmup_steps | Warmup steps | 1000 |
| policy.scheduler_decay_steps | Decay steps | 30000 |
| policy.scheduler_decay_lr | Decay learning rate | 2.5e-6 |
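
The three scheduler values interact with `--steps`. The sketch below is a simplified model of a warmup plus cosine-decay schedule: a linear ramp to `optimizer_lr`, then a cosine decay that reaches `scheduler_decay_lr` at `scheduler_decay_steps` and stays there. The exact warmup formula and any automatic rescaling in the repository's scheduler may differ, so treat this as an approximation for planning purposes.

```python
import math


def pi0_lr(step, peak_lr=2.5e-5, decay_lr=2.5e-6,
           warmup_steps=1000, decay_steps=30_000):
    """Approximate learning rate at a step (simplified warmup + cosine decay)."""
    if step < warmup_steps:
        # Linear warmup toward peak_lr.
        return peak_lr * (step + 1) / (warmup_steps + 1)
    # Clamp progress so the rate stays at decay_lr after decay_steps.
    progress = min(step, decay_steps) / decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return decay_lr + (peak_lr - decay_lr) * cosine


for s in (0, 1_000, 15_000, 30_000, 100_000):
    print(s, pi0_lr(s))
```

Under this reading, a 100000-step run spends its last 70000 steps at the floor rate of 2.5e-6, so you may want to scale `scheduler_decay_steps` with `--steps`.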
π₀ policy arguments - tokenizer
| Argument | Description | Default / Notes |
|---|---|---|
| policy.tokenizer_max_length | Max tokenizer length | 48 |
4. π₀.₅ (PI05)
Below is a training example using π₀.₅ (PI05) with full hyperparameters. π₀.₅ is an enhanced version of π₀ that supports open-world generalization. Key differences include QUANTILES normalization, a longer tokenizer length, and AdaRMS conditioning.
# Minimal training command
/isaac-sim/python.sh src/lerobot/scripts/lerobot_train.py \
--dataset.repo_id=your_org/your_dataset \
--policy.type=pi05 \
--output_dir=challenge2026_baseline/Part_Sorting/pi05 \
--job_name=part_sorting_pi05 \
--policy.repo_id=your_repo_id \
--policy.pretrained_path=lerobot/pi05_base \
--policy.compile_model=true \
--policy.gradient_checkpointing=true \
--wandb.enable=true \
--policy.dtype=bfloat16 \
--policy.freeze_vision_encoder=false \
--policy.train_expert_only=false \
--steps=3000 \
--policy.device=cuda \
--batch_size=32
# Detailed training command
/isaac-sim/python.sh src/lerobot/scripts/lerobot_train.py \
--dataset.repo_id=your_org/your_dataset \
--dataset.root=datasets/Part_Sorting/ \
--policy.type=pi05 \
--policy.paligemma_variant=gemma_2b \
--policy.action_expert_variant=gemma_300m \
--policy.dtype=float32 \
--policy.n_obs_steps=1 \
--policy.chunk_size=50 \
--policy.n_action_steps=50 \
--policy.max_state_dim=32 \
--policy.max_action_dim=32 \
--policy.num_inference_steps=10 \
--policy.time_sampling_beta_alpha=1.5 \
--policy.time_sampling_beta_beta=1.0 \
--policy.time_sampling_scale=0.999 \
--policy.time_sampling_offset=0.001 \
--policy.min_period=0.004 \
--policy.max_period=4.0 \
--policy.image_resolution='[224,224]' \
--policy.empty_cameras=0 \
--policy.gradient_checkpointing=false \
--policy.compile_model=false \
--policy.compile_mode=max-autotune \
--policy.freeze_vision_encoder=false \
--policy.train_expert_only=false \
--policy.optimizer_lr=2.5e-5 \
--policy.optimizer_betas='[0.9,0.95]' \
--policy.optimizer_eps=1e-8 \
--policy.optimizer_weight_decay=0.01 \
--policy.optimizer_grad_clip_norm=1.0 \
--policy.scheduler_warmup_steps=1000 \
--policy.scheduler_decay_steps=30000 \
--policy.scheduler_decay_lr=2.5e-6 \
--policy.tokenizer_max_length=200 \
--output_dir=challenge2026_baseline/Part_Sorting/pi05 \
--job_name=part_sorting_pi05 \
--resume=false \
--seed=1000 \
--num_workers=8 \
--batch_size=8 \
--steps=100000 \
--eval_freq=0 \
--log_freq=200 \
--save_checkpoint=true \
--save_freq=5000 \
--wandb.entity=your_wandb_entity
Replace `your_org/your_dataset` with your own dataset repo ID, replace `challenge2026_baseline/Part_Sorting/pi05` with your own output path, and replace `your_wandb_entity` with your WandB username or team name. If you don't use WandB, you can remove the `--wandb.entity` argument.
π₀.₅ policy arguments - model architecture
| Argument | Description | Default / Notes |
|---|---|---|
| policy.type | Policy algorithm type | pi05 |
| policy.paligemma_variant | PaliGemma variant | gemma_2b |
| policy.action_expert_variant | Action Expert variant | gemma_300m |
| policy.dtype | Data type | float32 |
π₀.₅ policy arguments - input/output structure
| Argument | Description | Default / Notes |
|---|---|---|
| policy.n_obs_steps | Number of observation steps | 1 |
| policy.chunk_size | Action chunk size | 50 |
| policy.n_action_steps | Action steps executed | 50 |
| policy.max_state_dim | Max state dim (padded to) | 32 |
| policy.max_action_dim | Max action dim (padded to) | 32 |
π₀.₅ policy arguments - flow matching
| Argument | Description | Default / Notes |
|---|---|---|
| policy.num_inference_steps | Denoising steps (inference) | 10 |
| policy.time_sampling_beta_alpha | Time-sampling beta α | 1.5 |
| policy.time_sampling_beta_beta | Time-sampling beta β | 1.0 |
| policy.time_sampling_scale | Time-sampling scale | 0.999 |
| policy.time_sampling_offset | Time-sampling offset | 0.001 |
| policy.min_period | Minimum period | 0.004 |
| policy.max_period | Maximum period | 4.0 |
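
`min_period` and `max_period` are easiest to understand as the bounds of a sinusoidal embedding of the flow-matching time. The sketch below assumes geometrically spaced periods between the two bounds, with sine and cosine halves concatenated. This is my reading of the upstream design rather than code from this repository, and `dim=8` is chosen only to keep the output short.

```python
import math


def sinusoidal_time_embedding(t, dim, min_period=0.004, max_period=4.0):
    """Embed a scalar time with sin/cos features at geometrically spaced periods."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    sines, cosines = [], []
    for i in range(half):
        # Interpolate the period geometrically from min_period to max_period.
        fraction = i / (half - 1) if half > 1 else 0.0
        period = min_period * (max_period / min_period) ** fraction
        angle = 2.0 * math.pi * t / period
        sines.append(math.sin(angle))
        cosines.append(math.cos(angle))
    return sines + cosines


embedding = sinusoidal_time_embedding(t=0.5, dim=8)
print([round(v, 3) for v in embedding])
```

Short periods resolve small changes in `t`, while long periods keep the embedding unambiguous over the whole [0, 1] range.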
π₀.₅ policy arguments - images & cameras
| Argument | Description | Default / Notes |
|---|---|---|
| policy.image_resolution | Image resolution (H, W) | [224,224] |
| policy.empty_cameras | Number of empty cameras | 0 |
π₀.₅ policy arguments - training settings
| Argument | Description | Default / Notes |
|---|---|---|
| policy.gradient_checkpointing | Enable gradient checkpointing | false |
| policy.compile_model | Compile model | false |
| policy.compile_mode | Compile mode | max-autotune |
π₀.₅ policy arguments - fine-tuning
| Argument | Description | Default / Notes |
|---|---|---|
| policy.freeze_vision_encoder | Freeze vision encoder | false |
| policy.train_expert_only | Train Action Expert only | false |
π₀.₅ policy arguments - optimizer
| Argument | Description | Default / Notes |
|---|---|---|
| policy.optimizer_lr | Learning rate | 2.5e-5 |
| policy.optimizer_betas | AdamW betas | [0.9,0.95] |
| policy.optimizer_eps | AdamW eps | 1e-8 |
| policy.optimizer_weight_decay | Weight decay | 0.01 |
| policy.optimizer_grad_clip_norm | Gradient clip norm | 1.0 |
π₀.₅ policy arguments - LR scheduler
| Argument | Description | Default / Notes |
|---|---|---|
| policy.scheduler_warmup_steps | Warmup steps | 1000 |
| policy.scheduler_decay_steps | Decay steps | 30000 |
| policy.scheduler_decay_lr | Decay learning rate | 2.5e-6 |
π₀.₅ policy arguments - tokenizer
| Argument | Description | Default / Notes |
|---|---|---|
| policy.tokenizer_max_length | Max tokenizer length | 200 (π₀ uses 48) |
Key differences between π₀ and π₀.₅
| Feature | π₀ | π₀.₅ |
|---|---|---|
| Time conditioning injection | Concatenate time and action via action_time_mlp_* | AdaRMS conditioning via time_mlp_* |
| AdaRMS | Not used | Used in Action Expert |
| Tokenizer length | 48 tokens | 200 tokens |
| Discrete state input | False (uses state_proj layer) | True |
| Parameter count | Higher (includes state embedding) | Lower (no state embedding) |
| State normalization | MEAN_STD | QUANTILES |
| Action normalization | MEAN_STD | QUANTILES |
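
The last two rows of the table are the ones most likely to affect your data pipeline, since QUANTILES normalization needs quantile statistics in the dataset metadata. The sketch below contrasts the two modes on a scalar. It assumes QUANTILES maps the 1st and 99th percentiles to -1 and 1, which is my understanding of the upstream convention and should be verified; the statistics used are made-up numbers.

```python
def normalize_mean_std(x, mean, std, eps=1e-8):
    """MEAN_STD: center on the mean and scale by the standard deviation."""
    return (x - mean) / (std + eps)


def normalize_quantiles(x, q01, q99, eps=1e-8):
    """QUANTILES (assumed): map the [q01, q99] range linearly onto [-1, 1]."""
    return 2.0 * (x - q01) / (q99 - q01 + eps) - 1.0


# Hypothetical joint-angle statistics, in radians.
mean, std, q01, q99 = 0.2, 0.5, -1.0, 1.4

for value in (-1.0, 0.2, 1.4, 3.0):
    print(value,
          round(normalize_mean_std(value, mean, std), 3),
          round(normalize_quantiles(value, q01, q99), 3))
```

Quantile scaling bounds typical values to [-1, 1] and is less sensitive to rare outliers than dividing by the standard deviation.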
5. SmolVLA
Below is a fine-tuning example using the SmolVLA policy. SmolVLA is built on the SmolVLM2-500M-Video-Instruct vision-language model and supports open-world generalization.
# Minimal training command
/isaac-sim/python.sh src/lerobot/scripts/lerobot_train.py \
--policy.path=lerobot/smolvla_base \
--dataset.repo_id=your_org/your_dataset \
--batch_size=64 \
--steps=20000 \
--output_dir=challenge2026_baseline/Part_Sorting/smolvla \
--job_name=part_sorting_smolvla \
--policy.device=cuda \
--wandb.enable=true
# Detailed training command
/isaac-sim/python.sh src/lerobot/scripts/lerobot_train.py \
--dataset.repo_id=your_org/your_dataset \
--dataset.root=datasets/Part_Sorting/ \
--policy.type=smolvla \
--policy.vlm_model_name=HuggingFaceTB/SmolVLM2-500M-Video-Instruct \
--policy.load_vlm_weights=true \
--policy.dtype=float32 \
--policy.n_obs_steps=1 \
--policy.chunk_size=50 \
--policy.n_action_steps=50 \
--policy.max_state_dim=32 \
--policy.max_action_dim=32 \
--policy.num_steps=10 \
--policy.tokenizer_max_length=48 \
--policy.image_resolution='[224,224]' \
--policy.empty_cameras=0 \
--policy.freeze_vision_encoder=true \
--policy.train_expert_only=true \
--policy.train_state_proj=true \
--policy.gradient_checkpointing=false \
--policy.compile_model=false \
--policy.compile_mode=max-autotune \
--policy.attention_mode=cross_attn \
--policy.num_vlm_layers=16 \
--policy.self_attn_every_n_layers=2 \
--policy.expert_width_multiplier=0.75 \
--policy.optimizer_lr=1e-4 \
--policy.optimizer_betas='[0.9,0.95]' \
--policy.optimizer_eps=1e-8 \
--policy.optimizer_weight_decay=1e-10 \
--policy.optimizer_grad_clip_norm=10.0 \
--policy.scheduler_warmup_steps=1000 \
--policy.scheduler_decay_steps=30000 \
--policy.scheduler_decay_lr=2.5e-6 \
--policy.min_period=0.004 \
--policy.max_period=4.0 \
--output_dir=challenge2026_baseline/Part_Sorting/smolvla \
--job_name=part_sorting_smolvla \
--resume=false \
--seed=1000 \
--num_workers=8 \
--batch_size=8 \
--steps=100000 \
--eval_freq=0 \
--log_freq=200 \
--save_checkpoint=true \
--save_freq=5000 \
--wandb.entity=your_wandb_entity
Replace `your_org/your_dataset` with your own dataset repo ID, replace `challenge2026_baseline/Part_Sorting/smolvla` with your own output path, and replace `your_wandb_entity` with your WandB username or team name. If you don't use WandB, you can remove the `--wandb.entity` argument.
SmolVLA policy arguments - model architecture
| Argument | Description | Default / Notes |
|---|---|---|
| policy.type | Policy algorithm type | smolvla |
| policy.vlm_model_name | VLM backbone | HuggingFaceTB/SmolVLM2-500M-Video-Instruct |
| policy.load_vlm_weights | Load pretrained VLM weights | true |
| policy.dtype | Data type | float32 |
SmolVLA policy arguments - input/output structure
| Argument | Description | Default / Notes |
|---|---|---|
| policy.n_obs_steps | Number of observation steps | 1 |
| policy.chunk_size | Action chunk size | 50 |
| policy.n_action_steps | Action steps executed | 50 |
| policy.max_state_dim | Max state dim (padded to) | 32 |
| policy.max_action_dim | Max action dim (padded to) | 32 |
SmolVLA policy arguments - decoding & tokenizer
| Argument | Description | Default / Notes |
|---|---|---|
| policy.num_steps | Denoising steps (inference) | 10 |
| policy.tokenizer_max_length | Max tokenizer length | 48 |
| policy.use_cache | Use attention cache | true |
SmolVLA policy arguments - images & cameras
| Argument | Description | Default / Notes |
|---|---|---|
| policy.image_resolution | Image preprocessing resolution (H, W) | [224,224] |
| policy.empty_cameras | Number of empty cameras | 0 |
| policy.add_image_special_tokens | Use image special tokens | false |
SmolVLA policy arguments - fine-tuning
| Argument | Description | Default / Notes |
|---|---|---|
| policy.freeze_vision_encoder | Freeze vision encoder | true |
| policy.train_expert_only | Train Action Expert only | true |
| policy.train_state_proj | Train state projection | true |
SmolVLA policy arguments - transformer architecture
| Argument | Description | Default / Notes |
|---|---|---|
| policy.attention_mode | Attention mode | cross_attn |
| policy.num_vlm_layers | Number of VLM layers used | 16 |
| policy.self_attn_every_n_layers | Insert self-attention every N layers | 2 |
| policy.expert_width_multiplier | Action Expert hidden width multiplier | 0.75 |
SmolVLA policy arguments - optimizer
| Argument | Description | Default / Notes |
|---|---|---|
| policy.optimizer_lr | Learning rate | 1e-4 |
| policy.optimizer_betas | AdamW betas | [0.9,0.95] |
| policy.optimizer_eps | AdamW eps | 1e-8 |
| policy.optimizer_weight_decay | Weight decay | 1e-10 |
| policy.optimizer_grad_clip_norm | Gradient clip norm | 10.0 |
SmolVLA policy arguments - LR scheduler
| Argument | Description | Default / Notes |
|---|---|---|
| policy.scheduler_warmup_steps | Warmup steps | 1000 |
| policy.scheduler_decay_steps | Decay steps | 30000 |
| policy.scheduler_decay_lr | Decay learning rate | 2.5e-6 |
SmolVLA policy arguments - training settings
| Argument | Description | Default / Notes |
|---|---|---|
| policy.gradient_checkpointing | Enable gradient checkpointing | false |
| policy.compile_model | Compile model | false |
| policy.compile_mode | Compile mode | max-autotune |