Model Training

1. ACT

Below is a training example for Task 1 using the ACT policy, with full hyperparameters:

# Minimal training command
/isaac-sim/python.sh src/lerobot/scripts/lerobot_train.py \
    --dataset.repo_id=your_org/your_dataset \
    --dataset.video_backend=pyav \
    --policy.type=act \
    --output_dir=challenge2026_baseline/Part_Sorting/act \
    --dataset.root=datasets/Part_Sorting/ \
    --job_name=part_sorting_act \
    --policy.device=cuda \
    --wandb.enable=false \
    --policy.repo_id=none \
    --policy.push_to_hub=false 


# Detailed training command (Task 1)
/isaac-sim/python.sh src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your_org/your_dataset \
  --dataset.root=datasets/Part_Sorting/ \
  --dataset.video_backend=pyav \
  --policy.type=act \
  --policy.n_obs_steps=1 \
  --policy.chunk_size=50 \
  --policy.n_action_steps=50 \
  --policy.vision_backbone=resnet18 \
  --policy.pretrained_backbone_weights=ResNet18_Weights.IMAGENET1K_V1 \
  --policy.dim_model=256 \
  --policy.n_heads=4 \
  --policy.dim_feedforward=1024 \
  --policy.n_encoder_layers=4 \
  --policy.n_decoder_layers=1 \
  --policy.use_vae=true \
  --policy.latent_dim=32 \
  --policy.n_vae_encoder_layers=4 \
  --policy.dropout=0.1 \
  --policy.kl_weight=10.0 \
  --policy.optimizer_lr=1e-5 \
  --policy.optimizer_weight_decay=1e-4 \
  --policy.optimizer_lr_backbone=1e-5 \
  --policy.device=cuda \
  --policy.use_amp=true \
  --policy.push_to_hub=false \
  --output_dir=challenge2026_baseline/Part_Sorting/act \
  --job_name=part_sorting_act \
  --resume=false \
  --seed=1000 \
  --num_workers=8 \
  --batch_size=8 \
  --steps=100000 \
  --eval_freq=0 \
  --log_freq=200 \
  --save_checkpoint=true \
  --save_freq=5000 \
  --wandb.entity=your_wandb_entity

Replace your_org/your_dataset with your own dataset repo ID, replace challenge2026_baseline/Part_Sorting/act with your own output path, and replace your_wandb_entity with your WandB username or team name. If you don't use WandB, you can remove the --wandb.entity argument. Training ACT downloads the ImageNet-pretrained ResNet-18 weights (resnet18-f37072fd.pth), so the machine needs network access unless the file is already cached.
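
The placeholder substitutions described above can also be scripted. The following is a minimal sketch, not part of the baseline repository: `build_train_command` and its parameters are hypothetical names, and it only assembles the arguments already shown in the minimal ACT command.

```python
import shlex


def build_train_command(dataset_repo_id, output_dir, wandb_entity=None):
    """Assemble the minimal ACT training command as an argument list.

    Hypothetical helper: the flag names are copied from the commands above,
    everything else is illustrative.
    """
    args = {
        "dataset.repo_id": dataset_repo_id,
        "dataset.root": "datasets/Part_Sorting/",
        "dataset.video_backend": "pyav",
        "policy.type": "act",
        "policy.device": "cuda",
        "policy.push_to_hub": "false",
        "output_dir": output_dir,
        "job_name": "part_sorting_act",
    }
    if wandb_entity is None:
        # No WandB account: disable logging and omit --wandb.entity.
        args["wandb.enable"] = "false"
    else:
        args["wandb.enable"] = "true"
        args["wandb.entity"] = wandb_entity

    cmd = ["/isaac-sim/python.sh", "src/lerobot/scripts/lerobot_train.py"]
    cmd += [f"--{key}={value}" for key, value in args.items()]
    return cmd


cmd = build_train_command("your_org/your_dataset", "outputs/act_run")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the run; printing it first makes it easy to check the substitutions.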

Dataset & output arguments

Argument	Description	Default / Notes
`dataset.repo_id`	Dataset ID (Hugging Face or local org name)	Required
`dataset.root`	Local dataset root path	Required
`output_dir`	Directory to save checkpoints and logs	Required
`job_name`	Run identifier (shown in logs / WandB)	Optional
`resume`	Resume training from the last checkpoint	`false`
`seed`	Global random seed	`1000`

Training loop arguments

Argument	Description	Default / Notes
`steps`	Total training steps	`100000`
`batch_size`	Batch size	`8`
`num_workers`	DataLoader worker processes	`8`
`eval_freq`	Evaluation interval (0 disables)	`0`
`log_freq`	Log print interval	`200`
`save_checkpoint`	Whether to save checkpoints	`true`
`save_freq`	Checkpoint save interval (steps)	`5000`
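
To get a feel for what these loop settings imply, the arithmetic can be written out. This is a rough sketch that assumes one checkpoint every `save_freq` steps, one log line every `log_freq` steps, and `batch_size` samples per step.

```python
# Values taken from the training loop table above.
steps = 100_000
batch_size = 8
save_freq = 5_000
log_freq = 200

samples_seen = steps * batch_size      # training samples drawn over the run
num_checkpoints = steps // save_freq   # checkpoints written to output_dir
num_log_lines = steps // log_freq      # log entries printed

print(samples_seen, num_checkpoints, num_log_lines)  # 800000 20 500
```

Twenty checkpoints of a large policy can take a lot of disk space, so raising `save_freq` may be worth considering for the larger models.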

ACT policy arguments

Argument	Description	Default / Notes
`policy.type`	Policy algorithm type	`act` (other policies in this guide: `diffusion`, `pi0`, `pi05`, `smolvla`)
`policy.device`	Device	`cuda` / `cpu`
`policy.use_amp`	Enable mixed-precision training	`true`
`policy.n_obs_steps`	Number of observation steps	`1`
`policy.chunk_size`	Action chunk length	`50`
`policy.n_action_steps`	Action steps executed per inference	`50`
`policy.vision_backbone`	Vision encoder backbone	`resnet18`
`policy.pretrained_backbone_weights`	Backbone pretrained weights	`ResNet18_Weights.IMAGENET1K_V1`
`policy.dim_model`	Transformer model dimension	`256`
`policy.n_heads`	Number of attention heads	`4`
`policy.dim_feedforward`	Feed-forward dimension	`1024`
`policy.n_encoder_layers`	Encoder layers	`4`
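
The relationship between `policy.chunk_size` and `policy.n_action_steps` can be illustrated with a small calculation. This is a sketch under two assumptions: actions are executed open-loop from each predicted chunk (no temporal ensembling), and the episode length of 400 steps is purely illustrative. `policy_calls` is a hypothetical helper name.

```python
import math


def policy_calls(episode_len, chunk_size=50, n_action_steps=50):
    """Number of forward passes needed to control one episode.

    Each forward pass predicts chunk_size actions, of which the first
    n_action_steps are executed before the policy is queried again.
    """
    if n_action_steps > chunk_size:
        raise ValueError("n_action_steps cannot exceed chunk_size")
    return math.ceil(episode_len / n_action_steps)


print(policy_calls(400))                      # 8
print(policy_calls(400, n_action_steps=10))   # 40
```

Lowering `n_action_steps` makes the policy react to new observations more often, at the cost of more inference calls.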

2. Diffusion Policy (DP)

Below is a training example using Diffusion Policy, with full hyperparameters:

/isaac-sim/python.sh src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your_org/your_dataset \
  --dataset.root=datasets/Part_Sorting \
  --dataset.video_backend=pyav \
  --output_dir=challenge2026_baseline/Part_Sorting/diffusion \
  --policy.repo_id=none \
  --policy.type=diffusion \
  --policy.n_obs_steps=2 \
  --policy.horizon=16 \
  --policy.n_action_steps=8 \
  --policy.vision_backbone=resnet18 \
  --policy.pretrained_backbone_weights=null \
  --policy.resize_shape=null \
  --policy.crop_ratio=1.0 \
  --policy.crop_shape=null \
  --policy.crop_is_random=true \
  --policy.use_group_norm=true \
  --policy.spatial_softmax_num_keypoints=32 \
  --policy.use_separate_rgb_encoder_per_camera=false \
  --policy.down_dims='[512,1024,2048]' \
  --policy.kernel_size=5 \
  --policy.n_groups=8 \
  --policy.diffusion_step_embed_dim=128 \
  --policy.use_film_scale_modulation=true \
  --policy.noise_scheduler_type=DDPM \
  --policy.num_train_timesteps=100 \
  --policy.beta_schedule=squaredcos_cap_v2 \
  --policy.beta_start=0.0001 \
  --policy.beta_end=0.02 \
  --policy.prediction_type=epsilon \
  --policy.clip_sample=true \
  --policy.clip_sample_range=1.0 \
  --policy.num_inference_steps=null \
  --policy.compile_model=false \
  --policy.compile_mode=reduce-overhead \
  --policy.do_mask_loss_for_padding=false \
  --policy.optimizer_lr=1e-4 \
  --policy.optimizer_betas='[0.95,0.999]' \
  --policy.optimizer_eps=1e-8 \
  --policy.optimizer_weight_decay=1e-6 \
  --policy.scheduler_name=cosine \
  --policy.scheduler_warmup_steps=500 \
  --job_name=part_sorting_diffusion \
  --resume=false \
  --seed=1000 \
  --num_workers=8 \
  --batch_size=32 \
  --steps=100000 \
  --eval_freq=0 \
  --log_freq=200 \
  --save_checkpoint=true \
  --save_freq=5000

Replace your_org/your_dataset with your own dataset repo ID, and replace challenge2026_baseline/Part_Sorting/diffusion with your own output path.

Diffusion Policy arguments - input/output structure

Argument	Description	Default / Notes
`policy.type`	Policy algorithm type	`diffusion`
`policy.n_obs_steps`	Number of observation steps	`2`
`policy.horizon`	Action prediction horizon	`16`
`policy.n_action_steps`	Action steps executed per inference	`8`
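
The three structural values in this table constrain one another. The sketch below assumes the windowing convention from the Diffusion Policy formulation, in which the predicted horizon starts at the oldest observation and execution begins at the most recent one (index `n_obs_steps - 1`). `executed_window` is a hypothetical helper name.

```python
def executed_window(n_obs_steps=2, horizon=16, n_action_steps=8):
    """Return the [start, end) slice of the predicted horizon that is executed."""
    start = n_obs_steps - 1          # index aligned with the current observation
    end = start + n_action_steps     # exclusive end of the executed slice
    if end > horizon:
        raise ValueError(
            "n_action_steps must be <= horizon - n_obs_steps + 1"
        )
    return start, end


start, end = executed_window()
print(start, end)  # 1 9
```

With the values above, 8 of the 16 predicted actions are executed and the remaining ones are discarded when the policy replans.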

Diffusion Policy arguments - vision backbone

Argument	Description	Default / Notes
`policy.vision_backbone`	Vision encoder backbone	`resnet18`
`policy.pretrained_backbone_weights`	Backbone pretrained weights	`null`
`policy.resize_shape`	Resize shape (H, W)	`null`
`policy.crop_ratio`	Crop ratio (0, 1]	`1.0`
`policy.crop_shape`	Crop shape (H, W)	`null`
`policy.crop_is_random`	Random crop (during training)	`true`
`policy.use_group_norm`	Use GroupNorm instead of BN	`true`
`policy.spatial_softmax_num_keypoints`	Number of SpatialSoftmax keypoints	`32`
`policy.use_separate_rgb_encoder_per_camera`	Separate encoder per camera	`false`

Diffusion Policy arguments - UNet architecture

Argument	Description	Default / Notes
`policy.down_dims`	UNet downsampling dims	`[512,1024,2048]`
`policy.kernel_size`	Convolution kernel size	`5`
`policy.n_groups`	GroupNorm groups	`8`
`policy.diffusion_step_embed_dim`	Diffusion step embedding dim	`128`
`policy.use_film_scale_modulation`	Use FiLM scale modulation	`true`

Diffusion Policy arguments - noise scheduler

Argument	Description	Default / Notes
`policy.noise_scheduler_type`	Scheduler type	`DDPM` / `DDIM`
`policy.num_train_timesteps`	Diffusion steps (train)	`100`
`policy.beta_schedule`	Beta schedule	`squaredcos_cap_v2`
`policy.beta_start`	Beta start	`0.0001`
`policy.beta_end`	Beta end	`0.02`
`policy.prediction_type`	Prediction type	`epsilon` / `sample`
`policy.clip_sample`	Clip samples	`true`
`policy.clip_sample_range`	Clip range	`1.0`
`policy.num_inference_steps`	Inference steps	`null` (same as train steps)
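
For reference, the `squaredcos_cap_v2` schedule can be written out in a few lines. This is a sketch of the cosine schedule as commonly implemented (betas derived from a squared-cosine cumulative product and capped at 0.999); it is not copied from the training code. Under this formula `beta_start` and `beta_end` play no role, so they probably only matter for linear-type schedules, which is worth verifying against the scheduler you use.

```python
import math


def squaredcos_cap_v2_betas(num_train_timesteps=100, max_beta=0.999):
    """Cosine noise schedule: beta_i = 1 - alpha_bar(t_{i+1}) / alpha_bar(t_i)."""

    def alpha_bar(t):
        # Cumulative signal level at normalized time t in [0, 1].
        return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2

    betas = []
    for i in range(num_train_timesteps):
        t1 = i / num_train_timesteps
        t2 = (i + 1) / num_train_timesteps
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas


betas = squaredcos_cap_v2_betas()
print(len(betas), betas[0], betas[-1])
```

The first beta is very small and the last one hits the cap, which is the "cap" in the schedule's name.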

Diffusion Policy arguments - optimizer & scheduler

Argument	Description	Default / Notes
`policy.optimizer_lr`	Learning rate	`1e-4`
`policy.optimizer_betas`	Adam betas	`[0.95,0.999]`
`policy.optimizer_eps`	Adam eps	`1e-8`
`policy.optimizer_weight_decay`	Weight decay	`1e-6`
`policy.scheduler_name`	LR scheduler	`cosine`
`policy.scheduler_warmup_steps`	Warmup steps	`500`

Diffusion Policy arguments - other

Argument	Description	Default / Notes
`policy.compile_model`	Compile model	`false`
`policy.compile_mode`	Compile mode	`reduce-overhead`
`policy.do_mask_loss_for_padding`	Mask padding loss	`false`

3. π₀ (PI0)

Download pretrained weights:

# Download pretrained weights
hf download \
  lerobot/pi0_base \
  --local-dir pretrained/pi0_base

hf download \
  lerobot/pi05_base \
  --local-dir pretrained/pi05_base
  
hf download google/paligemma-3b-pt-224 \
  --local-dir pretrained/paligemma-3b-pt-224

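In /workspace/GlobalHumanoidRobotChallenge_2026_Baseline/src/lerobot/processor/tokenizer_processor.py, locate the block that loads the tokenizer and replace it with the code below, which forces PaliGemma tokenizers to load offline from a local snapshot. The snapshot path is hard-coded, so adjust it if your PaliGemma files are stored elsewhere: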

        if self.tokenizer is not None:
            # Use provided tokenizer object directly
            self.input_tokenizer = self.tokenizer
        elif self.tokenizer_name is not None:
            if AutoTokenizer is None:
                raise ImportError("AutoTokenizer is not available")

            # If tokenizer_name contains paligemma, it is a pi0 model; force local offline loading
            if "paligemma" in self.tokenizer_name.lower():
                self.input_tokenizer = AutoTokenizer.from_pretrained(
                    "/root/.cache/huggingface/hub/models--google--paligemma-3b-pt-224/snapshots/35e4f46485b4d07967e7e9935bc3786aad50687c",
                    local_files_only=True,
                )
            else:
                # Otherwise (e.g., act/smolvla), load normally using the provided tokenizer_name
                self.input_tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_name)
        else:
            raise ValueError(
                "Either 'tokenizer' or 'tokenizer_name' must be provided. "
                "Pass a tokenizer object directly or a tokenizer name to auto-load."
            )
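
Because the patch above hard-codes a snapshot hash, it can help to check which snapshot directory actually exists on your machine before editing the file. The helper below is a hypothetical sketch that assumes the usual Hugging Face cache layout (`models--<org>--<name>/snapshots/<hash>`); `find_snapshot_dir` is not part of LeRobot or the baseline.

```python
from pathlib import Path


def find_snapshot_dir(repo_id, cache_root="~/.cache/huggingface/hub"):
    """Return the most recently modified snapshot directory for repo_id, or None."""
    model_dir = Path(cache_root).expanduser() / ("models--" + repo_id.replace("/", "--"))
    snapshots = model_dir / "snapshots"
    if not snapshots.is_dir():
        return None
    candidates = [p for p in snapshots.iterdir() if p.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Prints the local snapshot path, or None if the model is not in the cache.
print(find_snapshot_dir("google/paligemma-3b-pt-224"))
```

If the printed path differs from the one in the patch, use the printed path in `AutoTokenizer.from_pretrained(...)`.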

Below are a minimal and a detailed training command for π₀ (PI0); the detailed command spells out the full hyperparameters:

# Minimal training command
/isaac-sim/python.sh src/lerobot/scripts/lerobot_train.py \
  --policy.path=lerobot/pi0_base \
  --dataset.repo_id=your_org/your_dataset \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=challenge2026_baseline/Part_Sorting/pi0 \
  --job_name=part_sorting_pi0 \
  --policy.device=cuda \
  --wandb.enable=true

# Detailed training command
/isaac-sim/python.sh src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your_org/your_dataset \
  --dataset.root=datasets/Part_Sorting \
  --policy.type=pi0 \
  --policy.paligemma_variant=gemma_2b \
  --policy.action_expert_variant=gemma_300m \
  --policy.dtype=float32 \
  --policy.n_obs_steps=1 \
  --policy.chunk_size=50 \
  --policy.n_action_steps=50 \
  --policy.max_state_dim=32 \
  --policy.max_action_dim=32 \
  --policy.num_inference_steps=10 \
  --policy.time_sampling_beta_alpha=1.5 \
  --policy.time_sampling_beta_beta=1.0 \
  --policy.time_sampling_scale=0.999 \
  --policy.time_sampling_offset=0.001 \
  --policy.min_period=0.004 \
  --policy.max_period=4.0 \
  --policy.image_resolution='[224,224]' \
  --policy.empty_cameras=0 \
  --policy.gradient_checkpointing=false \
  --policy.compile_model=false \
  --policy.compile_mode=max-autotune \
  --policy.freeze_vision_encoder=false \
  --policy.train_expert_only=false \
  --policy.optimizer_lr=2.5e-5 \
  --policy.optimizer_betas='[0.9,0.95]' \
  --policy.optimizer_eps=1e-8 \
  --policy.optimizer_weight_decay=0.01 \
  --policy.optimizer_grad_clip_norm=1.0 \
  --policy.scheduler_warmup_steps=1000 \
  --policy.scheduler_decay_steps=30000 \
  --policy.scheduler_decay_lr=2.5e-6 \
  --policy.tokenizer_max_length=48 \
  --output_dir=challenge2026_baseline/Part_Sorting/pi0 \
  --job_name=part_sorting_pi0 \
  --resume=false \
  --seed=1000 \
  --num_workers=8 \
  --batch_size=8 \
  --steps=100000 \
  --eval_freq=0 \
  --log_freq=200 \
  --save_checkpoint=true \
  --save_freq=5000 \
  --wandb.entity=your_wandb_entity

Replace your_org/your_dataset with your own dataset repo ID, replace challenge2026_baseline/Part_Sorting/pi0 with your own output path, and replace your_wandb_entity with your WandB username or team name. If you don't use WandB, you can remove the --wandb.entity argument.

π₀ policy arguments - model architecture

Argument	Description	Default / Notes
`policy.type`	Policy algorithm type	`pi0`
`policy.paligemma_variant`	PaliGemma variant	`gemma_2b`
`policy.action_expert_variant`	Action Expert variant	`gemma_300m`
`policy.dtype`	Data type	`float32`

π₀ policy arguments - input/output structure

Argument	Description	Default / Notes
`policy.n_obs_steps`	Number of observation steps	`1`
`policy.chunk_size`	Action chunk size	`50`
`policy.n_action_steps`	Action steps executed	`50`
`policy.max_state_dim`	Max state dim (padded to)	`32`
`policy.max_action_dim`	Max action dim (padded to)	`32`
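
The "padded to" note means that state and action vectors shorter than the maximum are extended to a fixed width. A minimal sketch, assuming zero padding on the right; the 14-dimensional state is an illustrative value and `pad_vector` here is a stand-in, not the library function.

```python
def pad_vector(values, target_dim=32):
    """Right-pad a vector with zeros up to target_dim."""
    if len(values) > target_dim:
        raise ValueError(f"vector of length {len(values)} exceeds {target_dim}")
    return list(values) + [0.0] * (target_dim - len(values))


state = [0.1] * 14            # illustrative robot state
padded = pad_vector(state)    # width now matches max_state_dim
print(len(padded))  # 32
```

A robot whose state or action vector has more than 32 entries would need larger `max_state_dim` / `max_action_dim` values.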

π₀ policy arguments - flow matching

Argument	Description	Default / Notes
`policy.num_inference_steps`	Denoising steps (inference)	`10`
`policy.time_sampling_beta_alpha`	Time-sampling beta α	`1.5`
`policy.time_sampling_beta_beta`	Time-sampling beta β	`1.0`
`policy.time_sampling_scale`	Time-sampling scale	`0.999`
`policy.time_sampling_offset`	Time-sampling offset	`0.001`
`policy.min_period`	Minimum period	`0.004`
`policy.max_period`	Maximum period	`4.0`
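
The four time-sampling parameters describe how the flow-matching time is drawn during training. A sketch under the assumption that a Beta(α, β) sample is scaled and shifted, so that times fall in [offset, offset + scale]; `sample_time` is a hypothetical name and the standard-library generator stands in for the framework's sampler.

```python
import random


def sample_time(alpha=1.5, beta=1.0, scale=0.999, offset=0.001, rng=random):
    """Draw one flow-matching time: Beta(alpha, beta) * scale + offset."""
    return rng.betavariate(alpha, beta) * scale + offset


random.seed(1000)
samples = [sample_time() for _ in range(5)]
print(samples)
```

With α = 1.5 and β = 1.0 the density rises toward the upper end of the interval, and the offset keeps the sampled time strictly above zero.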

π₀ policy arguments - images & cameras

Argument	Description	Default / Notes
`policy.image_resolution`	Image resolution (H, W)	`[224,224]`
`policy.empty_cameras`	Number of empty cameras	`0`

π₀ policy arguments - training settings

Argument	Description	Default / Notes
`policy.gradient_checkpointing`	Enable gradient checkpointing	`false`
`policy.compile_model`	Compile model	`false`
`policy.compile_mode`	Compile mode	`max-autotune`

π₀ policy arguments - fine-tuning

Argument	Description	Default / Notes
`policy.freeze_vision_encoder`	Freeze vision encoder	`false`
`policy.train_expert_only`	Train Action Expert only	`false`

π₀ policy arguments - optimizer

Argument	Description	Default / Notes
`policy.optimizer_lr`	Learning rate	`2.5e-5`
`policy.optimizer_betas`	AdamW betas	`[0.9,0.95]`
`policy.optimizer_eps`	AdamW eps	`1e-8`
`policy.optimizer_weight_decay`	Weight decay	`0.01`
`policy.optimizer_grad_clip_norm`	Gradient clip norm	`1.0`

π₀ policy arguments - LR scheduler

Argument	Description	Default / Notes
`policy.scheduler_warmup_steps`	Warmup steps	`1000`
`policy.scheduler_decay_steps`	Decay steps	`30000`
`policy.scheduler_decay_lr`	Decay learning rate	`2.5e-6`
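
The three scheduler values combine with `policy.optimizer_lr` into a warmup-then-cosine-decay curve. The sketch below assumes a linear warmup followed by a cosine decay that is measured from step 0 and held at `scheduler_decay_lr` once `scheduler_decay_steps` is reached; the exact warmup shape in the training code may differ, and `lr_at` is a hypothetical name.

```python
import math


def lr_at(step, peak_lr=2.5e-5, decay_lr=2.5e-6,
          warmup_steps=1000, decay_steps=30000):
    """Learning rate at a given step (assumed warmup + cosine decay form)."""
    if step < warmup_steps:
        # Linear ramp up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    clipped = min(step, decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * clipped / decay_steps))
    return decay_lr + (peak_lr - decay_lr) * cosine


for s in (0, 999, 15000, 30000, 100000):
    print(s, lr_at(s))
```

Under these assumptions, a 100000-step run spends the last 70000 steps at the floor value of 2.5e-6, so `scheduler_decay_steps` may need to be raised to match longer runs.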

π₀ policy arguments - tokenizer

Argument	Description	Default / Notes
`policy.tokenizer_max_length`	Max tokenizer length	`48`

4. π₀.₅ (PI05)

Below are a minimal and a detailed training command for π₀.₅ (PI05); the detailed command spells out the full hyperparameters. π₀.₅ is a successor to π₀ aimed at open-world generalization. Key differences include QUANTILES normalization, a longer tokenizer length, and AdaRMS conditioning; they are summarized in the table at the end of this section.

# Minimal training command
/isaac-sim/python.sh src/lerobot/scripts/lerobot_train.py \
    --dataset.repo_id=your_org/your_dataset \
    --policy.type=pi05 \
    --output_dir=challenge2026_baseline/Part_Sorting/pi05 \
    --job_name=part_sorting_pi05 \
    --policy.repo_id=your_repo_id \
    --policy.pretrained_path=lerobot/pi05_base \
    --policy.compile_model=true \
    --policy.gradient_checkpointing=true \
    --wandb.enable=true \
    --policy.dtype=bfloat16 \
    --policy.freeze_vision_encoder=false \
    --policy.train_expert_only=false \
    --steps=3000 \
    --policy.device=cuda \
    --batch_size=32

# Detailed training command
/isaac-sim/python.sh src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your_org/your_dataset \
  --dataset.root=datasets/Part_Sorting/ \
  --policy.type=pi05 \
  --policy.paligemma_variant=gemma_2b \
  --policy.action_expert_variant=gemma_300m \
  --policy.dtype=float32 \
  --policy.n_obs_steps=1 \
  --policy.chunk_size=50 \
  --policy.n_action_steps=50 \
  --policy.max_state_dim=32 \
  --policy.max_action_dim=32 \
  --policy.num_inference_steps=10 \
  --policy.time_sampling_beta_alpha=1.5 \
  --policy.time_sampling_beta_beta=1.0 \
  --policy.time_sampling_scale=0.999 \
  --policy.time_sampling_offset=0.001 \
  --policy.min_period=0.004 \
  --policy.max_period=4.0 \
  --policy.image_resolution='[224,224]' \
  --policy.empty_cameras=0 \
  --policy.gradient_checkpointing=false \
  --policy.compile_model=false \
  --policy.compile_mode=max-autotune \
  --policy.freeze_vision_encoder=false \
  --policy.train_expert_only=false \
  --policy.optimizer_lr=2.5e-5 \
  --policy.optimizer_betas='[0.9,0.95]' \
  --policy.optimizer_eps=1e-8 \
  --policy.optimizer_weight_decay=0.01 \
  --policy.optimizer_grad_clip_norm=1.0 \
  --policy.scheduler_warmup_steps=1000 \
  --policy.scheduler_decay_steps=30000 \
  --policy.scheduler_decay_lr=2.5e-6 \
  --policy.tokenizer_max_length=200 \
  --output_dir=challenge2026_baseline/Part_Sorting/pi05 \
  --job_name=part_sorting_pi05 \
  --resume=false \
  --seed=1000 \
  --num_workers=8 \
  --batch_size=8 \
  --steps=100000 \
  --eval_freq=0 \
  --log_freq=200 \
  --save_checkpoint=true \
  --save_freq=5000 \
  --wandb.entity=your_wandb_entity

Replace your_org/your_dataset with your own dataset repo ID, replace challenge2026_baseline/Part_Sorting/pi05 with your own output path, and replace your_wandb_entity with your WandB username or team name. If you don't use WandB, you can remove the --wandb.entity argument.

π₀.₅ policy arguments - model architecture

Argument	Description	Default / Notes
`policy.type`	Policy algorithm type	`pi05`
`policy.paligemma_variant`	PaliGemma variant	`gemma_2b`
`policy.action_expert_variant`	Action Expert variant	`gemma_300m`
`policy.dtype`	Data type	`float32`

π₀.₅ policy arguments - input/output structure

Argument	Description	Default / Notes
`policy.n_obs_steps`	Number of observation steps	`1`
`policy.chunk_size`	Action chunk size	`50`
`policy.n_action_steps`	Action steps executed	`50`
`policy.max_state_dim`	Max state dim (padded to)	`32`
`policy.max_action_dim`	Max action dim (padded to)	`32`

π₀.₅ policy arguments - flow matching

Argument	Description	Default / Notes
`policy.num_inference_steps`	Denoising steps (inference)	`10`
`policy.time_sampling_beta_alpha`	Time-sampling beta α	`1.5`
`policy.time_sampling_beta_beta`	Time-sampling beta β	`1.0`
`policy.time_sampling_scale`	Time-sampling scale	`0.999`
`policy.time_sampling_offset`	Time-sampling offset	`0.001`
`policy.min_period`	Minimum period	`0.004`
`policy.max_period`	Maximum period	`4.0`

π₀.₅ policy arguments - images & cameras

Argument	Description	Default / Notes
`policy.image_resolution`	Image resolution (H, W)	`[224,224]`
`policy.empty_cameras`	Number of empty cameras	`0`

π₀.₅ policy arguments - training settings

Argument	Description	Default / Notes
`policy.gradient_checkpointing`	Enable gradient checkpointing	`false`
`policy.compile_model`	Compile model	`false`
`policy.compile_mode`	Compile mode	`max-autotune`

π₀.₅ policy arguments - fine-tuning

Argument	Description	Default / Notes
`policy.freeze_vision_encoder`	Freeze vision encoder	`false`
`policy.train_expert_only`	Train Action Expert only	`false`

π₀.₅ policy arguments - optimizer

Argument	Description	Default / Notes
`policy.optimizer_lr`	Learning rate	`2.5e-5`
`policy.optimizer_betas`	AdamW betas	`[0.9,0.95]`
`policy.optimizer_eps`	AdamW eps	`1e-8`
`policy.optimizer_weight_decay`	Weight decay	`0.01`
`policy.optimizer_grad_clip_norm`	Gradient clip norm	`1.0`

π₀.₅ policy arguments - LR scheduler

Argument	Description	Default / Notes
`policy.scheduler_warmup_steps`	Warmup steps	`1000`
`policy.scheduler_decay_steps`	Decay steps	`30000`
`policy.scheduler_decay_lr`	Decay learning rate	`2.5e-6`

π₀.₅ policy arguments - tokenizer

Argument	Description	Default / Notes
`policy.tokenizer_max_length`	Max tokenizer length	`200` (π₀ uses `48`)

Key differences between π₀ and π₀.₅

Feature	π₀	π₀.₅
Time conditioning injection	Concatenate time and action via `action_time_mlp_*`	AdaRMS conditioning via `time_mlp_*`
AdaRMS	Not used	Used in Action Expert
Tokenizer length	48 tokens	200 tokens
Discrete state input	False (uses `state_proj` layer)	True
Parameter count	Higher (includes state embedding)	Lower (no state embedding)
State normalization	MEAN_STD	QUANTILES
Action normalization	MEAN_STD	QUANTILES
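
The two normalization modes in the last rows of the table differ in which dataset statistics they use. A minimal sketch, assuming that QUANTILES maps the 1st and 99th percentiles to -1 and 1 (the choice of percentiles is an assumption here) and that MEAN_STD is the usual z-score; the function names are hypothetical.

```python
def normalize_mean_std(x, mean, std, eps=1e-8):
    """MEAN_STD: center on the mean and scale by the standard deviation."""
    return (x - mean) / (std + eps)


def normalize_quantiles(x, q01, q99, eps=1e-8):
    """QUANTILES: map the [q01, q99] range linearly onto [-1, 1]."""
    return 2.0 * (x - q01) / (q99 - q01 + eps) - 1.0


# Illustrative statistics for one joint angle (not taken from any dataset).
print(normalize_mean_std(0.5, mean=0.2, std=0.3))
print(normalize_quantiles(0.5, q01=-0.4, q99=0.8))
```

Quantile-based scaling is less sensitive to rare outliers than the mean and standard deviation, which is one plausible reason for the switch.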

5. SmolVLA

Below is a fine-tuning example using the SmolVLA policy. SmolVLA is built on the SmolVLM2-500M-Video-Instruct vision-language model and supports open-world generalization.

# Minimal training command
/isaac-sim/python.sh src/lerobot/scripts/lerobot_train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=your_org/your_dataset \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=challenge2026_baseline/Part_Sorting/smolvla \
  --job_name=part_sorting_smolvla \
  --policy.device=cuda \
  --wandb.enable=true

# Detailed training command
/isaac-sim/python.sh src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your_org/your_dataset \
  --dataset.root=datasets/Part_Sorting/ \
  --policy.type=smolvla \
  --policy.vlm_model_name=HuggingFaceTB/SmolVLM2-500M-Video-Instruct \
  --policy.load_vlm_weights=true \
  --policy.dtype=float32 \
  --policy.n_obs_steps=1 \
  --policy.chunk_size=50 \
  --policy.n_action_steps=50 \
  --policy.max_state_dim=32 \
  --policy.max_action_dim=32 \
  --policy.num_steps=10 \
  --policy.tokenizer_max_length=48 \
  --policy.image_resolution='[224,224]' \
  --policy.empty_cameras=0 \
  --policy.freeze_vision_encoder=true \
  --policy.train_expert_only=true \
  --policy.train_state_proj=true \
  --policy.gradient_checkpointing=false \
  --policy.compile_model=false \
  --policy.compile_mode=max-autotune \
  --policy.attention_mode=cross_attn \
  --policy.num_vlm_layers=16 \
  --policy.self_attn_every_n_layers=2 \
  --policy.expert_width_multiplier=0.75 \
  --policy.optimizer_lr=1e-4 \
  --policy.optimizer_betas='[0.9,0.95]' \
  --policy.optimizer_eps=1e-8 \
  --policy.optimizer_weight_decay=1e-10 \
  --policy.optimizer_grad_clip_norm=10.0 \
  --policy.scheduler_warmup_steps=1000 \
  --policy.scheduler_decay_steps=30000 \
  --policy.scheduler_decay_lr=2.5e-6 \
  --policy.min_period=0.004 \
  --policy.max_period=4.0 \
  --output_dir=challenge2026_baseline/Part_Sorting/smolvla \
  --job_name=part_sorting_smolvla \
  --resume=false \
  --seed=1000 \
  --num_workers=8 \
  --batch_size=8 \
  --steps=100000 \
  --eval_freq=0 \
  --log_freq=200 \
  --save_checkpoint=true \
  --save_freq=5000 \
  --wandb.entity=your_wandb_entity

Replace your_org/your_dataset with your own dataset repo ID, replace challenge2026_baseline/Part_Sorting/smolvla with your own output path, and replace your_wandb_entity with your WandB username or team name. If you don't use WandB, you can remove the --wandb.entity argument.

SmolVLA policy arguments - model architecture

Argument	Description	Default / Notes
`policy.type`	Policy algorithm type	`smolvla`
`policy.vlm_model_name`	VLM backbone	`HuggingFaceTB/SmolVLM2-500M-Video-Instruct`
`policy.load_vlm_weights`	Load pretrained VLM weights	`true`
`policy.dtype`	Data type	`float32`

SmolVLA policy arguments - input/output structure

Argument	Description	Default / Notes
`policy.n_obs_steps`	Number of observation steps	`1`
`policy.chunk_size`	Action chunk size	`50`
`policy.n_action_steps`	Action steps executed	`50`
`policy.max_state_dim`	Max state dim (padded to)	`32`
`policy.max_action_dim`	Max action dim (padded to)	`32`

SmolVLA policy arguments - decoding & tokenizer

Argument	Description	Default / Notes
`policy.num_steps`	Denoising steps (inference)	`10`
`policy.tokenizer_max_length`	Max tokenizer length	`48`
`policy.use_cache`	Use attention cache	`true`

SmolVLA policy arguments - images & cameras

Argument	Description	Default / Notes
`policy.image_resolution`	Image preprocessing resolution (H, W)	`[224,224]`
`policy.empty_cameras`	Number of empty cameras	`0`
`policy.add_image_special_tokens`	Use image special tokens	`false`

SmolVLA policy arguments - fine-tuning

Argument	Description	Default / Notes
`policy.freeze_vision_encoder`	Freeze vision encoder	`true`
`policy.train_expert_only`	Train Action Expert only	`true`
`policy.train_state_proj`	Train state projection	`true`

SmolVLA policy arguments - transformer architecture

Argument	Description	Default / Notes
`policy.attention_mode`	Attention mode	`cross_attn`
`policy.num_vlm_layers`	Number of VLM layers used	`16`
`policy.self_attn_every_n_layers`	Insert self-attention every N layers	`2`
`policy.expert_width_multiplier`	Action Expert hidden width multiplier	`0.75`
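
As a rough illustration of `policy.expert_width_multiplier`, the Action Expert's hidden width can be thought of as a fraction of the VLM's hidden width. The sketch assumes the width is obtained by truncating the product to an integer, and the VLM hidden size of 960 is an assumed example value rather than something stated in this guide.

```python
def expert_hidden_size(vlm_hidden_size, expert_width_multiplier=0.75):
    """Hidden width of the Action Expert relative to the VLM (assumed rule)."""
    return int(vlm_hidden_size * expert_width_multiplier)


vlm_hidden_size = 960  # assumed example value for the VLM backbone
print(expert_hidden_size(vlm_hidden_size))  # 720
```

A smaller multiplier shrinks the expert and speeds up action decoding, at the cost of capacity.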

SmolVLA policy arguments - optimizer

Argument	Description	Default / Notes
`policy.optimizer_lr`	Learning rate	`1e-4`
`policy.optimizer_betas`	AdamW betas	`[0.9,0.95]`
`policy.optimizer_eps`	AdamW eps	`1e-8`
`policy.optimizer_weight_decay`	Weight decay	`1e-10`
`policy.optimizer_grad_clip_norm`	Gradient clip norm	`10.0`

SmolVLA policy arguments - LR scheduler

Argument	Description	Default / Notes
`policy.scheduler_warmup_steps`	Warmup steps	`1000`
`policy.scheduler_decay_steps`	Decay steps	`30000`
`policy.scheduler_decay_lr`	Decay learning rate	`2.5e-6`

SmolVLA policy arguments - training settings

Argument	Description	Default / Notes
`policy.gradient_checkpointing`	Enable gradient checkpointing	`false`
`policy.compile_model`	Compile model	`false`
`policy.compile_mode`	Compile mode	`max-autotune`
