HumanoidBench-DR.Q · Self-trained checkpoints that clear the success bar

Self-trained DR.Q checkpoints that beat the public dmux/DR.Q baseline on HumanoidBench locomotion tasks.

🛠 Training source: https://github.com/vitorcen/humanoid-training (full training scripts, patches, eval harness, and analysis docs are all in the companion GitHub repo).

DR.Q is a TD3-family off-policy RL algorithm with model-based representation learning (the encoder and policy needed for inference total ~13 MB). This repo hosts checkpoints trained from scratch on HumanoidBench that clear the locomotion success bar.

📊 Performance

Task	success_rate	mean_return	N	Public baseline comparison
`h1-walk-v0`	90%	801.05	10 ep × seed 0	dmux/DR.Q seed 0: ~30% / mean ~530
`g1-walk-v0`	70%	710.52	10 ep × seed 0	torque baseline: 0% / mean ~100 (7.1× improvement)

success_bar = 700 (HumanoidBench locomotion threshold). All numbers from deterministic eval with action_repeat=2. Raw JSONL in eval/.
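
The summary columns follow directly from per-episode returns: an episode counts as a success when its return reaches the success bar. A minimal sketch of that aggregation, assuming each per-episode JSONL row carries a numeric `return` field and that a return of exactly 700 counts as a success (the field name, the tie rule, and the sample values are hypothetical; check the files in eval/ for the actual schema):

```python
import json

SUCCESS_BAR = 700.0  # HumanoidBench locomotion threshold quoted above


def summarize(jsonl_lines, success_bar=SUCCESS_BAR):
    """Aggregate per-episode returns into success_rate and mean_return.

    Assumes one JSON object per line; rows without a "return" key
    (for example a summary row) are skipped. Hypothetical schema.
    """
    rows = [json.loads(line) for line in jsonl_lines if line.strip()]
    returns = [float(row["return"]) for row in rows if "return" in row]
    if not returns:
        raise ValueError("no per-episode rows found")
    successes = sum(r >= success_bar for r in returns)
    return {
        "n_episodes": len(returns),
        "success_rate": successes / len(returns),
        "mean_return": sum(returns) / len(returns),
    }


# Made-up rows, for illustration only
rows = [
    '{"return": 820.0}',
    '{"return": 650.0}',
    '{"return": 705.5}',
    '{"return": 710.0}',
    '{"summary": true}',
]
summary = summarize(rows)
print(summary)
```

With N = 10 episodes per task, each episode moves the success rate by 10 percentage points, which is worth keeping in mind when comparing the numbers in the table.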

🎬 Demos

H1-walk-v0 (Unitree H1, 19 DoF)

G1-walk-v0 (Unitree G1, 23 DoF with BlockedHands wrapper)

📦 Repo layout

HumanoidBench-DR.Q/
├── DRQ+HBench-h1-walk-v0+0/        # H1-walk self-trained ckpt (76 MB)
│   ├── encoder.pt   policy.pt   agent_var.npy        ← inference (~13 MB)
│   ├── *_target.pt × 3                                ← Q-learning targets
│   ├── *_optimizer.pt × 3                             ← Adam states (resume)
│   ├── value.pt                                       ← critic
│   └── exp_var.npy                                    ← exploration variance
├── DRQ+HBench-g1-walk-v0+0/        # G1-walk self-trained ckpt (76 MB)
│   └── ... (same 11-file layout)
├── eval/                            # Final eval JSONL (per-episode + summary row)
└── assets/                          # MP4 demos

Only 3 files are needed for inference: encoder.pt, policy.pt, and agent_var.npy (~13 MB total). The other 8 files are for resuming training and for the Q-learning targets.
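
To make the inference subset concrete, here is a small sketch that checks a checkpoint folder for the three inference files and builds download patterns restricted to them. The helper names are hypothetical and not part of the companion repo:

```python
from pathlib import Path

# The three inference files named above; everything else is only for resuming training
INFERENCE_FILES = ("encoder.pt", "policy.pt", "agent_var.npy")


def missing_inference_files(ckpt_dir):
    """Return the inference files absent from ckpt_dir (empty list if complete)."""
    ckpt_dir = Path(ckpt_dir)
    return [name for name in INFERENCE_FILES if not (ckpt_dir / name).is_file()]


def inference_patterns(folder):
    """Build per-file patterns for one checkpoint folder, e.g. for a partial download."""
    return [f"{folder}/{name}" for name in INFERENCE_FILES]


patterns = inference_patterns("DRQ+HBench-h1-walk-v0+0")
print(patterns)
```

The resulting list can be passed as `allow_patterns` to `huggingface_hub.snapshot_download`, which accepts either a single pattern or a list, so that only the ~13 MB inference subset is fetched instead of the full 76 MB folder.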

🚀 Load & inference

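# Minimal inference loader; see scripts/drq_viewer.py in the companion repo for the full version
import os

import numpy as np
import torch
from huggingface_hub import snapshot_download

# Download only the H1-walk checkpoint folder from the Hub
repo_dir = snapshot_download(
    repo_id="wsagi/HumanoidBench-DR.Q",
    allow_patterns="DRQ+HBench-h1-walk-v0+0/*",
)
ckpt_dir = os.path.join(repo_dir, "DRQ+HBench-h1-walk-v0+0")

# agent_var.npy stores a pickled Python object, hence allow_pickle=True and .item()
var = np.load(os.path.join(ckpt_dir, "agent_var.npy"), allow_pickle=True).item()
# Build the encoder and policy modules (see scripts/drq_viewer.py), then load their weights:
# encoder = Encoder(obs_dim, ...);  encoder.load_state_dict(torch.load(.../encoder.pt))
# policy  = Policy(...);            policy.load_state_dict(torch.load(.../policy.pt))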

For the full loading path, see scripts/drq_viewer.py in the companion repo vitorcen/humanoid-training.

git clone --recursive https://github.com/vitorcen/humanoid-training
cd humanoid-training
bash patches/apply.sh                     # apply DR.Q + HumanoidBench local patches
DISPLAY=:0 python scripts/drq_viewer.py --task h1-walk-v0 --seed 0
DISPLAY=:0 python scripts/drq_viewer.py --task g1-walk-v0 --seed 0

⚠️ Required patches for G1-walk

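G1 does not clear the bar out of the box; it needs two layers of patches (see g1_training_strategies.html for details):

Patch	Purpose
`patches/g1-pos-control.patch`	Switches G1 from its default torque control to PD position control (matching H1); about 4× better sample efficiency
`patches/humanoid-bench-g1-blocked-hands.patch`	Extends `BlockedHandsLocoWrapper` to support G1 and masks the 14 finger dimensions (37D → 23D action), keeping noise out of the encoder dynamics loss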

The G1 raw-torque baseline was trained for 1M steps and stayed at 0% success (mean return ~100). The combined patches lift it to 70% success (mean return 711).
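
For readers unfamiliar with the distinction: under PD position control the policy outputs target joint positions, and a low-level controller converts them to torques. A generic sketch of that conversion follows. The gains, torque limit, and function name are illustrative assumptions and are not taken from patches/g1-pos-control.patch:

```python
import numpy as np


def pd_torque(q_target, q, qd, kp=100.0, kd=5.0, torque_limit=50.0):
    """Generic PD law: tau = kp * (q_target - q) - kd * qd, clipped to a torque limit.

    kp, kd, and torque_limit are placeholder values, not the G1's real gains.
    """
    q_target, q, qd = (np.asarray(x, dtype=float) for x in (q_target, q, qd))
    tau = kp * (q_target - q) - kd * qd
    return np.clip(tau, -torque_limit, torque_limit)


# Three joints: position error only, velocity damping only, negative position error
tau = pd_torque(q_target=[0.1, 0.0, -0.2], q=[0.0, 0.0, 0.0], qd=[0.0, 1.0, 0.0])
print(tau)
```

The practical effect is that a zero or noisy action maps to a bounded corrective torque around a pose rather than to a raw torque, which is one plausible reason exploration is easier in this action space.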

Root cause (as diagnosed with OpenCode deepseek-v4-pro): DR.Q's isotropic σ=0.2 exploration noise perturbs the fingers at almost every step in the 37D action space, so the encoder's dynamics loss is forced to model irrelevant 14-DoF finger dynamics, leading to catastrophic forgetting around 250k steps.
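
The masking idea can be sketched independently of the actual wrapper: the policy acts in the 23D body space, and each action is expanded to the 37D environment space with the finger entries pinned to zero, so exploration noise never reaches them. The index layout (fingers as the last 14 entries), the zero fill value, and the function name are assumptions for illustration; the real `BlockedHandsLocoWrapper` may order and fill the dimensions differently.

```python
import numpy as np

FULL_DIM = 37                     # G1 action size with hands
FINGER_DIM = 14                   # finger dimensions that get blocked
BODY_DIM = FULL_DIM - FINGER_DIM  # 23D action seen by the policy

# Assumption: fingers occupy the last 14 indices; the real layout may differ
BODY_IDX = np.arange(BODY_DIM)


def expand_action(body_action):
    """Map a 23D body action to a 37D env action with finger entries held at zero."""
    body_action = np.asarray(body_action, dtype=float)
    if body_action.shape != (BODY_DIM,):
        raise ValueError(f"expected shape ({BODY_DIM},), got {body_action.shape}")
    full = np.zeros(FULL_DIM)
    full[BODY_IDX] = body_action
    return full


# Exploration noise (sigma = 0.2) is added in the 23D space only
rng = np.random.default_rng(0)
noisy_body = np.zeros(BODY_DIM) + rng.normal(0.0, 0.2, BODY_DIM)
full_action = expand_action(noisy_body)
print(full_action.shape, np.abs(full_action[BODY_DIM:]).max())
```

Because the noise is sampled before expansion, the finger entries stay exactly constant across steps, so there is no finger motion for the dynamics loss to model.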

🔧 Training config

	H1-walk	G1-walk
Algorithm	DR.Q (TD3 + zs encoder)	DR.Q + PD control + BlockedHands
Env steps	500,000	500,000
Wall time	6.6 h	3.0 h
GPU	RTX 4090	RTX 4090
`action_repeat`	2	2
`save_freq`	50,000	50,000
Watcher	slice-based auto-eval + early-stop (LeIsaac-inspired)	same

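Training pipeline (three parallel processes; see the companion repo README for details):

A) DR.Q main.py: the main training run
B) scripts/train_watcher.py: splits the run into 10 slices, aggregates them in real time, and issues PROGRESS/UNDERFIT/OVERFIT/DEAD verdicts for early stopping
C) scripts/ckpt_eval_loop.py: whenever a new ckpt appears, mirrors it to the HF cache and runs an N=3 deterministic eval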
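
The slice-based aggregation in (B) can be illustrated as follows: split the logged episode returns into 10 slices, average each, and derive a verdict from the trend. Only the 10-slice split and the verdict names come from the description above; the thresholds and decision rules are hypothetical placeholders, not the logic in scripts/train_watcher.py.

```python
import numpy as np


def slice_means(returns, n_slices=10):
    """Split a return history into n_slices contiguous chunks and average each."""
    returns = np.asarray(returns, dtype=float)
    if returns.size < n_slices:
        raise ValueError("need at least one episode per slice")
    return np.array([chunk.mean() for chunk in np.array_split(returns, n_slices)])


def verdict(means, dead_below=50.0, drop_frac=0.3):
    """Hypothetical rules mapping slice means to a watcher verdict."""
    best, first, last = means.max(), means[0], means[-1]
    if best < dead_below:
        return "DEAD"        # never got off the ground
    if last < (1.0 - drop_frac) * best:
        return "OVERFIT"     # collapsed well below an earlier peak
    if last > first:
        return "PROGRESS"    # still improving over the first slice
    return "UNDERFIT"        # flat: no improvement across the run


# Synthetic return histories, for illustration only
rising = np.linspace(0.0, 1000.0, 100)
collapsed = np.concatenate([np.linspace(0.0, 800.0, 50), np.full(50, 100.0)])
print(verdict(slice_means(rising)), verdict(slice_means(collapsed)))
```

A watcher built this way would stop the run on DEAD or OVERFIT and let it continue on PROGRESS; how the actual script reacts to each verdict is documented in the companion repo.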

📚 Citations

@article{sferrazza2024humanoidbench,
  title={HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation},
  author={Sferrazza, Carmelo and Huang, Dun-Ming and Lin, Xingyu and Lee, Youngwoon and Abbeel, Pieter},
  journal={Robotics: Science and Systems},
  year={2024}
}

@article{yarats2022mastering,
  title={Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning},
  author={Yarats, Denis and Fergus, Rob and Lazaric, Alessandro and Pinto, Lerrel},
  journal={ICLR},
  year={2022}
}

📄 License

MIT — same as base DR.Q and HumanoidBench.

Companion repository: github.com/vitorcen/humanoid-training — full training scripts, patches, eval harness, and analysis docs.
