arxiv:2604.15308

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

Published on Apr 16

Submitted by Hao Gao on Apr 17

#3 Paper of the day

Huazhong University of Science and Technology

Authors:

Hao Gao ,

Abstract

A unified generator-discriminator framework for autonomous driving motion planning that improves stability and performance through diffusion-based trajectory generation and reinforcement learning optimization.

AI-generated summary

High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.
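The generate-then-rerank idea in the abstract can be illustrated with a small toy sketch. Everything here is an assumption made for illustration, not the authors' implementation: `generate_candidates` is a hypothetical stand-in for the diffusion-based generator (it only samples noisy straight-line trajectories), and `score` is a hypothetical hand-written stand-in for the learned, RL-optimized discriminator.

```python
import random


def generate_candidates(k, horizon, rng):
    """Hypothetical stand-in for the generator: k noisy straight-line trajectories."""
    candidates = []
    for _ in range(k):
        speed = rng.uniform(2.0, 8.0)     # longitudinal metres per step
        lateral = rng.uniform(-2.0, 2.0)  # constant lateral offset in metres
        candidates.append([(speed * (t + 1), lateral) for t in range(horizon)])
    return candidates


def score(trajectory, obstacle, safe_dist=2.0):
    """Hypothetical stand-in for the discriminator: progress minus a collision penalty."""
    progress = trajectory[-1][0]
    min_dist = min(
        ((x - obstacle[0]) ** 2 + (y - obstacle[1]) ** 2) ** 0.5
        for x, y in trajectory
    )
    penalty = 100.0 if min_dist < safe_dist else 0.0
    return progress - penalty


def plan(k, horizon, obstacle, seed=0):
    """Sample k candidates, score each one, and return the highest-scoring trajectory."""
    rng = random.Random(seed)
    candidates = generate_candidates(k, horizon, rng)
    scores = [score(c, obstacle) for c in candidates]
    best = max(range(k), key=scores.__getitem__)
    return candidates[best], scores


if __name__ == "__main__":
    best_trajectory, all_scores = plan(k=16, horizon=5, obstacle=(20.0, 0.0))
    print("best score:", max(all_scores))
    print("selected trajectory:", best_trajectory)
```

The point of the decoupling is visible even in this toy: the scalar feedback only has to rank a handful of candidates, and never has to shape the full trajectory space directly.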


Community

Hao605

Paper author Paper submitter about 16 hours ago

RAD-2 synergizes a diffusion-based generator 𝒢 and a Transformer-based discriminator 𝒟 within a multi-stage optimization loop:

(a) Pre-training Stage: 𝒢 is initialized via imitation learning to capture multi-modal trajectory priors from expert demonstrations.

(b) Closed-loop Rollout: The joint policy, integrating 𝒢 for generation and 𝒟 for selection, interacts with the high-throughput BEV-Warp environment to generate diverse rollout data.

(c) Discriminator Optimization: 𝒟 is optimized via Temporally Consistent Group Relative Policy Optimization, leveraging closed-loop feedback to enhance its scoring precision.

(d) Generator Optimization: Through On-policy Generator Optimization, 𝒢 is optimized via structured longitudinal optimizations derived from low-reward rollouts, effectively shifting its distribution toward safer and more efficient driving behaviors.
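Stage (c) builds on Group Relative Policy Optimization. As a rough illustration, the sketch below computes the group-normalized advantage used in standard GRPO, where each rollout's reward is compared with the mean and standard deviation of its own group. The temporally consistent extension mentioned above is not reproduced, since the comment does not specify it. The function name, the `eps` value, and the choice of population standard deviation are assumptions of this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and standard deviation.

    `rewards` holds the scalar returns of rollouts that belong to one group.
    `eps` (an assumed value) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group of four rollouts: two succeeded (1.0), two collided (0.0).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Rollouts that beat their group average receive positive advantages and the rest receive negative ones, so no separately learned value baseline is needed.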