Supervised open-loop training has been widely adopted for traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that improves scenario realism by aligning simulator rollouts with real-world data distributions and that distills goal-conditioned controllability into scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and conduct comprehensive experiments on the Waymo Open Motion Dataset. Our results show consistent improvements in realism, achieving state-of-the-art performance. Compared with heuristic search-based fine-tuning methods, RLFTSim requires significantly fewer samples thanks to our proposed low-variance, dense reward signal, and it addresses the realism alignment issue directly by design. We also demonstrate the effectiveness of our approach for distilling controllability into traffic simulation via goal conditioning.
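The fine-tuning loop described above can be sketched as a REINFORCE-style update in which every simulation step receives a dense realism reward. This is a minimal illustrative sketch, not the paper's implementation: the state-distance reward, the unit-variance linear-Gaussian policy, and the names `dense_reward` and `reinforce_update` are all our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_reward(sim_state, log_state):
    # Dense, per-step reward: negative distance between the simulated
    # agent state and the logged real-world state (an illustrative
    # stand-in for the paper's fidelity/controllability reward).
    return -float(np.linalg.norm(sim_state - log_state))

def reinforce_update(theta, states, actions, rewards, lr=1e-2):
    # One policy-gradient step over a closed-loop rollout, assuming a
    # unit-variance Gaussian policy with mean theta @ state, so that
    # grad log pi(a | s) = outer(a - mu, s).
    returns = np.cumsum(rewards[::-1])[::-1]  # reward-to-go per step
    grad = np.zeros_like(theta)
    for s, a, g in zip(states, actions, returns):
        mu = theta @ s
        grad += np.outer(a - mu, s) * g
    return theta + lr * grad / len(states)

# Toy rollout: 4-d states, 2-d actions, rewards against a logged trace.
states = [rng.normal(size=4) for _ in range(10)]
actions = [rng.normal(size=2) for _ in range(10)]
logged = [rng.normal(size=4) for _ in range(10)]
rewards = [dense_reward(s, l) for s, l in zip(states, logged)]
theta = reinforce_update(np.zeros((2, 4)), states, actions, rewards)
```

Because the reward is dense (one signal per simulation step rather than one per episode), the return estimate averages over many terms, which is what keeps the gradient variance low.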
Side-by-side comparison of the base model (pre-train) and RLFTSim (post-train) on challenging traffic scenarios.
Figure 1 — Collision & Off-road: The pre-trained model generates unrealistic off-road behavior and a collision with cross-traffic, while the post-trained model (RLFTSim) produces realistic lane-following behavior that respects traffic rules.
Figure S1 — Collision 1: In the pre-trained model, the vehicle entering the traffic circle fails to yield to a pedestrian and collides with them. In the post-trained model, the vehicle yields to the pedestrian.
Figure S2 — Collision 2: The pre-trained model produces a rear-end collision between two vehicles at the bottom of the scene. The post-trained model avoids this accident.
Figure S3 — Collision 3: The pre-trained model's parked vehicle attempts to enter the road, colliding with a passing vehicle. The post-trained model waits for the road to clear before entering.
Figure S4 — Off-road: The pre-trained model's cyclist goes off-road. The post-trained model (RLFTSim) adheres to the drivable area.
Controllability via goal-conditioning. The goal point is shown as a magenta circle. We compare various goal representations.
Panel titles — Successful U-Turn, Failed Left Turn (pre-trained); Successful U-Turn, Successful Left Turn (cat); Successful U-Turn, Successful Left Turn (ind).
Figure 1 (right) — GCFT Visualization. The goal point is shown with a magenta circle. We show results for concatenation (cat) and edge-indication (ind) goal representations with hard goal targets. The pre-trained model fails at the left turn, while both GCFT variants succeed.
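As a rough sketch of the two goal representations named in the caption: "cat" concatenates the goal coordinates onto every agent feature, while "ind" injects the goal as an extra scene token plus a binary indicator channel that lets the model distinguish edges to it. The function names, shapes, and encoding details below are our own illustrative assumptions, not the actual model code.

```python
import numpy as np

def goal_cat(agent_feats, goal_xy):
    # "cat": tile the 2-d goal point and concatenate it onto each
    # agent's feature vector (shape [N, D] -> [N, D + 2]).
    n = agent_feats.shape[0]
    return np.concatenate([agent_feats, np.tile(goal_xy, (n, 1))], axis=-1)

def goal_ind(scene_tokens, goal_token):
    # "ind": append the goal as one extra scene token, then add a
    # binary indicator channel marking which token is the goal
    # (shape [N, D] -> [N + 1, D + 1]).
    tokens = np.vstack([scene_tokens, goal_token])
    flag = np.zeros((tokens.shape[0], 1))
    flag[-1, 0] = 1.0
    return np.concatenate([tokens, flag], axis=-1)

feats = np.zeros((3, 8))                    # 3 agents, 8-d features
cat_out = goal_cat(feats, np.array([5.0, -2.0]))
ind_out = goal_ind(feats, np.ones(8))
```

The trade-off this sketch surfaces: "cat" changes the feature width for every agent, while "ind" leaves agent features untouched and lets attention decide how strongly to use the goal token.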
Additional goal-conditioned scenarios — Stop at Red Light; Right Turn at Red; Right Turn at Stop Sign; Left Turn at Stop Sign; Move Forward; Stationary; Right Turn 1; Right Turn 2.
We are happy to share the RLFTSim checkpoints used in our WOSAC submission. To request access, please reach out to [eahmadi at ualberta dot ca]. Due to Waymo's data usage policy, we ask that you provide a screenshot confirming your registration on the My Submissions page of the Waymo Open Dataset. Our checkpoints are built on top of the SMART-tiny architecture. For inference, you can use the codebases from CAT-K or SMART.