RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning

Ehsan Ahmadi1,2, Hunter Schofield2,3, Behzad Khamidehi2, Fazel Arasteh2,
Jinjun Shan3, Lili Mou1,4, Kasra Rezaee2, Dongfeng Bai2
1University of Alberta, 2Huawei Technologies Canada, 3York University
4Canada CIFAR AI Chair, Amii
Accepted at CVPR 2026

Abstract

Supervised open-loop training is widely used for traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that improves scenario realism by aligning simulator rollouts with real-world data distributions, and that provides a method for distilling goal-conditioned controllability into scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism, achieving state-of-the-art performance. Compared with other heuristic-search-based fine-tuning methods, RLFTSim requires significantly fewer samples thanks to a proposed low-variance, dense reward signal, and it addresses the realism-alignment issue directly by design. We also demonstrate the effectiveness of our approach for distilling controllability into traffic simulation through goal conditioning.
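The reward described above, balancing fidelity against controllability, can be sketched as a dense per-step signal. The sketch below is illustrative only: `realism_reward`, `goal_reward`, and the weights are hypothetical stand-ins, not the paper's exact formulation or model.

```python
import numpy as np

def realism_reward(rollout, logged):
    """Dense per-step term: negative distance between the simulated
    rollout and the logged (real-world) trajectory. Shape: (T,)."""
    return -np.linalg.norm(rollout - logged, axis=-1)

def goal_reward(rollout, goal):
    """Dense shaping term encouraging progress toward a goal point."""
    return -np.linalg.norm(rollout - goal[None, :], axis=-1)

def combined_reward(rollout, logged, goal, w_realism=1.0, w_goal=0.1):
    # Balance fidelity (realism) and controllability (goal reaching);
    # the weights here are placeholders, not values from the paper.
    return (w_realism * realism_reward(rollout, logged)
            + w_goal * goal_reward(rollout, goal))

# Toy check: a rollout matching the log scores higher than a noisy one.
T = 10
logged = np.cumsum(np.ones((T, 2)), axis=0)   # straight-line log
good = logged.copy()
bad = logged + np.random.default_rng(0).normal(0.0, 2.0, size=(T, 2))
goal = logged[-1]
print(combined_reward(good, logged, goal).sum()
      > combined_reward(bad, logged, goal).sum())  # → True
```

Because the signal is dense (one value per timestep rather than a single terminal score), a policy-gradient fine-tuner sees lower-variance returns, which is the sample-efficiency argument made in the abstract.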

Qualitative Visualization: Pre-Training vs. Post-Training

Side-by-side comparison of the base model (pre-train) and RLFTSim (post-train) on challenging traffic scenarios.

[Pre-train: collision and off-road behavior | Post-train (RLFTSim): no incident]

Figure 1 — Collision & Off-road: The pre-trained model generates unrealistic off-road behavior and a collision with cross-traffic, while the post-trained model (RLFTSim) produces realistic lane-following behavior that respects traffic rules.

[Pre-train: collision with pedestrian | Post-train (RLFTSim): no collision]

Figure S1 — Collision 1: In the pre-trained model, the vehicle entering the traffic circle fails to yield to the pedestrian and collides with them. In the post-trained model, the vehicle yields to the pedestrian.

[Pre-train: rear-end collision | Post-train (RLFTSim): no collision]

Figure S2 — Collision 2: The pre-trained model produces a rear-end collision between two vehicles at the bottom of the scene. The post-trained model avoids this accident.

[Pre-train: parking exit collision | Post-train (RLFTSim): no collision]

Figure S3 — Collision 3: In the pre-trained model, a parked vehicle pulls out into the road and collides with a passing vehicle. The post-trained model waits for the road to clear before entering.

[Pre-train: cyclist off-road | Post-train (RLFTSim): cyclist stays on road]

Figure S4 — Off-road: The pre-trained model's cyclist goes off-road. The post-trained model (RLFTSim) adheres to the drivable area.

Goal-Conditioned Fine-Tuning (GCFT)

Controllability via goal-conditioning. The goal point is shown as a magenta circle. We compare various goal representations.
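The two goal representations compared here (concatenation and edge-indication) can be sketched as simple tensor operations. This is an illustrative sketch under assumed shapes and names; the paper's actual encoder and how the goal edge enters its attention graph are not reproduced here.

```python
import numpy as np

def concat_goal(agent_feats, goal_xy):
    """'cat' sketch: append the goal coordinates to every agent's
    feature vector, so the goal is part of each agent's input."""
    goal_tiled = np.tile(goal_xy, (agent_feats.shape[0], 1))
    return np.concatenate([agent_feats, goal_tiled], axis=-1)

def indicate_goal(agent_feats, goal_xy):
    """'ind' sketch: represent the goal as an extra scene token the
    model can attend to, leaving agent features unchanged."""
    pad = np.zeros(agent_feats.shape[1] - goal_xy.shape[0])
    goal_token = np.concatenate([goal_xy, pad])
    return np.vstack([agent_feats, goal_token[None, :]])

feats = np.zeros((3, 8))        # toy scene: 3 agents, 8-dim features
goal = np.array([5.0, -2.0])    # goal point (the magenta circle)
print(concat_goal(feats, goal).shape)    # → (3, 10)
print(indicate_goal(feats, goal).shape)  # → (4, 8)
```

The design trade-off sketched here: concatenation changes the per-agent input width, while indication keeps feature dimensions fixed and adds one token to the scene.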

[Columns: U-turn goal | left-turn goal]
[Pre-train (base model): successful U-turn | failed left turn]
[Post-train, concatenation, hard goal (cat-hard): successful U-turn | successful left turn]
[Post-train, indication, hard goal (ind-hard): successful U-turn | successful left turn]

Figure 1 (right) — GCFT Visualization. The goal point is shown with a magenta circle. We show results for concatenation (cat) and edge-indication (ind) goal representations with hard goal targets. The pre-trained model fails at the left turn, while both GCFT variants succeed.

GCFT: Traffic Rules & Parking Scenarios

Traffic Light Compliance

[Stop at red light | Right turn at red]

Stop Sign Behavior

[Right turn at stop sign | Left turn at stop sign]

Parking Maneuvers

[Move forward | Stationary | Right turn 1 | Right turn 2]

Model Checkpoints

We are happy to share the RLFTSim checkpoints used in our WOSAC submission. To request access, please reach out to [eahmadi at ualberta dot ca]. Due to Waymo's data usage policy, we ask that you provide a screenshot confirming your registration on the "My Submissions" page of the Waymo Open Dataset. Our checkpoints are built on top of the SMART-tiny architecture; for inference, you can use the codebases from CAT-K or SMART.