RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning

Ehsan Ahmadi (1,2), Hunter Schofield (2,3), Behzad Khamidehi (2), Fazel Arasteh (2),
Jinjun Shan (3), Lili Mou (1,4), Dongfeng Bai (2), Kasra Rezaee (2)
(1) University of Alberta, (2) Huawei Technologies Canada,
(3) York University, (4) Canada CIFAR AI Chair, Amii
CVPR 2026 Highlight

RLFTSim post-training pipeline. We fine-tune a pre-trained simulator with closed-loop, on-policy RL to match the real-world distribution across kinematic, interactive, and map-based features. Our key contribution is MLOO, a dense, low-variance, per-rollout reward built from a leave-one-out construction over the realism meta-metric, which makes fine-tuning sample-efficient and stable. An optional goal input with a goal-attainment reward further distills controllability without sacrificing realism.
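
To make the leave-one-out idea concrete, here is a minimal sketch; it is not the authors' implementation. It assumes the realism metric is a set-level score computed over all rollouts of a scenario, and it credits each rollout with the change in that score when the rollout is removed. The toy histogram metric, the function names, and the single scalar feature per rollout are all illustrative assumptions; this page does not specify how MLOO is actually computed.

```python
import math
from typing import Callable, List


def histogram_log_likelihood(samples: List[float], logged: float,
                             lo: float = 0.0, hi: float = 10.0,
                             bins: int = 5, smoothing: float = 1.0) -> float:
    """Toy set-level metric (assumption): log-likelihood of the logged value
    under a Laplace-smoothed histogram built from the rollout samples."""
    width = (hi - lo) / bins

    def bin_of(x: float) -> int:
        # Clamp to the valid bin range so out-of-range values still count.
        return min(bins - 1, max(0, int((x - lo) / width)))

    counts = [smoothing] * bins
    for s in samples:
        counts[bin_of(s)] += 1.0
    return math.log(counts[bin_of(logged)] / sum(counts))


def leave_one_out_rewards(rollouts: List[float],
                          metric: Callable[[List[float]], float]) -> List[float]:
    """Per-rollout reward (assumption): score of the full set minus the score
    of the set with that rollout removed."""
    full = metric(rollouts)
    rewards = []
    for i in range(len(rollouts)):
        rest = rollouts[:i] + rollouts[i + 1:]
        rewards.append(full - metric(rest))
    return rewards


# Hypothetical usage: one scalar feature (e.g. a speed) per rollout.
rollouts = [2.1, 2.4, 7.9, 2.2]
rewards = leave_one_out_rewards(
    rollouts, lambda rs: histogram_log_likelihood(rs, logged=2.0))
print(rewards)
```

In this toy setup, rollouts that fall in the same histogram bin as the logged value receive a positive reward and the outlier receives a negative one, so every rollout gets its own learning signal rather than sharing a single set-level score.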

Abstract

Supervised open-loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions and provides a method for distilling goal-conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism, achieving state-of-the-art performance. Compared with other heuristic search-based fine-tuning methods, RLFTSim requires significantly fewer samples due to a proposed low-variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning.

Presentation Video

Qualitative Visualization: Pre-Training vs. Post-Training

Side-by-side comparison of the base model (pre-train) and RLFTSim (post-train) on challenging traffic scenarios.

Pre-train (base model): collision and off-road behavior. Post-train (RLFTSim): no incident.

Figure 1 — Collision & Off-road: The pre-trained model generates unrealistic off-road behavior and a collision with cross-traffic, while the post-trained model (RLFTSim) produces realistic lane-following behavior that respects traffic rules.

Pre-train (base model): collision with pedestrian. Post-train (RLFTSim): no collision.

Figure S1 — Collision 1: In the pre-trained model, the vehicle entering the circle fails to yield to the pedestrian and collides with it. In the post-trained model, the vehicle yields to the pedestrian.

Pre-train (base model): rear-end collision. Post-train (RLFTSim): no collision.

Figure S2 — Collision 2: The pre-trained model produces a rear-end collision between two vehicles at the bottom of the scene. The post-trained model avoids this accident.

Pre-train (base model): parking exit collision. Post-train (RLFTSim): no collision.

Figure S3 — Collision 3: The pre-trained model's parked vehicle attempts to enter the road, colliding with a passing vehicle. The post-trained model waits for the road to clear before entering.

Pre-train (base model): cyclist off-road. Post-train (RLFTSim): cyclist stays on road.

Figure S4 — Off-road: The pre-trained model's cyclist goes off-road. The post-trained model (RLFTSim) adheres to the drivable area.

Goal-Conditioned Fine-Tuning (GCFT)

Controllability via goal-conditioning. The goal point is shown as a magenta circle. We compare various goal representations.

Pre-train (base model). U-turn goal: successful U-turn. Left-turn goal: failed left turn.

Post-train, concatenation, hard goal (GCFT cat-hard). U-turn goal: successful U-turn. Left-turn goal: successful left turn.

Post-train, indication, hard goal (GCFT ind-hard). U-turn goal: successful U-turn. Left-turn goal: successful left turn.

Figure 1 (right) — GCFT Visualization. The goal point is shown with a magenta circle. We show results for concatenation (cat) and edge-indication (ind) goal representations with hard goal targets. The pre-trained model fails at the left turn, while both GCFT variants succeed.
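
As a rough illustration of how a goal-attainment term might be combined with a realism term, the sketch below uses a binary indicator of whether a trajectory passes within a tolerance of the goal point. This is a minimal sketch under stated assumptions: the binary form, the `radius` and `weight` parameters, and the function names are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def goal_attained(trajectory: Sequence[Point], goal: Point,
                  radius: float = 2.0) -> bool:
    """True if any trajectory point lies within `radius` of the goal
    (the threshold value is an illustrative assumption)."""
    return any(math.hypot(x - goal[0], y - goal[1]) <= radius
               for x, y in trajectory)


def combined_reward(realism: float, trajectory: Sequence[Point], goal: Point,
                    weight: float = 0.5, radius: float = 2.0) -> float:
    """Realism term plus a weighted binary goal bonus (hypothetical weighting)."""
    bonus = 1.0 if goal_attained(trajectory, goal, radius) else 0.0
    return realism + weight * bonus


# Hypothetical usage: a straight trajectory and two candidate goal points.
trajectory = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
print(combined_reward(0.7, trajectory, goal=(10.0, 1.0)))
print(combined_reward(0.7, trajectory, goal=(10.0, 5.0)))
```

The weight controls the trade-off: a larger value pushes the agent harder toward the goal, at the possible cost of realism.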

GCFT: Traffic Rules & Parking Scenarios

Goal-conditioning also distills compliance with traffic rules and fine-grained parking maneuvers from a single pre-trained simulator.

Traffic Light Compliance: Stop at Red Light; Right Turn at Red.

Stop Sign Behavior: Right Turn at Stop Sign; Left Turn at Stop Sign.

Parking Maneuvers: Move Forward; Stationary; Right Turn 1; Right Turn 2.

Poster

Download the poster (PDF)


Model Checkpoints

We are happy to share the RLFTSim checkpoints used in our WOSAC submission. To request access, please reach out to [eahmadi at ualberta dot ca]. Due to Waymo's data usage policy, we ask that you provide a screenshot confirming your registration on the My Submissions page of the Waymo Open Dataset. Our checkpoints are built on top of the SMART-tiny architecture. For inference, you can use the codebases from CAT-K or SMART.

BibTeX

@InProceedings{Ahmadi_2026_CVPR,
    author    = {Ahmadi, Ehsan and Schofield, Hunter and Khamidehi, Behzad and Arasteh, Fazel and Shan, Jinjun and Mou, Lili and Bai, Dongfeng and Rezaee, Kasra},
    title     = {RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {39734-39743},
}