I am a Computer Science Ph.D. student at UIUC. My research focuses on autonomous RL post-training for large generative models: making diffusion/flow models and multimodal reasoning LLMs continuously self-improve with progressively less human intervention. Previously, I pushed RL to superhuman performance, breaking 24 Atari human world records and outperforming Agent57 with 500× less data.
CESAR resolves test-time inverse scaling in Audio LLMs by rewarding the reasoning process via GRPO, achieving SOTA on MMAU — outperforming Gemini 2.5 Pro and GPT-4o Audio.
J. Fan*, R. Ren, J. Li, R. Pandey, P.G. Shivakumar, A. Gandhe, G. Liu, Y. Gu, I. Bulyko
CESAR: process-reward RL (GRPO) that resolves test-time inverse scaling in Audio LLMs; without process-level guidance, models produce hallucinatory reasoning, and CESAR fixes that by rewarding the reasoning itself.
🏆 SOTA on MMAU Test-mini · Outperforms Gemini 2.5 Pro & GPT-4o Audio
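To make the process-reward idea concrete, here is a minimal Python sketch of a GRPO-style group advantage that mixes an outcome reward with a reasoning (process) reward. The function name, the 0–1 reward scales, and the fixed mixing weight are illustrative assumptions, not CESAR's actual implementation.

```python
import numpy as np

def grpo_advantages(outcome_rewards, process_rewards, w_process=0.5):
    """Group-relative advantages for responses sampled from the same prompt:
    total reward = outcome correctness + weighted process (reasoning-quality)
    reward, normalized within the group."""
    r = (np.asarray(outcome_rewards, dtype=float)
         + w_process * np.asarray(process_rewards, dtype=float))
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: 4 sampled answers to one audio question.
# Outcome reward: 1 if the final answer is correct, else 0.
# Process reward: a 0..1 score for the quality of the reasoning trace.
adv = grpo_advantages(outcome_rewards=[1, 0, 1, 0],
                      process_rewards=[0.9, 0.2, 0.4, 0.8])
print(adv)  # correct answers backed by sound reasoning get the largest advantage
```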
SP-VLA introduces action-aware model scheduling and spatio-semantic token pruning for VLA model acceleration, achieving 1.5× lossless speedup on LIBERO and 2.4× speedup on SimplerEnv.
Y. Li, Y. Meng, Z. Sun, K. Ji, C. Tang, J. Fan, X. Ma, S.-T. Xia, Z. Wang, W. Zhu
Action-aware model scheduling + spatio-semantic token pruning for VLA acceleration.
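A minimal sketch of saliency-based visual token pruning, to illustrate the general mechanism behind this kind of acceleration; the relevance score (e.g. cross-attention mass from the language/action query) and the keep ratio are illustrative assumptions, not SP-VLA's exact spatio-semantic criterion.

```python
import torch

def prune_visual_tokens(tokens, relevance, keep_ratio=0.5):
    """tokens: (N, D) visual tokens; relevance: (N,) per-token scores.
    Keeps only the top-k tokens so later transformer layers run on a
    shorter sequence."""
    k = max(1, int(keep_ratio * tokens.size(0)))
    idx = torch.topk(relevance, k).indices.sort().values  # keep spatial order of survivors
    return tokens[idx]

# Example: 256 patch tokens of width 1024, half of them pruned.
tokens = torch.randn(256, 1024)
relevance = torch.rand(256)
print(prune_visual_tokens(tokens, relevance).shape)  # torch.Size([128, 1024])
```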
ADRPO introduces sample-level adaptive divergence regularization for RLHF — high-value samples get more freedom, poor samples get stronger constraints. Plug-and-play on top of any RLHF method.
J. Fan*, T. Wei, C. Cheng, Y. Chen, G. Liu
ADRPO: sample-level adaptive divergence regularization — high-value samples get more freedom, poor samples get stronger constraint. Plug-and-play on top of any RLHF method.
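A minimal sketch of the sample-level adaptive constraint idea, assuming a sigmoid schedule that maps each sample's advantage to its own KL coefficient; the specific schedule and coefficient range are illustrative, not the paper's formula.

```python
import torch

def adaptive_kl_weights(advantages, beta_min=0.01, beta_max=1.0):
    """Per-sample KL coefficient: low-advantage (poor) samples are pulled
    harder toward the reference policy, high-advantage samples get more
    freedom to move away from it."""
    gate = torch.sigmoid(-advantages)  # near 1 for poor samples, near 0 for strong ones
    return beta_min + (beta_max - beta_min) * gate

def adrpo_style_loss(policy_loss, per_sample_kl, advantages):
    """Drop-in replacement for a fixed-beta KL-regularized RLHF objective;
    all three arguments are per-sample tensors."""
    beta = adaptive_kl_weights(advantages.detach())
    return (policy_loss + beta * per_sample_kl).mean()

# Example with dummy advantages:
adv = torch.tensor([2.0, -1.5, 0.3])
print(adaptive_kl_weights(adv))  # smallest coefficient for the highest-advantage sample
```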
ORW-CFM-W2 is the first online RLHF method for flow matching — no human data, no likelihood estimation. Wasserstein regularization maintains generation diversity.
J. Fan*, S. Shen, C. Cheng, Y. Chen, C. Liang, G. Liu
ORW-CFM-W2: first online RLHF for flow matching — no human data, no likelihood, no collapse. W2 regularization keeps generation diverse.
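A minimal sketch of online reward-weighted conditional flow matching with a regularizer that keeps the fine-tuned velocity field near a frozen reference; the softmax reward weighting and the velocity-difference penalty as a Wasserstein surrogate are illustrative assumptions, not the paper's exact objective.

```python
import torch

def orw_cfm_w2_loss(v_theta, v_ref, x0, x1, reward, tau=1.0, lam=0.1):
    """x1: the model's own generations (online, no human data); x0: noise of
    the same shape; reward: one scalar per generation from an off-the-shelf
    reward model; v_theta / v_ref are velocity-field networks."""
    t = torch.rand(x1.size(0), 1)                 # one random time per sample
    xt = (1 - t) * x0 + t * x1                    # linear interpolation path
    target_v = x1 - x0                            # conditional flow-matching target
    v = v_theta(xt, t)
    fm = ((v - target_v) ** 2).mean(dim=1)        # per-sample flow-matching loss
    reg = ((v - v_ref(xt, t)) ** 2).mean(dim=1)   # keeps the flow near the frozen reference
    w = torch.softmax(reward / tau, dim=0)        # reward weighting over the online batch
    return (w * (fm + lam * reg)).sum()
```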
PRANCE jointly optimizes token pruning and structural channel pruning for adaptive ViT inference, achieving significant speedup while maintaining accuracy.
Y. Li, C. Tang, Y. Meng, J. Fan, Z. Chai, X. Ma, Z. Wang, W. Zhu · IEEE TPAMI
LBC introduces a learnable hybrid behavior mapping and bandit meta-controller for exploration control in deep RL, breaking 24 Atari human world records with 500× less data than prior SOTA.
J. Fan*, Y. Zhuang, Y. Liu, J. Hao, B. Wang, J. Zhu, H. Wang, S.-T. Xia
LBC: learnable hybrid behavior mapping + bandit meta-controller. Unified framework for exploration control in deep RL.
🏅 Ranked 5/4176 · 10,077% mean human score · 24 world records · 500× data efficiency
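A minimal sketch of the bandit-meta-controller idea: a UCB bandit selects among a discrete family of candidate behavior policies (e.g. different exploration settings) based on realized episodic returns. The UCB rule and the discrete arms are illustrative assumptions, not LBC's learnable hybrid behavior mapping.

```python
import numpy as np

class UCBMetaController:
    """Bandit over a discrete family of candidate behavior policies."""
    def __init__(self, n_arms, c=2.0):
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
        self.c = c

    def select(self):
        untried = np.flatnonzero(self.counts == 0)
        if untried.size:                          # try every behavior at least once
            return int(untried[0])
        t = self.counts.sum()
        ucb = self.values + self.c * np.sqrt(np.log(t) / self.counts)
        return int(np.argmax(ucb))

    def update(self, arm, episodic_return):
        self.counts[arm] += 1
        self.values[arm] += (episodic_return - self.values[arm]) / self.counts[arm]

# Each arm indexes one candidate behavior, e.g. an (epsilon, temperature) pair;
# after every episode, feed the realized return back into the bandit.
controller = UCBMetaController(n_arms=8)
arm = controller.select()
controller.update(arm, episodic_return=1234.0)
```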
GDI shows that optimizing the training data distribution is the key lever for superhuman RL efficiency, and provides a unified framework that subsumes diverse RL algorithms as special cases.
J. Fan*, C. Xiao
GDI: optimizing the data distribution is the key to superhuman RL efficiency. Unified framework for diverse RL algorithms.
📈 Agent57 beaten with 500× less data & 2× avg performance
🕸️ Research Paper Network
Papers are grouped by research theme.
🔬 Research Interests
🌊
RL Post-Training for Generative Models
Collapse-free online RLHF for flow/diffusion models. No human-collected preference data needed — models improve from their own generations (ORW-CFM-W2, ADRPO, AC-Flow).
🧠
Reasoning in Multimodal LLMs
Process-reward RL for audio/visual LLMs — fixing test-time inverse scaling so reasoning actually helps, not hurts (CESAR).
🎮
Superhuman-Level Deep RL
Sample-efficient RL that exceeds human performance. Broke 24 Atari world records with 500× less data than prior SOTA (LBC, GDI).
⚡ Impact at a Glance
Top Venue Papers: ICLR · NeurIPS · ICML · TPAMI
24 Atari World Records broken by LBC (ICLR'23 Oral)
500× More Data-Efficient than Agent57
SOTA on MMAU Audio Reasoning · Beats Gemini 2.5 Pro
Google Scholar Citations
GPA 4.0 — UIUC Ph.D. Computer Science
💡 Research Vision
Making AI Systems That Improve Themselves
Today's AI is frozen after training. I work to change that: AI that never stops getting better, with progressively less human scaffolding.
Step 1 — ICLR 2025
Eliminate human-collected preference data
ORW-CFM-W2: online reward-weighted training lets models improve from their own generations — no paired human data needed.
Step 2 — NeurIPS 2025
Remove manual KL tuning
ADRPO: adaptive divergence control eliminates the need for hand-tuned regularization — each sample gets its own constraint.
Step 3 — ICLR 2026
Reward the reasoning process, not just outcomes
CESAR: process-level rewards resolve test-time inverse scaling in Audio LLMs — reasoning finally helps instead of hurts, achieving SOTA on MMAU.
Step 4 — Ongoing
Fully autonomous self-improvement
The endgame: generative models that continuously improve with progressively less human intervention — from data collection to reward design to training itself.
🏅 Awards & Academic Service
🎖 Selected Awards
National Scholarship ×2, Top 1% — Nankai Univ.
Ranked 1st / 83 in major — Nankai Univ.
Outstanding Graduates (Top 1%) — Nankai Univ.
Tang Lixin Scholarship (Top 1%)
GPA 4.0/4.0 — UIUC Ph.D.
GPA 3.97/4.0, Top 1.3% — Tsinghua M.Eng.
🔍 Reviewer
ICLR 2024–2026
NeurIPS 2022–2025
ICML 2023–2026
CVPR 2026
AAAI 2025 · AISTATS 2025 · KDD 2024
📅 Conference Deadlines
Key AI/ML venue deadlines I track — for the full list see ccfddl.com.
📬 Contact
Happy to discuss research, internships, or collaborations. Best reached by email. 📧 jiajunf3@illinois.edu · 🏛 Siebel Center for CS, UIUC · CV (PDF)