I am a Computer Science Ph.D. student at UIUC. My research focuses on autonomous RL post-training for large generative models: making diffusion/flow models and multimodal reasoning LLMs continuously self-improve with progressively less human intervention. Previously, I pushed RL to superhuman performance, breaking 24 Atari human world records and outperforming Agent57 with 500× less data.
CESAR resolves test-time inverse scaling in Audio LLMs by rewarding the reasoning process via GRPO, achieving SOTA on MMAU — outperforming Gemini 2.5 Pro and GPT-4o Audio.
J. Fan*, R. Ren, J. Li, R. Pandey, P.G. Shivakumar, A. Gandhe, G. Liu, Y. Gu, I. Bulyko
CESAR: process-reward RL (GRPO) that resolves test-time inverse scaling in Audio LLMs; without process-level guidance, models produce hallucinatory reasoning, and CESAR fixes that by rewarding the reasoning itself.
🏆 SOTA on MMAU Test-mini · Outperforms Gemini 2.5 Pro & GPT-4o Audio
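To make the process-reward idea concrete, here is a minimal Python sketch of a GRPO-style group advantage that mixes an outcome reward with a reasoning (process) reward. The function name, the 0–1 reward scales, and the fixed mixing weight are illustrative assumptions, not CESAR's actual implementation.

```python
import numpy as np

def grpo_advantages(outcome_rewards, process_rewards, w_process=0.5):
    """Group-relative advantages for responses sampled from the same prompt:
    total reward = outcome correctness + weighted process (reasoning-quality)
    reward, normalized within the group."""
    r = (np.asarray(outcome_rewards, dtype=float)
         + w_process * np.asarray(process_rewards, dtype=float))
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: 4 sampled answers to one audio question.
# Outcome reward: 1 if the final answer is correct, else 0.
# Process reward: a 0..1 score for the quality of the reasoning trace.
adv = grpo_advantages(outcome_rewards=[1, 0, 1, 0],
                      process_rewards=[0.9, 0.2, 0.4, 0.8])
print(adv)  # correct answers backed by sound reasoning get the largest advantage
```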
SP-VLA introduces action-aware model scheduling and spatio-semantic token pruning for VLA model acceleration, achieving 1.5× lossless speedup on LIBERO and 2.4× speedup on SimplerEnv.
Y. Li, Y. Meng, Z. Sun, K. Ji, C. Tang, J. Fan, X. Ma, S.-T. Xia, Z. Wang, W. Zhu
Action-aware model scheduling + spatio-semantic token pruning for VLA acceleration.
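A minimal sketch of saliency-based visual token pruning, to illustrate the general mechanism behind this kind of acceleration; the relevance score (e.g. cross-attention mass from the language/action query) and the keep ratio are illustrative assumptions, not SP-VLA's exact spatio-semantic criterion.

```python
import torch

def prune_visual_tokens(tokens, relevance, keep_ratio=0.5):
    """tokens: (N, D) visual tokens; relevance: (N,) per-token scores.
    Keeps only the top-k tokens so later transformer layers run on a
    shorter sequence."""
    k = max(1, int(keep_ratio * tokens.size(0)))
    idx = torch.topk(relevance, k).indices.sort().values  # keep spatial order of survivors
    return tokens[idx]

# Example: 256 patch tokens of width 1024, half of them pruned.
tokens = torch.randn(256, 1024)
relevance = torch.rand(256)
print(prune_visual_tokens(tokens, relevance).shape)  # torch.Size([128, 1024])
```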
ADRPO introduces sample-level adaptive divergence regularization for RLHF — high-value samples get more freedom, poor samples get stronger constraints. Plug-and-play on top of any RLHF method.
J. Fan*, T. Wei, C. Cheng, Y. Chen, G. Liu
ADRPO: sample-level adaptive divergence regularization — high-value samples get more freedom, poor samples get stronger constraint. Plug-and-play on top of any RLHF method.
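A minimal sketch of the sample-level adaptive constraint idea, assuming a sigmoid schedule that maps each sample's advantage to its own KL coefficient; the specific schedule and coefficient range are illustrative, not the paper's formula.

```python
import torch

def adaptive_kl_weights(advantages, beta_min=0.01, beta_max=1.0):
    """Per-sample KL coefficient: low-advantage (poor) samples are pulled
    harder toward the reference policy, high-advantage samples get more
    freedom to move away from it."""
    gate = torch.sigmoid(-advantages)  # near 1 for poor samples, near 0 for strong ones
    return beta_min + (beta_max - beta_min) * gate

def adrpo_style_loss(policy_loss, per_sample_kl, advantages):
    """Drop-in replacement for a fixed-beta KL-regularized RLHF objective;
    all three arguments are per-sample tensors."""
    beta = adaptive_kl_weights(advantages.detach())
    return (policy_loss + beta * per_sample_kl).mean()

# Example with dummy advantages:
adv = torch.tensor([2.0, -1.5, 0.3])
print(adaptive_kl_weights(adv))  # smallest coefficient for the highest-advantage sample
```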
ORW-CFM-W2 is the first online RLHF method for flow matching — no human data, no likelihood estimation. Wasserstein regularization maintains generation diversity.
J. Fan*, S. Shen, C. Cheng, Y. Chen, C. Liang, G. Liu
ORW-CFM-W2: first online RLHF for flow matching — no human data, no likelihood, no collapse. W2 regularization keeps generation diverse.
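A minimal sketch of online reward-weighted conditional flow matching with a regularizer that keeps the fine-tuned velocity field near a frozen reference; the softmax reward weighting and the velocity-difference penalty as a Wasserstein surrogate are illustrative assumptions, not the paper's exact objective.

```python
import torch

def orw_cfm_w2_loss(v_theta, v_ref, x0, x1, reward, tau=1.0, lam=0.1):
    """x1: the model's own generations (online, no human data); x0: noise of
    the same shape; reward: one scalar per generation from an off-the-shelf
    reward model; v_theta / v_ref are velocity-field networks."""
    t = torch.rand(x1.size(0), 1)                 # one random time per sample
    xt = (1 - t) * x0 + t * x1                    # linear interpolation path
    target_v = x1 - x0                            # conditional flow-matching target
    v = v_theta(xt, t)
    fm = ((v - target_v) ** 2).mean(dim=1)        # per-sample flow-matching loss
    reg = ((v - v_ref(xt, t)) ** 2).mean(dim=1)   # keeps the flow near the frozen reference
    w = torch.softmax(reward / tau, dim=0)        # reward weighting over the online batch
    return (w * (fm + lam * reg)).sum()
```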
PRANCE jointly optimizes token pruning and structural channel pruning for adaptive ViT inference, achieving significant speedup while maintaining accuracy.
Y. Li, C. Tang, Y. Meng, J. Fan, Z. Chai, X. Ma, Z. Wang, W. Zhu · IEEE TPAMI
LBC introduces a learnable hybrid behavior mapping and bandit meta-controller for exploration control in deep RL, breaking 24 Atari human world records with 500× less data than prior SOTA.
J. Fan*, Y. Zhuang, Y. Liu, J. Hao, B. Wang, J. Zhu, H. Wang, S.-T. Xia
LBC: learnable hybrid behavior mapping + bandit meta-controller. Unified framework for exploration control in deep RL.
🏅 Ranked 5/4176 · 10,077% mean human score · 24 world records · 500× data efficiency
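A minimal sketch of the bandit-meta-controller idea: a UCB bandit selects among a discrete family of candidate behavior policies (e.g. different exploration settings) based on realized episodic returns. The UCB rule and the discrete arms are illustrative assumptions, not LBC's learnable hybrid behavior mapping.

```python
import numpy as np

class UCBMetaController:
    """Bandit over a discrete family of candidate behavior policies."""
    def __init__(self, n_arms, c=2.0):
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
        self.c = c

    def select(self):
        untried = np.flatnonzero(self.counts == 0)
        if untried.size:                          # try every behavior at least once
            return int(untried[0])
        t = self.counts.sum()
        ucb = self.values + self.c * np.sqrt(np.log(t) / self.counts)
        return int(np.argmax(ucb))

    def update(self, arm, episodic_return):
        self.counts[arm] += 1
        self.values[arm] += (episodic_return - self.values[arm]) / self.counts[arm]

# Each arm indexes one candidate behavior, e.g. an (epsilon, temperature) pair;
# after every episode, feed the realized return back into the bandit.
controller = UCBMetaController(n_arms=8)
arm = controller.select()
controller.update(arm, episodic_return=1234.0)
```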
GDI shows that optimizing the training data distribution is the key lever for superhuman RL efficiency, and provides a unified framework that subsumes diverse RL algorithms as special cases.
J. Fan*, C. Xiao
GDI: optimizing the data distribution is the key to superhuman RL efficiency. Unified framework for diverse RL algorithms.
📈 Agent57 beaten with 500× less data & 2× avg performance
🕸️ Research Paper Network
Papers are grouped by research theme.
🔬 Research Interests
🌊
RL Post-Training for Generative Models
Collapse-free online RLHF for flow/diffusion models. No human-collected preference data needed — models improve from their own generations (ORW-CFM-W2, ADRPO, AC-Flow).
🧠
Reasoning in Multimodal LLMs
Process-reward RL for audio/visual LLMs — fixing test-time inverse scaling so reasoning actually helps, not hurts (CESAR).
🎮
Superhuman-Level Deep RL
Sample-efficient RL that exceeds human performance. Broke 24 Atari world records with 500× less data than prior SOTA (LBC, GDI).
⚡ Impact at a Glance
Top Venue Papers: ICLR · NeurIPS · ICML · TPAMI
24 Atari World Records broken by LBC (ICLR'23 Oral)
500× More Data-Efficient than Agent57
SOTA on MMAU Audio Reasoning · Beats Gemini 2.5 Pro
Google Scholar Citations
GPA 4.0 — UIUC Ph.D. Computer Science
💡 Research Vision
Making AI Systems That Improve Themselves
Today's AI is frozen after training. I work to change that: AI that never stops getting better, with progressively less human scaffolding.
Step 1 — ICLR 2025
Eliminate human-collected preference data
ORW-CFM-W2: online reward-weighted training lets models improve from their own generations — no paired human data needed.
Step 2 — NeurIPS 2025
Remove manual KL tuning
ADRPO: adaptive divergence control eliminates the need for hand-tuned regularization — each sample gets its own constraint.
Step 3 — ICLR 2026
Reward the reasoning process, not just outcomes
CESAR: process-level rewards resolve test-time inverse scaling in Audio LLMs — reasoning finally helps instead of hurts, achieving SOTA on MMAU.
Step 4 — Ongoing
Fully autonomous self-improvement
The endgame: generative models that continuously improve with progressively less human intervention — from data collection to reward design to training itself.
🏅 Awards & Academic Service
🎖 Selected Awards
National Scholarship ×2, Top 1% — Nankai Univ.
Ranked 1st / 83 in major — Nankai Univ.
Outstanding Graduates (Top 1%) — Nankai Univ.
Tang Lixin Scholarship (Top 1%)
GPA 4.0/4.0 — UIUC Ph.D.
GPA 3.97/4.0, Top 1.3% — Tsinghua M.Eng.
🔍 Reviewer
ICLR 2024–2026
NeurIPS 2022–2025
ICML 2023–2026
CVPR 2026
AAAI 2025 · AISTATS 2025 · KDD 2024
📅 Conference Deadlines
Key AI/ML venue deadlines I track — for the full list see ccfddl.com.
📬 Contact
Happy to discuss research, internships, or collaborations. Best reached by email. 📧 jiajunf3@illinois.edu · 🏛 Siebel Center for CS, UIUC · CV (PDF)