About Me
I am a fourth-year undergraduate student in the Department of Computer Science and Technology at Tsinghua University, Beijing, China. I am also an incoming Ph.D. student of Prof. Minlie Huang @ Conversational AI Group starting from 2025 Fall. I am currently a research intern at A*STAR's Centre for Frontier AI Research (CFAR), under the supervision of Prof. Yew-Soon Ong. My research interests lie in LLM safety and trustworthiness, and I have recently been working on the mechanisms of jailbreaking attacks and defenses, as well as hallucination and the knowledge boundaries of LRMs.
News
- 🎉 Our paper "Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints" has been accepted to ACL 2025 Main! Please refer to our code and paper for more details.
- 🎉 Our paper "BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs" has been released.
Publications
Conference Papers
- Yang, J.*, Zhang, Z.*, Cui, S., Wang, H., & Huang, M. (2025). Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints. ACL 2025 (Long Paper). link
- Zhang, Z.*, Yang, J.*, Ke, P., Mi, F., Wang, H., & Huang, M. (2024). Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization. ACL 2024 (Long Paper). link
Preprints
- Yang, J., Tu, J., Liu, H., Wang, X., Zheng, C., Zhang, Z., … & Huang, M. (2025). BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs. link
- Zhang, Z., Sun, Y., Yang, J., Cui, S., Wang, H., & Huang, M. (2025). Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen! link
- Zhang, Z., Loye, X. Q., Huang, V. S. J., Yang, J., Zhu, Q., Cui, S., … & Huang, M. (2025). How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study. link
- Zhang, Z.*, Lei, L.*, Yang, J.*, … , & Huang, M. (2025). AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement. link
- Zhang, Z.*, Yang, J.*, Ke, P., Cui, S., Zheng, C., Wang, H., & Huang, M. (2024). Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks. link
- Zhang, Z., Cui, S., Lu, Y., Zhou, J., Yang, J., Wang, H., & Huang, M. (2024). Agent-SafetyBench: Evaluating the Safety of LLM Agents. link
- Jia, X., … , Yang, J., … , & Zhao, Z. (2024). Global Challenge for Safe and Secure LLMs Track 1. link
Resources
Teaching
I was a TA for the following undergraduate courses:
- Artificial Neural Network (2024 Fall)
- Linear Algebra (2024 Fall)
Honors and Awards
- Excellent Graduate, Tsinghua University, 2025
- 3rd Prize Winner of the Global Challenge for Safe and Secure LLMs (Track 1)
- Academic Excellence in Research Award of Tsinghua University, 2023.09-2024.09
- Meritorious Winner of the Mathematical Contest in Modeling (Certificate of Achievement), 2023
- Comprehensive Scholarship of Tsinghua University, 2022.09-2023.09
- Comprehensive Scholarship of Tsinghua University, 2021.09-2022.09
Education
- 2021.09-present, Tsinghua University, Beijing, China. Undergraduate Student.
- 2018.09-2021.06, Urumqi No.1 Senior High School, Xinjiang, China. High School Student.