Computer Science · University of Illinois Urbana-Champaign
Studying how LLMs behave, fail, and deceive — from emergent alignment failures to the science of what makes safety fine-tuning work or break. Advised by Prof. Gökhan Tür and Prof. Dilek Hakkani-Tür.
I'm a first-year PhD student in Computer Science at UIUC, advised by Prof. Gökhan Tür and Prof. Dilek Hakkani-Tür.
My research sits at the intersection of AI safety, alignment, and the behavioral science of LLMs. What drew me in was the Alignment Faking paper — the idea that a model could strategically deceive its evaluators is not just unsettling; it's a deeply scientific question. It pushed me to ask: how do these behaviors emerge, and how do we detect them before they matter?
I think about LLMs as model organisms. As in biology, we can study simplified systems to understand general principles of how intelligence, behavior, and alignment emerge from training. I'm particularly drawn to how narrow fine-tuning causes emergent misalignment — a phenomenon that raises critical questions about the robustness of safety training.
I'm also deeply interested in chain-of-thought monitoring — the idea that a model's reasoning trace is a window into its actual intentions. If we can learn to read it reliably, we gain a real-time signal for detecting deceptive or misaligned behavior as it surfaces, not just in static benchmarks. Safety must be studied as something that evolves over time and across interactions.
How does fine-tuning on a narrow objective lead to broad, unexpected behavioral shifts? I study how safety-relevant behaviors emerge or collapse under training pressure — treating this as a systematic, mechanistic question rather than an evaluation checklist item.
A model's reasoning trace is a potential window into its true intentions. I'm interested in when CoT is faithful vs. when it diverges from the model's actual computation — and how to leverage this for real-time safety monitoring as capabilities evolve.
I treat LLMs as systems with measurable, reproducible behavioral patterns. How do models form personas? How predictable are their failure modes? Can we find the systematic regularities that underlie emergent capabilities — and use them to design safer training pipelines?
Static benchmarks miss how safety degrades during deployment. I develop evaluation frameworks that probe alignment not just at training time, but across contexts, interactions, and capability levels — including studying LLM-as-Judge biases that corrupt automated oversight.
Emergent Unfaithfulness: How Alignment Training Causes Language Models to Silently Override Task Faithfulness
We show that alignment training can paradoxically cause models to silently override instructions. We find that this behavior emerges as models become more capable—a form of reverse scaling law, where the better the model, the less faithful it becomes.
EiCAP: Benchmarking and Enhancing Emotional Intelligence in LLMs through Psychologically Grounded Multi-Turn Dialogue
A psychologically grounded benchmark for evaluating emotional intelligence in LLMs through multi-turn dialogue — probing how models recognize, reason about, and respond to emotional cues across extended interactions.
I Am Aligned, But With Whom? Diagnosing Structural Alignment Failures in LLMs
We diagnose structural alignment failures — cases where a model claims alignment but exhibits systematic divergence from intended values depending on who is asking, what language is used, or how the prompt is framed.
Prior Beliefs Prejudice LLM-as-Judge: Evidence from Persuasion Evaluation
LLM-as-Judge frameworks are systematically biased by prior beliefs — using persuasion evaluation as a testbed, we demonstrate that what a model already "believes" shapes its assessment of argument quality, raising serious concerns for scalable oversight pipelines.
Translate With Care: Addressing Gender Bias, Neutrality, and Reasoning in LLM Translations
The TWC dataset (3,950 scenarios) tests translation between gendered and genderless languages. All tested models default to masculine pronouns in professional contexts; fine-tuning on TWC significantly reduces these biases, yielding models that outperform proprietary LLMs.
WSC+: Enhancing The Winograd Schema Challenge Using Tree-of-Experts
LLMs are good at answering Winograd Schema questions — but can they write them? We introduce Tree-of-Experts (ToE), a prompting method that pushes valid WSC generation from 10% to 50%, and use it to build WSC+ — 3,026 new questions including novel ambiguous and offensive categories. GPT-4 tops the leaderboard at 68.7% — still far behind humans at 95.1%. And surprisingly, models aren't better at evaluating questions they wrote themselves.
Detecting Bias and Enhancing Diagnostic Accuracy in Large Language Models for Healthcare
We introduce BiasMD (6,007 Q&A pairs) and DiseaseMatcher (32,000 clinical Q&As spanning 700 diseases), and use them to build EthiClinician — a fine-tuned model that surpasses GPT-4 in ethical reasoning and clinical judgment.
Generative AI for Character Animation: A Comprehensive Survey
A comprehensive survey of generative AI applied to character animation — facial animation, motion synthesis, datasets, trends, open challenges, and future directions.
Always open to conversations about AI safety, alignment research, and potential collaborations. Reach out anytime.
zahraei2@illinois.edu