PhD Student · UIUC · NLP & AI Safety

Pardis Sadat Zahraei

Computer Science · University of Illinois Urbana-Champaign

Studying how LLMs behave, fail, and deceive — from emergent alignment failures to the science of what makes safety fine-tuning work or break. Advised by Prof. Gökhan Tür and Prof. Dilek Hakkani-Tür.

🔬 AI Safety Research
📊 CoT Monitoring
Current Focus
Emergent Alignment Failures
How alignment breaks in unexpected ways
Chain-of-Thought Monitoring
Reading LLMs from the inside out
LLM Behavioral Science
Systematic patterns from training
Model Organisms of Misalignment
Emergent behaviors under fine-tuning

On the science
of AI alignment

I'm a first-year PhD student in Computer Science at UIUC, advised by Prof. Gökhan Tür and Prof. Dilek Hakkani-Tür.

My research sits at the intersection of AI safety, alignment, and the behavioral science of LLMs. What drew me in was the Alignment Faking paper — the idea that a model could strategically deceive evaluators is not just unsettling, it's a deeply scientific question. It pushed me to ask: how do these behaviors emerge, and how do we detect them before they matter?

I think about LLMs as model organisms. Like in biology, we can study simplified systems to understand general principles of how intelligence, behavior, and alignment emerge from training. I'm particularly drawn to how narrow fine-tuning causes emergent misalignment — a phenomenon that raises critical questions about the robustness of safety training.

I'm also deeply interested in chain-of-thought monitoring — the idea that a model's reasoning trace is a window into its actual intentions. If we can learn to read it reliably, we gain a real-time signal for detecting deceptive or misaligned behavior as it surfaces, not just in static benchmarks. Safety must be studied as something that evolves over time and interaction.

What I work on

01

Emergent Alignment Failures

How does fine-tuning on a narrow objective lead to broad, unexpected behavioral shifts? I study how safety-relevant behaviors emerge or collapse under training pressure — treating this as a systematic, mechanistic question rather than an evaluation checklist item.

Model OrganismsFine-tuning DynamicsBehavioral Shifts
02

Chain-of-Thought Monitoring

A model's reasoning trace is a potential window into its true intentions. I'm interested in when CoT is faithful vs. when it diverges from the model's actual computation — and how to leverage this for real-time safety monitoring as capabilities evolve.

CoT FaithfulnessInterpretabilitySafety Monitoring
03

LLM Behavioral Science

I treat LLMs as systems with measurable, reproducible behavioral patterns. How do models form personas? How predictable are their failure modes? Can we find the systematic regularities that underlie emergent capabilities — and use them to design safer training pipelines?

Persona FormationBehavioral RegularitiesEmergent Capabilities
04

Scalable Oversight & Evaluation

Static benchmarks miss how safety degrades over deployment. I develop evaluation frameworks that probe alignment not just at training time, but across contexts, interactions, and capability levels — including studying LLM-as-Judge biases that corrupt automated oversight.

Dynamic EvaluationLLM-as-JudgeRLHF Methods

Research Papers

▸ Under Review · 2026
AIU figure
Under Review · COLM 2026

Emergent Unfaithfulness: How Alignment Training Causes Language Models to Silently Override Task Faithfulness

Pardis Sadat Zahraei, Janvijay Singh, Gokhan Tur, Dilek Hakkani-Tür

We show that alignment training can paradoxically cause models to silently override instructions. We find that this behavior emerges as models become more capable—a form of reverse scaling law, where the better the model, the less faithful it becomes.

EICAP figure
Under Review · COLM 2026

EiCAP: Benchmarking and Enhancing Emotional Intelligence in LLMs through Psychologically Grounded Multi-Turn Dialogue

Nizi Nazar, Pardis Sadat Zahraei, Dilek Hakkani-Tür, Natasa Milic-Frayling, Ehsaneddin Asgari

A psychologically grounded benchmark for evaluating emotional intelligence in LLMs through multi-turn dialogue — probing how models recognize, reason about, and respond to emotional cues across extended interactions.

MENA figure
Under Review · EMNLP 2026

I Am Aligned, But With Whom? Diagnosing Structural Alignment Failures in LLMs

Pardis Sadat Zahraei, Ehsaneddin Asgari

We diagnose structural alignment failures — cases where a model claims alignment but exhibits systematic divergence from intended values depending on who is asking, what language is used, or how the prompt is framed.

▸ ACL 2026
pb figure
Findings · ACL 2026

Prior Beliefs Prejudice LLM-as-Judge: Evidence from Persuasion Evaluation

Pardis Sadat Zahraei, Xiaoning Wang, Beyza Bozdag, Gokhan Tur, Dilek Hakkani-Tür

LLM-as-Judge frameworks are systematically biased by prior beliefs — using persuasion evaluation as a testbed, we demonstrate that what a model already "believes" shapes its assessment of argument quality, raising serious concerns for scalable oversight pipelines.

▸ ACL 2025
twc figure
Findings · ACL 2025 · Lightning Talk @ GeBNLP

Translate With Care: Addressing Gender Bias, Neutrality, and Reasoning in LLM Translations

Pardis Sadat Zahraei, Ali Emami

The TWC dataset (3,950 scenarios) tests translation between gendered and genderless languages. All tested models default to masculine pronouns in professional contexts; fine-tuning on TWC significantly reduces these biases, outperforming proprietary LLMs.

▸ EMNLP & EACL 2024
TQ figure
Findings · EMNLP 2024

TuringQ: Benchmarking AI Comprehension in Theory of Computation

Pardis Sadat Zahraei, Ehsaneddin Asgari

The first benchmark for LLM reasoning in theoretical computer science. Fine-tuning Llama3-8B on TuringQ improves both theoretical reasoning and performance on related tasks like algebra.

WSC figure
EACL 2024 · Oral Presentation

WSC+: Enhancing The Winograd Schema Challenge Using Tree-of-Experts

Pardis Sadat Zahraei, Ali Emami

LLMs are good at answering Winograd Schema questions — but can they write them? We introduce Tree-of-Experts (ToE), a prompting method that pushes valid WSC generation from 10% to 50%, and use it to build WSC+ — 3,026 new questions including novel ambiguous and offensive categories. GPT-4 tops the leaderboard at 68.7% — still far behind humans at 95.1%. And surprisingly, models aren't better at evaluating questions they wrote themselves.

▸ Other Work
BIAS figure
Preprint

Detecting Bias and Enhancing Diagnostic Accuracy in Large Language Models for Healthcare

Pardis Sadat Zahraei, Zahra Shakeri

BiasMD (6,007 Q&A pairs) and DiseaseMatcher (32,000 clinical Q&As, 700 diseases) enable EthiClinician — a fine-tuned model surpassing GPT-4 in ethical reasoning and clinical judgment.

BIAS figure
Preprint · Survey

Generative AI for Character Animation: A Comprehensive Survey

Mohammad Mahdi Abootorabi, Omid Ghahroodi, Pardis Sadat Zahraei, et al.

A comprehensive survey of generative AI applied to character animation — facial animation, motion synthesis, datasets, trends, open challenges, and future directions.

Contact

Let's talk

Always open to conversations about AI safety, alignment research, and potential collaborations. Reach out anytime.

zahraei2@illinois.edu