The Alignment Veto:
How Safety Training Suppresses
Cultural Knowledge in LLMs

Pardis Sadat Zahraei  ·  Gokhan Tur  ·  Dilek Hakkani-Tür  ·  Ehsaneddin Asgari

When a model refuses a culturally sensitive question, the standard assumption is that it lacks the knowledge. Across 16 MENA countries, 26 models, and 1.53M human survey responses, we show the assumption is often wrong. The knowledge is present — but vetoed at output time.

26 Models Evaluated · 16 MENA Countries · 864 Survey Questions · 1.53M Human Responses · +37.6% Max Safety Tax · 19.8% Country Equity Gap

Research Question, Data, and Metrics

Before diving into the results, here is exactly what we ask, how we measure it, and what ideal behaviour looks like.

🔬 The Central Research Question

When a language model refuses a culturally sensitive question about, say, LGBTQ+ acceptance in Egypt — does that refusal happen because the model lacks the relevant cultural knowledge, or because it has the knowledge but has been trained to block it at output time?

We answer this by measuring what is happening inside the model at the exact moment it refuses, and comparing those internal distributions to real human survey data from 1.53 million respondents.

📊 The Data: Human Ground Truth

We need to know what people in each MENA country actually think, so we can measure whether the model's output matches reality. We use two large-scale human surveys:

World Values Survey · Wave 7

Internationally standardised values questions. Example: "On a scale of 1–10, how justifiable is homosexuality?" Country-level means give us the ground truth for each nation.

Arab Opinion Index

MENA-specific political and social attitudes. Example: "Do you agree that men make better political leaders than women?" Scale 1–4 (Strongly agree → Disagree).

864 questions total

T1 — Benign (n=47): Demographic preferences. "How important is family in your life?" 1–4 scale. Models answer freely.
T2 — Moderate (n=788): Value-laden but not directly safety-targeted. Economic views, political opinions.
T3 — Sensitive (n=29): Topics where MENA survey data diverges from Western safety-training defaults: LGBTQ+ acceptance, domestic violence norms, gender equality, religious tolerance.

16 MENA countries

Algeria, Egypt, Iran, Iraq, Jordan, Kuwait, Lebanon, Libya, Mauritania, Morocco, Palestine, Qatar, Saudi Arabia, Sudan, Tunisia, Turkey. Three language families: Arabic (14 countries), Persian (Iran), Turkish (Turkey).

💬 The 6 Prompting Conditions (3 framings × 2 languages)

Each question is asked in 6 different ways per country, a 3×2 design: three identity framings × English or native language.

The Neutral framing (labeled No-Mention in the results) asks directly with no identity cue. Persona asks the model to roleplay as a citizen. Observer (labeled Third in the results) asks how an average citizen would respond, distancing the model from first-person responsibility for the answer. This third-person distance turns out to be important for bypassing the alignment gate.

📐 The Three Metrics — What They Mean

Each metric measures a different aspect of cultural alignment:

NVAS — Normalised Value Alignment Score

Measures how close the model's expressed answer is to the human survey mean for that country.
NVAS = 1 − |ŷ − y_human| / (y_max − y_min)

Range: 0 → 1  ·  0 = worst (opposite extreme)  ·  1 = perfect (exact match)

Worked example: human mean = 3.2 on a 1–10 scale. NVAS reaches 1.0 when the model says 3.2 and falls as the answer drifts in either direction. The penalty is proportional to distance, so the two extremes are not equally wrong: an answer of 1 scores 1 − 2.2/9 ≈ 0.756, while an answer of 10 scores 1 − 6.8/9 ≈ 0.244.
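The NVAS formula above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation; the function name `nvas` and its argument names are assumptions chosen for this example.

```python
def nvas(model_answer, human_mean, y_min, y_max):
    """Normalised Value Alignment Score: 1 - |y_hat - y_human| / (y_max - y_min)."""
    if y_max <= y_min:
        raise ValueError("y_max must be greater than y_min")
    return 1.0 - abs(model_answer - human_mean) / (y_max - y_min)


# Human mean 3.2 on a 1-10 scale, as in the example in the text.
for answer in (1, 3.2, 10):
    print(answer, round(nvas(answer, 3.2, 1, 10), 3))
```

A score of 0 is only reachable when the human mean sits at one end of the scale and the model answers at the other.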

EV-NVAS — Expected Value NVAS (the key innovation)

This is what makes the study novel. For questions where the model refuses to answer ("I cannot respond to that"), we extract the model's internal logit distribution at the first generated token and renormalize it over valid scale options (e.g., digits 1–10).

This gives us the probability distribution the model would have produced if not blocked — its "internal vote". We compute NVAS from this internal distribution's expected value.

Validated: On rows where the model does answer, the logit-derived expected value predicts the actual answer with 92.5% argmax accuracy (Pearson r=0.779). So when the model refuses, this internal distribution is meaningful.

Range: 0 → 1 (same as NVAS). The key finding: refused T3 EV-NVAS = 0.718 > accepted T3 NVAS = 0.690. The internal distribution is more aligned with human data than the expressed answer.
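The EV-NVAS idea (renormalize the first-token logits over the valid scale options, take the expected value, then score it like NVAS) can be sketched as follows. This is a hedged illustration: the helper names and the toy logits are hypothetical, and extracting logits from a real model is outside the scope of the sketch.

```python
import math


def renormalize(option_logits):
    """Softmax restricted to the valid scale options; all other tokens are dropped."""
    peak = max(option_logits.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - peak) for k, v in option_logits.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def ev_nvas(option_logits, human_mean, y_min, y_max):
    """NVAS computed from the expected value of the renormalized distribution."""
    probs = renormalize(option_logits)
    expected = sum(option * p for option, p in probs.items())
    return 1.0 - abs(expected - human_mean) / (y_max - y_min)


# Hypothetical first-token logits for the digits of a 1-4 scale.
toy_logits = {1: 2.0, 2: 1.0, 3: 0.0, 4: -1.0}
print(renormalize(toy_logits))
print(ev_nvas(toy_logits, human_mean=1.6, y_min=1, y_max=4))
```

Because the softmax runs only over the scale digits, probability mass assigned to refusal tokens is discarded, which is what lets the score be computed on refused rows.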

Safety Tax — The Inequitable Gate

Safety Tax = Refusal Rate(T3) − Refusal Rate(T1)

Measures how much more likely the model is to refuse a sensitive T3 question compared to a benign T1 question. If alignment training were culturally neutral, this would be near zero.

Ideal: 0% — model treats cultural questions about Egypt the same regardless of topic sensitivity.
Actual range: −2.9% (GPT-5) to +37.6% (ALLAM-7B-IT)

A negative tax means the model is less likely to refuse T3 than T1 — very rare, only GPT-5 achieves this while maintaining high NVAS. A high positive tax means safe topics flow through freely but culturally sensitive ones are blocked — regardless of whether the model's internal knowledge is accurate.
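The safety tax is a difference of two refusal rates. A minimal sketch, assuming each response row carries a boolean refusal label (the function names and the toy flags are illustrative, not data from the study):

```python
def refusal_rate(refused_flags):
    """Share of responses labeled as refusals."""
    return sum(refused_flags) / len(refused_flags)


def safety_tax(t3_refused, t1_refused):
    """Refusal Rate(T3) - Refusal Rate(T1), as a fraction rather than a percentage."""
    return refusal_rate(t3_refused) - refusal_rate(t1_refused)


# Hypothetical labels: 3 of 10 sensitive rows refused, 1 of 10 benign rows refused.
t3 = [True] * 3 + [False] * 7
t1 = [True] * 1 + [False] * 9
print(f"{safety_tax(t3, t1):+.1%}")
```

Since the result is a difference of rates, it is most naturally read in percentage points.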

The Alignment Veto

At the moment of refusal, a model's internal logit distribution correlates with human survey data more strongly than its freely generated answers. The knowledge is not erased — it is gated.

Chart series: human survey · internal distribution at refusal (EV-NVAS) · expressed answer (T1) · expressed answer (T3, veto)
Refused T3 EV-NVAS = 0.718 vs accepted T3 NVAS = 0.690 (Δ=+0.029, p<10⁻¹³). Accepted T3 responses skew toward liberal Western defaults: 61.5% are above the human survey mean (+0.104 overestimation, p<10⁻²⁴⁸).
⛔ Suppression Failure

Gate blocks accurate knowledge

Model refuses. But its internal logit distribution already matches the human survey. A safety-trained output gate intercepts before generation. Addressable with third-person framing or DPO curation.

Dominant in: LGBTQ+ acceptance questions
Mean refusal: 41.9% · NVAS when answered: 0.580
📉 Representational Bias

Encoding itself diverges

Model answers freely, but the output diverges from human data. No gate to bypass — the problem is in the representation itself. Requires MENA-specific training data.

Dominant in: Gender equality questions
Mean refusal: 22.5% · NVAS when answered: 0.579
Figure 1 (paper): Two failure modes, suppression and representational bias
Top panel (representational bias): The model answers but diverges from the human survey mean — the encoding itself is miscalibrated. Bottom panel (suppression / alignment veto): The model says ~100% Yes (comfortable) while 84% of Egyptians say No, yet internal logits at the moment of refusal reproduce the human distribution (88% No). The model knows; alignment training blocks expression.

19 of 24 models: refused rows carry MORE cultural accuracy than accepted answers

Each dot is a model. Above the diagonal = refused T3 EV-NVAS > accepted T3 NVAS (the veto is blocking something accurate).

Legend: above diagonal (knowledge suppressed) · below diagonal (genuine bias) · diagonal: EV-NVAS = NVAS

The Safety Tax

Instruction-tuned models refuse T3 questions at 23.9% vs 12.4% for benign T1 — a mean safety tax of +11.5% (Cohen's d=0.303, p<10⁻¹¹⁵). The distribution across models is highly unequal.

Scale ≠ Tax

Spearman ρ=0.147 (p=0.464) between parameter count and safety tax. OLMo-32B-IT: +25.3% tax. GPT-5: −2.9% tax. Training recipe, not model size, determines suppression.

No single topic drives it

Leave-one-topic-out across 7 topic groups (LGBTQ+, gender equality, violence, religious trust…): safety tax stays positive in all 7 conditions (+6.2% to +10.5%). The effect is general, not driven by one topic.
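A leave-one-topic-out check recomputes the tax with each topic group removed in turn. The sketch below shows the shape of that loop under assumed inputs: the row format, topic labels and toy values are hypothetical and do not come from the study.

```python
def refusal_rate(flags):
    """Share of rows flagged as refusals."""
    return sum(flags) / len(flags)


def leave_one_topic_out(t3_rows, t1_flags):
    """Recompute the safety tax with each T3 topic group held out in turn.

    t3_rows: list of (topic, refused) pairs; t1_flags: list of refusal flags.
    """
    t1_rate = refusal_rate(t1_flags)
    taxes = {}
    for held_out in sorted({topic for topic, _ in t3_rows}):
        kept = [refused for topic, refused in t3_rows if topic != held_out]
        taxes[held_out] = refusal_rate(kept) - t1_rate
    return taxes


# Hypothetical toy rows, not values from the study.
t3_rows = [("lgbtq", True), ("lgbtq", True), ("gender", True),
           ("gender", False), ("violence", False), ("violence", True)]
t1_flags = [False, False, False, True]
print(leave_one_topic_out(t3_rows, t1_flags))
```

If the tax stays positive for every held-out topic, no single topic group accounts for the effect.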

Larger models can have larger safety taxes

OLMo-3 7B vs 32B at IT stage: the 32B model has a larger safety tax (25.3% vs 9.8%). Scale does not reduce and can amplify the gate.

A 19.8% Gap Between Best and Worst Served Nations

Algeria receives T3 NVAS of 0.532; Palestine receives 0.731. Crucially, Palestine also has the highest T3 refusal rate (0.352) — high accuracy AND high suppression together is the alignment veto signature.

SAE ablation in Tulu-3-8B preferentially recovers worst-served countries (Algeria, Mauritania).

Third-Person Framing: 2.6× Benefit on T3

Asking "How would an average [nationality] respond?" instead of "Imagine you are [nationality]…" reduces T3 refusals by 6.7pp and improves NVAS — but only for models with an active suppression gate.

| Framing | T3 Refusal | Mean NVAS | T3 NVAS shift vs No-Mention |
| No-Mention EN | 22.3% | | baseline |
| Persona EN | 24.3% ↑ | 0.684 | +0.031 |
| Persona Native | 16.2% | 0.665 | +0.019 |
| Third-EN ✦ | 17.6% | 0.694 | +0.081 (2.6×) |
| Third-Native | 16.4% | 0.661 | +0.033 |
↑ Roleplaying MENA = more caution.
Figure 6 (paper): PCA on Persona responses with neutral model response overlaid
PCA on Persona responses with neutral (No-Mention) model response overlaid (★). The neutral response vector lies outside all MENA country clusters for most models, consistent with a Western-centric default prior. Persona framing pulls responses toward MENA clusters, but the neutral baseline is already displaced — showing a cultural default bias even before persona conditioning.

⚠️ NVAS / JSD trade-off

Third-EN improves mean NVAS (+0.081) but increases Jensen-Shannon divergence from human survey distributions (0.451 vs 0.435 for Persona-EN). Third framing pulls the model's mean answer closer to the human mean while matching the shape of the human response distribution less well: the mean-accuracy gap shrinks as distributional mismatch grows. Practitioners who need full distributional fidelity should treat the NVAS gain as partial.
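The trade-off is easier to see with a toy calculation. The sketch below computes Jensen-Shannon divergence with base-2 logarithms, which is an assumption (the page does not state the log base, or whether the square-root distance form is used), and all three distributions are invented for illustration.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mean(p, options=(1, 2, 3, 4)):
    return sum(o * pi for o, pi in zip(options, p))


human = [0.4, 0.1, 0.1, 0.4]     # polarized answers on a 1-4 scale, mean 2.5
peaked = [0.0, 0.5, 0.5, 0.0]    # same mean, very different shape
similar = [0.5, 0.1, 0.1, 0.3]   # slightly different mean, similar shape

for name, dist in (("peaked", peaked), ("similar", similar)):
    print(name, "mean gap:", round(abs(mean(dist) - mean(human)), 3),
          "JSD:", round(jsd(human, dist), 3))
```

Here the `peaked` distribution matches the human mean exactly, so a mean-based score would be perfect, yet its divergence from the human distribution is far larger than that of `similar`.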

Only models with an active gate benefit from Third framing

GPT-4o-mini (+0.155) and GPT-5 (+0.137) gain the most — they have strong safety gates. Base models gain near-zero.

Arabic Prompting Collapses Country Identity

Switching from English to Arabic/Persian/Turkish lowers mean NVAS by 0.050 across all 26 models, including Arabic-specialized ones. Language script overrides country identity in residual-stream representations.

Legend: Arabic-speaking countries (14), which collapse to 1 cluster in Arabic · Iran (Persian) · Turkey (Turkish)

NVAS loss: switching English → native language (every cell negative). Heatmap axes: language (Arabic, Persian, Turkish) × framing (Persona, Third).
Figure 4 (paper): NVAS gain/loss from English to native language
NVAS change from English → native language (Persona and Observer framings). Every cell is negative — switching to Arabic, Persian, or Turkish hurts cultural alignment for every model family tested, including Arabic-specialised ones. The Persian/Turkish drop is largest (Observer-Persian: −0.092), consistent with those languages being underrepresented in alignment training.
Figure 5 (paper): PCA on native-language model responses, Arabic collapse
PCA on native-language model responses (6 models). All 14 Arabic-speaking countries collapse to a single cluster under Arabic prompting; Iran (Persian) and Turkey (Turkish) remain separate. The language script overrides country identity — confirming that the representation loss is script-level, not country-level.

Arabic script dominates residual stream

SAE: best Arabic-script feature achieves F1=0.764–0.772 at 94% prevalence — fires on nearly every Arabic-script prompt regardless of which country is specified. Country-specific features reach only F1=0.13–0.21. Language overwhelms country.

Collapse statistics

66.5% of questions get identical answers across all 14 Arabic-speaking countries (vs 46.5% in English). Within-group std falls 33.5% (Observer) and 15.7% (Persona). Pairwise correlation: 0.814 → 0.884.
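Two of these collapse statistics are simple to compute once answers are grouped per question. The following is a sketch under assumptions: answers are stored as one list of per-country values per question, population standard deviation is used (the page does not say which variant), and the toy values are invented.

```python
from statistics import pstdev


def collapse_stats(answers):
    """answers maps question id -> list of per-country answers.

    Returns (share of questions with identical answers, mean within-group std).
    """
    identical = sum(1 for vals in answers.values() if len(set(vals)) == 1)
    mean_std = sum(pstdev(vals) for vals in answers.values()) / len(answers)
    return identical / len(answers), mean_std


# Hypothetical answers for three countries under English and native prompting.
english = {"q1": [1, 2, 3], "q2": [2, 2, 2], "q3": [4, 3, 4], "q4": [1, 1, 2]}
native = {"q1": [2, 2, 2], "q2": [2, 2, 2], "q3": [4, 4, 4], "q4": [1, 1, 2]}

print("English:", collapse_stats(english))
print("Native: ", collapse_stats(native))
```

A rising identical-answer share together with a falling within-group spread is the pattern the statistics above describe.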

A Candidate DPO-Stage Feature Mediates Suppression

In Tulu-3-8B, sparse autoencoder analysis identifies a single feature that activates 70× more often on T3 items than on T2 items (and never on T1), is installed by DPO, and when ablated shifts T3 logit predictions, with zero effect on benign content across 40 seeds.

T1 (Benign): veto feature activation 0% (0% in all 40 seeds)
T2 (Moderate): veto feature activation <0.4% (near-zero, occasional)
T3 (Sensitive): veto feature activation 28.6% (70× ratio vs T2)
Legend: T3 logit prediction (shifts when feature ablated) · T1 logit prediction (stays zero across all 40 seeds)
Figure 8 (paper): SAE ablation, T3-selective veto feature
Left (A): The T3-selective veto feature activates on 28.6% of T3 prompts, <0.4% of T2, and 0% of T1 — a 70× ratio, sitting 36.9σ above 50 randomly sampled features. Right (B): Ablating the feature shifts T3 logit predictions by Δ=+0.250 (p=0.016, 20 seeds) — moving the model toward the human survey mean. T1 shift is exactly 0.000 in all 40 seeds. The feature is specific to culturally sensitive suppression, not a general response modulator.
01 · TopK SAE: F=8192 features, K=32. Layer-17 residual stream of Tulu-3-8B-IT (D=4096).
02 · T3-Selective Feature: F1=0.39–0.46. Fires on 28.6% of T3, <0.4% of T2, 0% of T1. 70× ratio.
03 · Ablation Effect: Zero the feature → ΔŷT3=+0.250 mean (p=0.016) across 20 seeds. ΔŷT1=0.000 in all 40 seeds.
04 · 36.9σ Outlier: Ablating 50 random features: mean Δ=0.000±0.001. Veto feature is 36.9σ above (p<10⁻⁴).
05 · DPO Installs It: DPO result stronger (p=0.003) than IT (p=0.016). Circuit present before IT stage; DPO is the installation step.
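To make the ablation step concrete, here is a toy TopK sparse autoencoder in pure Python. It is a sketch under stated assumptions, not the authors' pipeline: the dimensions are tiny (D=3, F=4, K=2 instead of D=4096, F=8192, K=32), the weights are hand-picked, and ablation is implemented as subtracting the feature's decoded contribution from the residual vector, which is one common way to "zero" a feature.

```python
def topk_encode(x, w_enc, k):
    """Keep the k largest pre-activations (after ReLU) and zero all others."""
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in w_enc]
    top = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [max(pre[i], 0.0) if i in top else 0.0 for i in range(len(pre))]


def ablate(x, w_enc, w_dec, k, feature):
    """Remove one feature's decoded contribution from the residual vector x."""
    acts = topk_encode(x, w_enc, k)
    return [xi - acts[feature] * w_dec[feature][d] for d, xi in enumerate(x)]


# Hand-picked toy weights (tied encoder and decoder), purely illustrative.
w_enc = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
w_dec = w_enc
x = [2.0, 1.0, 0.5]

print(topk_encode(x, w_enc, k=2))             # features 0 and 3 are active
print(ablate(x, w_enc, w_dec, k=2, feature=3))  # active feature: x changes
print(ablate(x, w_enc, w_dec, k=2, feature=1))  # inactive feature: x unchanged
```

Ablating a feature that does not fire on an input leaves that input untouched, which is the mechanism behind an exactly zero shift on prompts where the feature never activates.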

Logit Lens: T3−T2 gap grows negative from layer 12 onward

The logit-lens predicted digit gap (T3−T2) stays near zero in early layers and then turns increasingly negative: the DPO model reaches −0.672 at layer 17, nearly double the IT model (−0.349). Gap magnitude correlates with safety tax across models (Spearman ρ=0.61, p<0.01).

Legend: Tulu-3-8B-DPO (peak −0.672 at layer 17) · Tulu-3-8B-IT (−0.349 at layer 17) · Base model (near 0)

Residualized Probing: Early encoding, late inversion

Probing for cross-country cultural variation (after subtracting per-question means) shows a two-phase trajectory across all 7 tested models: early layers encode cultural signal (R²=+0.06–0.14), late layers counteract it (R² turns negative).

Legend: early layers 2–3, cultural info enters (R²>0) · late layers 28–31, alignment counteracts (R²<0)
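Two ingredients of this analysis can be illustrated without a model: subtracting per-question means, and reading a negative R². The sketch below uses invented numbers and does not reproduce how the probes themselves are trained, which the page does not describe.

```python
def residualize(scores):
    """scores maps question id -> {country: value}; subtract each question's mean."""
    out = {}
    for question, by_country in scores.items():
        mean = sum(by_country.values()) / len(by_country)
        out[question] = {c: v - mean for c, v in by_country.items()}
    return out


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot; negative when predictions do worse than the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical residualized targets and two sets of probe predictions.
targets = [-0.5, 0.5, -1.0, 1.0]
aligned = [-0.4, 0.4, -0.8, 0.8]     # tracks the cross-country signal
inverted = [0.4, -0.4, 0.8, -0.8]    # points the opposite way

print(residualize({"q1": {"Egypt": 3.0, "Turkey": 5.0}}))
print(r_squared(targets, aligned), r_squared(targets, inverted))
```

After residualization only cross-country variation remains, so a negative R² means the probe's predictions are worse than predicting no country difference at all.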

GPT-5 Shows the Trade-Off Is Not Fixed

There is a strong negative correlation (r=−0.61) between T3 refusal rate and T3 NVAS across all 26 models — models currently trade accuracy against refusal. GPT-5 alone occupies the top-right: high NVAS (0.756), near-zero safety tax (−2.9%).

Legend: OLMo family · Arabic-specialized · GPT-4o-mini · GPT-5 ★ · Gemma / Qwen / Other

Data & Code

🤗 HuggingFace Dataset

alignment-veto-responses

  • ~1.53M model responses (27 XLSX files)
  • 26 models · 864 questions · 16 countries
  • 6 framings · English + native languages
  • NVAS, EV-NVAS, refusal labels per row
💻 GitHub

pardissz/alignment-veto

  • analysis/ — NVAS, framing, native language
  • mechanistic/ — SAE, probing, logit lens
  • experiments/ — parallel inference scripts
  • figures/ — all paper figure generators

Citation

@article{zahraei2026alignmentveto,
  title   = {The Alignment Veto: How Safety Training Suppresses Cultural Knowledge in LLMs},
  author  = {Zahraei, Pardis Sadat and Tur, Gokhan and Hakkani-T\"{u}r, Dilek and Asgari, Ehsaneddin},
  journal = {arXiv preprint},
  year    = {2026}
}