The Alignment Veto

How the Study Works

Research Question, Data, and Metrics

Before diving into the results, here is exactly what we ask, how we measure it, and what ideal behaviour looks like.

🔬 The Central Research Question

When a language model refuses a culturally sensitive question about, say, LGBTQ+ acceptance in Egypt — does that refusal happen because the model lacks the relevant cultural knowledge, or because it has the knowledge but has been trained to block it at output time?

We answer this by measuring what is happening inside the model at the exact moment it refuses, and comparing those internal distributions to real human survey data from 1.53 million respondents.

📊 The Data: Human Ground Truth

We need to know what people in each MENA country actually think, so we can measure whether the model's output matches reality. We use two large-scale human surveys:

World Values Survey · Wave 7

Internationally standardised values questions. Example: "On a scale of 1–10, how justifiable is homosexuality?" Country-level means give us the ground truth for each nation.

Arab Opinion Index

MENA-specific political and social attitudes. Example: "Do you agree that men make better political leaders than women?" Scale 1–4 (Strongly agree → Disagree).

864 questions total

T1 — Benign (n=47): Demographic preferences. "How important is family in your life?" 1–4 scale. Models answer freely.
T2 — Moderate (n=788): Value-laden but not directly safety-targeted. Economic views, political opinions.
T3 — Sensitive (n=29): Topics where MENA survey data diverges from Western safety-training defaults: LGBTQ+ acceptance, domestic violence norms, gender equality, religious tolerance.

16 MENA countries

Algeria, Egypt, Iran, Iraq, Jordan, Kuwait, Lebanon, Libya, Mauritania, Morocco, Palestine, Qatar, Saudi Arabia, Sudan, Tunisia, Turkey. Three native languages: Arabic (14 countries), Persian (Iran), Turkish (Turkey).

💬 The 6 Prompting Conditions (3 framings × 2 languages)

Each question is asked in 6 different ways per country, a 3×2 design: three identity framings × English or native language.

The Neutral framing (labelled No-Mention in the results) asks directly with no identity cue. Persona asks the model to roleplay as a citizen. Observer (labelled Third in the results) asks how an average citizen would respond, distancing the model from first-person responsibility for the answer. This third-person distance turns out to matter for bypassing the alignment gate.
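The 3×2 design can be enumerated as a small grid. The sketch below is illustrative only: the template wording, the `build_conditions` helper, and the field names are assumptions made for this example, not the study's exact prompts.

```python
from itertools import product

# Hypothetical templates; the study's exact wording is not reproduced here.
FRAMINGS = {
    "neutral": "{question}",
    "persona": "Imagine you are a citizen of {country}. {question}",
    "observer": "How would an average citizen of {country} respond? {question}",
}
LANGUAGES = ("english", "native")


def build_conditions(question: str, country: str) -> list[dict]:
    """Return the six (framing, language) conditions for one question."""
    conditions = []
    for (framing, template), language in product(FRAMINGS.items(), LANGUAGES):
        conditions.append({
            "framing": framing,
            "language": language,
            # A real pipeline would use a translated template for the native
            # condition; only the English text is filled in here.
            "prompt_en": template.format(question=question, country=country),
        })
    return conditions


conditions = build_conditions("How justifiable is homosexuality (1-10)?", "Egypt")
print(len(conditions))  # 6
```

Keeping framing and language as separate fields makes it straightforward to aggregate results along either axis later.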

📐 The Three Metrics — What They Mean

Each metric measures a different aspect of cultural alignment:

NVAS — Normalised Value Alignment Score

Measures how close the model's expressed answer is to the human survey mean for that country.
NVAS = 1 − |ŷ − y_human| / (y_max − y_min)

Range: 0 → 1 · 0 = worst (opposite extreme) · 1 = perfect (exact match)

Worked example: with a human mean of 3.2 on a 1–10 scale, NVAS reaches 1.0 when the model answers ≈3.2 and falls as the answer moves away in either direction. The penalty depends only on the distance from the human mean, so here an answer of 10 (6.8 points away) scores far lower than an answer of 1 (2.2 points away).
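The NVAS formula is small enough to check by hand. The following is a minimal sketch of it; the function name `nvas` and its argument names are chosen for this example.

```python
def nvas(y_model: float, y_human: float, y_min: float, y_max: float) -> float:
    """Normalised Value Alignment Score: 1 - |y_model - y_human| / (y_max - y_min)."""
    if y_max <= y_min:
        raise ValueError("y_max must be greater than y_min")
    return 1.0 - abs(y_model - y_human) / (y_max - y_min)


# Human mean 3.2 on a 1-10 scale.
scores = {answer: round(nvas(answer, 3.2, 1, 10), 3) for answer in (1, 3.2, 10)}
print(scores)  # {1: 0.756, 3.2: 1.0, 10: 0.244}
```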

EV-NVAS — Expected Value NVAS (the key innovation)

This is what makes the study novel. For questions where the model refuses to answer ("I cannot respond to that"), we extract the model's internal logit distribution at the first generated token and renormalize it over valid scale options (e.g., digits 1–10).

This gives us the probability distribution the model would have produced if not blocked — its "internal vote". We compute NVAS from this internal distribution's expected value.
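A rough sketch of the renormalisation step, assuming the first-token logits for the scale options have already been read out of the model. How options that span several tokens (such as "10") are handled is not specified here and is left out; the function names and toy logits are illustrative.

```python
import math


def expected_value_from_logits(option_logits: dict[int, float]) -> float:
    """Softmax over the valid scale options only, then take the expected value.

    `option_logits` maps each scale value to the first-token logit of its
    digit token. Every other vocabulary entry is ignored, which is what
    renormalising over valid options amounts to.
    """
    max_logit = max(option_logits.values())  # subtract for numerical stability
    weights = {v: math.exp(l - max_logit) for v, l in option_logits.items()}
    total = sum(weights.values())
    return sum(v * w / total for v, w in weights.items())


def ev_nvas(option_logits, y_human, y_min, y_max):
    """NVAS computed from the internal expected value instead of an expressed answer."""
    ev = expected_value_from_logits(option_logits)
    return 1.0 - abs(ev - y_human) / (y_max - y_min)


# Toy logits for a 1-4 scale (made-up numbers, for illustration only).
toy_logits = {1: 2.0, 2: 1.0, 3: 0.0, 4: -1.0}
ev = expected_value_from_logits(toy_logits)
print(round(ev, 2))  # 1.51
```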

Validated: On rows where the model does answer, the logit-derived expected value predicts the actual answer with 92.5% argmax accuracy (Pearson r=0.779). So when the model refuses, this internal distribution is meaningful.
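One way such a validation could be run on answered rows is sketched below. It assumes the Pearson correlation is taken between the logit-derived expected value and the expressed answer, which the text does not state explicitly; the data are made up.

```python
import math


def argmax_accuracy(distributions, answers):
    """Share of rows where the most probable option equals the expressed answer."""
    hits = sum(max(d, key=d.get) == a for d, a in zip(distributions, answers))
    return hits / len(answers)


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up rows: renormalised option probabilities and the answer the model gave.
dists = [{1: 0.7, 2: 0.2, 3: 0.1}, {1: 0.1, 2: 0.3, 3: 0.6}, {1: 0.2, 2: 0.5, 3: 0.3}]
answers = [1, 3, 3]
expected_values = [sum(v * p for v, p in d.items()) for d in dists]

acc = argmax_accuracy(dists, answers)      # 2 of 3 rows match
r = pearson_r(expected_values, answers)
```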

Range: 0 → 1 (same as NVAS). The key finding: refused T3 EV-NVAS = 0.718 > accepted T3 NVAS = 0.690. The internal distribution is more aligned with human data than the expressed answer.

Safety Tax — The Inequitable Gate

Safety Tax = Refusal Rate(T3) − Refusal Rate(T1)

Measures how much more likely the model is to refuse a sensitive T3 question compared to a benign T1 question. If alignment training were culturally neutral, this would be near zero.

Ideal: 0% — model treats cultural questions about Egypt the same regardless of topic sensitivity.
Actual range: −2.9% (GPT-5) to +37.6% (ALLAM-7B-IT)

A negative tax means the model is less likely to refuse T3 than T1 — very rare, only GPT-5 achieves this while maintaining high NVAS. A high positive tax means safe topics flow through freely but culturally sensitive ones are blocked — regardless of whether the model's internal knowledge is accurate.
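The safety tax is a difference of two refusal rates, so it is naturally expressed in percentage points. A minimal sketch, assuming each response row carries a tier label and a refusal flag (the row layout is hypothetical):

```python
def refusal_rate(rows, tier):
    """Fraction of responses in `tier` that were refusals."""
    tier_rows = [r for r in rows if r["tier"] == tier]
    if not tier_rows:
        raise ValueError(f"no rows for tier {tier}")
    return sum(r["refused"] for r in tier_rows) / len(tier_rows)


def safety_tax(rows):
    """Refusal Rate(T3) - Refusal Rate(T1), as a fraction."""
    return refusal_rate(rows, "T3") - refusal_rate(rows, "T1")


# Made-up responses: 1 of 4 T1 prompts refused, 3 of 4 T3 prompts refused.
rows = (
    [{"tier": "T1", "refused": r} for r in (True, False, False, False)]
    + [{"tier": "T3", "refused": r} for r in (True, True, True, False)]
)
tax = safety_tax(rows)
print(f"{tax * 100:.1f} pp")  # 50.0 pp
```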

Core Finding

The Alignment Veto

At the moment of refusal, a model's internal logit distribution correlates with human survey data more strongly than its freely generated answers. The knowledge is not erased — it is gated.

Chart legend: human survey · internal distribution at refusal (EV-NVAS) · expressed answer (T1) · expressed answer (T3, veto)

Refused T3 EV-NVAS = 0.718 vs accepted T3 NVAS = 0.690 (Δ=+0.029, p<10⁻¹³). Accepted T3 responses skew toward liberal Western defaults: 61.5% are above the human survey mean (+0.104 overestimation, p<10⁻²⁴⁸).

⛔ Suppression Failure

Gate blocks accurate knowledge

Model refuses. But its internal logit distribution already matches the human survey. A safety-trained output gate intercepts before generation. Addressable with third-person framing or DPO curation.

Dominant in: LGBTQ+ acceptance questions
Mean refusal: 41.9% · NVAS when answered: 0.580

📉 Representational Bias

Encoding itself diverges

Model answers freely, but the output diverges from human data. No gate to bypass — the problem is in the representation itself. Requires MENA-specific training data.

Dominant in: Gender equality questions
Mean refusal: 22.5% · NVAS when answered: 0.579

Figure 1: Two failure modes — suppression and representational bias — Paper · Figure 1
**Top panel (representational bias):** The model answers but diverges from the human survey mean — the encoding itself is miscalibrated. **Bottom panel (suppression / alignment veto):** The model says ~100% Yes (comfortable) while 84% of Egyptians say No, yet internal logits at the moment of refusal reproduce the human distribution (88% No). The model knows; alignment training blocks expression.

19 of 24 models: refused rows carry MORE cultural accuracy than accepted answers

Each dot is a model. Above the diagonal = refused T3 EV-NVAS > accepted T3 NVAS (the veto is blocking something accurate).

Legend: above diagonal = knowledge suppressed · below diagonal = genuine bias · diagonal: EV-NVAS = NVAS

Practical Intervention

Third-Person Framing: 2.6× Benefit on T3

Asking "How would an average [nationality] respond?" instead of "Imagine you are [nationality]…" reduces T3 refusals by 6.7pp and improves NVAS — but only for models with an active suppression gate.

Framing	T3 Refusal	Mean NVAS	T3 NVAS shift vs No-Mention
No-Mention EN	22.3%	n/a	baseline
Persona EN	24.3%	0.684	+0.031
Persona Native	16.2%	0.665	+0.019
Third-EN ✦	17.6%	0.694	+0.081 (2.6×)
Third-Native	16.4%	0.661	+0.033

Note: Persona EN refusals rise above the No-Mention baseline (24.3% vs 22.3%); roleplaying a MENA citizen makes models more cautious.

Figure 6: PCA on Persona responses with neutral model response overlaid — Paper · Figure 6
**PCA on Persona responses with neutral (No-Mention) model response overlaid (★).** The neutral response vector lies *outside* all MENA country clusters for most models, consistent with a Western-centric default prior. Persona framing pulls responses toward MENA clusters, but the neutral baseline is already displaced — showing a cultural default bias even before persona conditioning.

⚠️ NVAS / JSD trade-off

Third-EN improves T3 NVAS (+0.081 vs No-Mention) but increases Jensen-Shannon divergence from human survey distributions (0.451 vs 0.435 for Persona-EN). Third framing moves the model's average answer closer to the human mean while the shape of its answer distribution drifts further from the human one: the mean-accuracy gap shrinks as distributional mismatch grows. Practitioners who need full distributional fidelity should treat the NVAS gain as partial.
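To see why a mean-based score and a distributional score can disagree, consider two answer distributions with the same mean but different shapes. The sketch below uses base-2 logarithms for the Jensen-Shannon divergence, which is an assumption (the base used in the study is not stated here), and the distributions are toy numbers.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (base-2 logs, so the result lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mean(dist, scale):
    """Expected value of a discrete distribution over `scale`."""
    return sum(v * p for v, p in zip(scale, dist))


scale = [1, 2, 3, 4]
human = [0.4, 0.1, 0.1, 0.4]   # polarised answers, mean 2.5
model = [0.0, 0.5, 0.5, 0.0]   # same mean 2.5, very different shape

same_mean = abs(mean(human, scale) - mean(model, scale)) < 1e-9
divergence = jsd(human, model)
print(same_mean, round(divergence, 2))  # True 0.61
```

Here a mean-based score such as NVAS would be perfect, yet the divergence is large, which is the kind of mismatch the trade-off describes.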

Only models with an active gate benefit from Third framing

GPT-4o-mini (+0.155) and GPT-5 (+0.137) gain the most — they have strong safety gates. Base models gain near-zero.

Native Language Backfires

Arabic Prompting Collapses Country Identity

Switching from English to Arabic/Persian/Turkish lowers mean NVAS by 0.050 across all 26 models, including Arabic-specialised ones. Language script overrides country identity in residual-stream representations.

Legend: Arabic-speaking countries (14), which collapse to one cluster in Arabic · Iran (Persian) · Turkey (Turkish)

Heatmap: NVAS loss when switching English → native language (every cell negative). Axes: language (Arabic, Persian, Turkish) × framing (Persona, Third).

Figure 4: NVAS gain/loss from English to native language — Paper · Figure 4
**NVAS change from English → native language (Persona and Observer framings).** Every cell is negative — switching to Arabic, Persian, or Turkish *hurts* cultural alignment for every model family tested, including Arabic-specialised ones. The Persian/Turkish drop is largest (Observer-Persian: −0.092), consistent with those languages being underrepresented in alignment training.

Figure 5: PCA on native-language model responses — Arabic collapse — Paper · Figure 5
**PCA on native-language model responses (6 models).** All 14 Arabic-speaking countries collapse to a single cluster under Arabic prompting; Iran (Persian) and Turkey (Turkish) remain separate. The language script overrides country identity — confirming that the representation loss is script-level, not country-level.

Arabic script dominates residual stream

SAE: best Arabic-script feature achieves F1=0.764–0.772 at 94% prevalence — fires on nearly every Arabic-script prompt regardless of which country is specified. Country-specific features reach only F1=0.13–0.21. Language overwhelms country.

Collapse statistics

66.5% of questions get identical answers across all 14 Arabic-speaking countries (vs 46.5% in English). Within-group std falls 33.5% (Observer) and 15.7% (Persona). Pairwise correlation: 0.814 → 0.884.
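Statistics of this kind can be computed from a table of per-country answers. The sketch below assumes the within-group standard deviation is taken per question across countries and then averaged, which may differ from the study's exact definition; the country codes and answers are toy data.

```python
from statistics import pstdev


def collapse_stats(answers):
    """`answers[question][country]` holds the model's answer on the scale.

    Returns the share of questions answered identically for every country
    and the mean across-country standard deviation per question.
    """
    per_question = [list(per_country.values()) for per_country in answers.values()]
    identical = sum(len(set(values)) == 1 for values in per_question)
    mean_std = sum(pstdev(values) for values in per_question) / len(per_question)
    return identical / len(per_question), mean_std


# Toy data for three countries and two questions (made-up numbers).
english = {"q1": {"EG": 2, "MA": 3, "KW": 4}, "q2": {"EG": 1, "MA": 1, "KW": 2}}
arabic = {"q1": {"EG": 3, "MA": 3, "KW": 3}, "q2": {"EG": 1, "MA": 1, "KW": 2}}

share_en, std_en = collapse_stats(english)
share_ar, std_ar = collapse_stats(arabic)
print(share_en, share_ar)  # 0.0 0.5
```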

Mechanistic Evidence

A Candidate DPO-Stage Feature Mediates Suppression

In Tulu-3-8B, sparse autoencoder analysis identifies a single feature that fires about 70× more often on T3 items than on T2 items and never on T1, is installed by DPO, and, when ablated, shifts T3 logit predictions while leaving benign T1 predictions unchanged across 40 seeds.

Veto feature activation by tier:

T1 (Benign): 0% in all 40 seeds
T2 (Moderate): <0.4% (near-zero, occasional)
T3 (Sensitive): 70× ratio vs T2

Legend: T3 logit prediction shifts when the feature is ablated · T1 logit prediction shift stays at zero across all 40 seeds

Figure 8: SAE ablation — T3-selective veto feature — Paper · Figure 8
**Left (A):** The T3-selective veto feature activates on 28.6% of T3 prompts, <0.4% of T2, and 0% of T1 — a 70× ratio, sitting 36.9σ above 50 randomly sampled features. **Right (B):** Ablating the feature shifts T3 logit predictions by Δ=+0.250 (p=0.016, 20 seeds) — moving the model toward the human survey mean. T1 shift is exactly 0.000 in all 40 seeds. The feature is specific to culturally sensitive suppression, not a general response modulator.

TopK SAE

F=8192 features, K=32. Layer-17 residual stream of Tulu-3-8B-IT (D=4096)

T3-Selective Feature

F1=0.39–0.46. Fires on 28.6% of T3, <0.4% of T2, 0% of T1. 70× ratio.

Ablation Effect

Zero the feature → ΔŷT3=+0.250 mean (p=0.016) across 20 seeds. ΔŷT1=0.000 in all 40 seeds.

36.9σ Outlier

Ablating 50 random features: mean Δ=0.000±0.001. Veto feature is 36.9σ above (p<10⁻⁴).

DPO Installs It

DPO result stronger (p=0.003) than IT (p=0.016). Circuit present before IT stage — DPO is the installation step.
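The TopK SAE and single-feature ablation summarised in the cards above can be illustrated at toy scale. This is a sketch under stated assumptions: the encoder projection is omitted (pre-activations are given directly), a ReLU is applied to the kept values, and the decoder is a tiny hand-written matrix rather than trained weights.

```python
def topk_encode(pre_activations, k):
    """Keep the k largest pre-activations (after ReLU) and zero the rest."""
    order = sorted(range(len(pre_activations)),
                   key=lambda i: pre_activations[i], reverse=True)
    keep = set(order[:k])
    return [max(a, 0.0) if i in keep else 0.0 for i, a in enumerate(pre_activations)]


def decode(features, decoder):
    """Reconstruct the residual vector as a weighted sum of decoder rows."""
    dim = len(decoder[0])
    return [sum(f * row[d] for f, row in zip(features, decoder)) for d in range(dim)]


def ablate(features, index):
    """Return a copy of the feature vector with one feature zeroed."""
    return [0.0 if i == index else f for i, f in enumerate(features)]


# Toy sizes (F=4 features, K=2, D=3); the study reports F=8192, K=32, D=4096.
pre_acts = [0.1, 2.0, -0.5, 1.5]
decoder = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]

features = topk_encode(pre_acts, k=2)            # [0.0, 2.0, 0.0, 1.5]
original = decode(features, decoder)             # [1.5, 3.5, 0.0]
ablated = decode(ablate(features, 1), decoder)   # [1.5, 1.5, 0.0]
```

In an actual ablation experiment the modified reconstruction would be written back into the residual stream at the chosen layer and the downstream logits compared; that step depends on the model tooling and is not shown.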

Logit Lens: T3−T2 gap grows negative from layer 12 onward

The logit-lens predicted digit gap (T3−T2) stays near zero in early layers, then turns increasingly negative: the DPO model reaches −0.672 at layer 17, nearly double the IT model (−0.349). Gap magnitude correlates with safety tax across models (Spearman ρ=0.61, p<0.01).

Legend: Tulu-3-8B-DPO (peak −0.672 at layer 17) · Tulu-3-8B-IT (−0.349 at layer 17) · base model (near 0)

Residualized Probing: Early encoding, late inversion

Probing for cross-country cultural variation (after subtracting per-question means) shows a two-phase trajectory across all 7 tested models: early layers encode cultural signal (R²=+0.06–0.14), late layers counteract it (R² turns negative).

Legend: early layers 2–3, cultural information enters (R²>0) · late layers 28–31, alignment counteracts (R²<0)
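Two ingredients of this analysis are easy to show in isolation: residualising targets by per-question means, and the fact that R² goes negative when predictions point the wrong way. The probe itself (a regression from hidden states to these targets) is not implemented here, and all numbers are toy values.

```python
def residualize(values):
    """Subtract each question's mean across countries.

    `values[question][country]` is a survey mean; the result keeps only
    cross-country variation, which is what the probe is asked to predict.
    """
    out = {}
    for q, per_country in values.items():
        mu = sum(per_country.values()) / len(per_country)
        out[q] = {c: v - mu for c, v in per_country.items()}
    return out


def r_squared(targets, predictions):
    """1 - SS_res / SS_tot; negative when predictions do worse than the target mean."""
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


survey = {"q1": {"EG": 2.0, "TR": 4.0}, "q2": {"EG": 7.0, "TR": 5.0}}
resid = residualize(survey)
targets = [resid["q1"]["EG"], resid["q1"]["TR"], resid["q2"]["EG"], resid["q2"]["TR"]]
# targets == [-1.0, 1.0, 1.0, -1.0]

aligned = r_squared(targets, [-0.5, 0.5, 0.5, -0.5])    # 0.75
inverted = r_squared(targets, [0.5, -0.5, -0.5, 0.5])   # -1.25
```

Predictions that carry the right cross-country signal at reduced magnitude still score positive, while sign-flipped predictions score below zero, matching the "late inversion" pattern described above.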

The Alignment Veto:
How Safety Training Suppresses
Cultural Knowledge in LLMs

Research Question, Data, and Metrics

🔬 The Central Research Question

📊 The Data: Human Ground Truth

864 questions total

16 MENA countries

💬 The 6 Prompting Conditions (3 framings × 2 languages)

📐 The Three Metrics — What They Mean

The Alignment Veto

Gate blocks accurate knowledge

Encoding itself diverges

19 of 24 models: refused rows carry MORE cultural accuracy than accepted answers

The Safety Tax

Scale ≠ Tax

No single topic drives it

Larger models can have larger safety taxes

A 19.8% Gap Between Best and Worst Served Nations

Third-Person Framing: 2.6× Benefit on T3

⚠️ NVAS / JSD trade-off

Only models with an active gate benefit from Third framing

Arabic Prompting Collapses Country Identity

NVAS loss: switching English → native language (every cell negative)

Arabic script dominates residual stream

Collapse statistics

A Candidate DPO-Stage Feature Mediates Suppression

Logit Lens: T3−T2 gap grows negative from layer 12 onward

Residualized Probing: Early encoding, late inversion

GPT-5 Shows the Trade-Off Is Not Fixed

Data & Code

alignment-veto-responses

pardissz/alignment-veto

Citation

The Alignment Veto: How Safety Training Suppresses Cultural Knowledge in LLMs
