arXiv 2604.12710 · 2026-04-13
LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck
Junxiao Yang, Haoran Liu, Jinzhe Tu
LASA identifies an intermediate “semantic bottleneck” layer where representations become language-neutral, and shows that current safety alignment lives further downstream, which explains why low-resource-language jailbreaks succeed.
Anchoring alignment at the bottleneck layer drops attack success rate on LLaMA-3.1-8B-Instruct from 24.7% to 2.8%, with consistent gains across other models.
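The core idea of locating a language-neutral layer can be sketched with a simple probe: take the same prompt translated into several languages, grab per-layer hidden states, and score each layer by how similar the representations are across languages. This is my sketch, not the paper's method; `find_bottleneck_layer`, the cosine-similarity score, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def find_bottleneck_layer(hidden: np.ndarray) -> int:
    """Return the layer whose representations are most language-neutral.

    hidden: shape (n_layers, n_languages, dim) -- the hidden state of the
    same prompt translated into several languages, at every layer.
    Language-neutrality is scored as mean pairwise cosine similarity
    across languages (an assumption; the paper's exact probe may differ).
    """
    n_layers, n_langs, _ = hidden.shape
    # L2-normalize so dot products are cosine similarities.
    normed = hidden / np.linalg.norm(hidden, axis=-1, keepdims=True)
    scores = []
    for layer in range(n_layers):
        sims = normed[layer] @ normed[layer].T          # (n_langs, n_langs)
        off_diag = sims[~np.eye(n_langs, dtype=bool)]   # drop self-similarity
        scores.append(off_diag.mean())
    return int(np.argmax(scores))

# Synthetic demo: layer 2 is built to share one direction across "languages",
# so the probe should pick it out.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 4, 16))                    # 5 layers, 4 langs, dim 16
shared = rng.normal(size=16)
hidden[2] = shared + 0.05 * rng.normal(size=(4, 16))    # near-identical across langs
print(find_bottleneck_layer(hidden))                    # -> 2
```

With a real model you would feed parallel prompts through something like a Hugging Face model with `output_hidden_states=True` and stack the resulting layer activations into this array; the synthetic tensor here just stands in for that.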
Practitioner note (mine)
If you ship a multilingual chat or agent product, surface-token safety training is leaky in any language outside English and Chinese. This paper's contribution is identifying where in the model the alignment is actually anchored: a cleaner intervention point than chasing every locale.
For most builders this is upstream of what you can do (you don’t retrain frontier models). But it informs vendor selection: ask your model provider whether their safety alignment is bottleneck-anchored, especially if your user base spans many languages.