arXiv 2604.12710 · 2026-04-13
LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck
Junxiao Yang, Haoran Liu, Jinzhe Tu
LASA identifies an intermediate “semantic bottleneck” layer where representations become language-neutral, and shows that current safety alignment lives further downstream, which explains why low-resource-language jailbreaks succeed.
Anchoring alignment at the bottleneck layer drops attack success rate on LLaMA-3.1-8B-Instruct from 24.7% to 2.8%, with consistent gains across other models.
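The core idea of locating a language-neutral layer can be sketched with a simple probe: take the same prompt translated into several languages, grab per-layer hidden states, and score each layer by how similar the representations are across languages. This is my sketch, not the paper's method; `find_bottleneck_layer`, the cosine-similarity score, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def find_bottleneck_layer(hidden: np.ndarray) -> int:
    """Return the layer whose representations are most language-neutral.

    hidden: shape (n_layers, n_languages, dim) -- the hidden state of the
    same prompt translated into several languages, at every layer.
    Language-neutrality is scored as mean pairwise cosine similarity
    across languages (an assumption; the paper's exact probe may differ).
    """
    n_layers, n_langs, _ = hidden.shape
    # L2-normalize so dot products are cosine similarities.
    normed = hidden / np.linalg.norm(hidden, axis=-1, keepdims=True)
    scores = []
    for layer in range(n_layers):
        sims = normed[layer] @ normed[layer].T          # (n_langs, n_langs)
        off_diag = sims[~np.eye(n_langs, dtype=bool)]   # drop self-similarity
        scores.append(off_diag.mean())
    return int(np.argmax(scores))

# Synthetic demo: layer 2 is built to share one direction across "languages",
# so the probe should pick it out.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 4, 16))                    # 5 layers, 4 langs, dim 16
shared = rng.normal(size=16)
hidden[2] = shared + 0.05 * rng.normal(size=(4, 16))    # near-identical across langs
print(find_bottleneck_layer(hidden))                    # -> 2
```

With a real model you would feed parallel prompts through something like a Hugging Face model with `output_hidden_states=True` and stack the resulting layer activations into this array; the synthetic tensor here just stands in for that.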
Practitioner note (mine)
If you ship a multilingual chat or agent product, surface-token safety training is leaky in any language outside English and Chinese. This paper's contribution is identifying where in the model the alignment is actually anchored: a cleaner intervention point than chasing every locale.
For most builders this is upstream of what you can do (you don’t retrain frontier models). But it informs vendor selection: ask your model provider whether their safety alignment is bottleneck-anchored, especially if your user base spans many languages.