Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures
Published in Annual Meeting of the Association for Computational Linguistics (ACL 2026 Main Conference), 2026
We present a systematic survey of intrinsic interpretability for Large Language Models. Unlike existing surveys, which center on post-hoc explanation methods, we examine transparency built directly into model architectures and computations. We categorize existing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction.
Yutong Gao*, Qinglin Meng*, Yuan Zhou, Liangming Pan
