Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet principled explanations for their underlying mechanisms and for several phenomena, such as scaling laws, hallucinations, and related behaviors, remain elusive. In this work, we revisit the classical relationship between compression and prediction, grounded in Kolmogorov complexity and Shannon information theory, to provide deeper insights into LLM behaviors. By leveraging the Kolmogorov Structure Function and interpreting LLM compression as a two-part coding process, we offer a detailed view of how LLMs acquire and store information across increasing model and data scales, from pervasive syntactic patterns to progressively rarer knowledge elements. Motivated by this theoretical perspective and by natural assumptions inspired by Heaps' and Zipf's laws, we introduce a simplified yet representative hierarchical data-generation framework called the Syntax-Knowledge model. Under the Bayesian setting, we show that prediction and compression within this model naturally lead to the diverse learning and scaling behaviors of LLMs. In particular, our theoretical analysis offers intuitive and principled explanations for data and model scaling laws, the dynamics of knowledge acquisition during training and fine-tuning, and factual knowledge hallucinations in LLMs. Experimental results validate our theoretical predictions.
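For reference, the Zipf's and Heaps' laws invoked above are commonly stated in the following standard empirical forms (the exponents $s$ and $\beta$ here are generic symbols, not the specific parameters assumed in the talk):
% Zipf's law: the frequency of the r-th most common word type decays as a power of its rank.
\[ f(r) \propto \frac{1}{r^{s}}, \qquad s \approx 1, \]
% Heaps' law: the number of distinct word types V grows sublinearly with corpus size n.
\[ V(n) \propto n^{\beta}, \qquad 0 < \beta < 1. \]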
Speaker bio: Jian Li (李建) is a tenured professor and doctoral advisor at the Institute for Interdisciplinary Information Sciences, Tsinghua University. His research interests include theoretical computer science, the theoretical foundations of artificial intelligence, and financial technology. He has published over 100 papers in major international conferences and journals, receiving best paper awards at VLDB (a top database conference) and ESA (the European Symposium on Algorithms), the best newcomer award at the database theory conference ICDT, and several oral-presentation or spotlight selections. He has been selected for a national-level young talents program and has led or participated in multiple National Natural Science Foundation projects and industry collaboration projects.