Paper Title
Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments
Paper Authors
Abstract
Neural scaling laws define a predictable relationship between a model's parameter count and its performance after training in the form of a power law. However, most research to date has not explicitly investigated whether scaling laws can be used to accelerate model development. In this work, we perform such an empirical investigation across a wide range of language understanding tasks, starting from models with as few as 10K parameters, and evaluate downstream performance across 9 language understanding tasks. We find that scaling laws emerge at finetuning time in some NLP tasks, and that they can also be exploited for debugging convergence when training large models. Moreover, for tasks where scaling laws exist, they can be used to predict the performance of larger models, which enables effective model selection. However, revealing scaling laws requires careful hyperparameter tuning and multiple runs for the purpose of uncertainty estimation, which incurs additional overhead, partially offsetting the computational benefits.
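As a rough illustration of the kind of extrapolation the abstract describes, the sketch below fits a saturating power law L(N) = a·N^(-b) + c to small-model results and uses it to predict the error of a much larger model. The parameter counts, error values, and the specific functional form are illustrative assumptions, not numbers or the exact fitting procedure from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # Assumed scaling form: error decays as a power of parameter count n,
    # saturating at an irreducible error c.
    return a * np.power(n, -b) + c

# Hypothetical small-scale results: parameter counts and validation errors.
params = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
errors = np.array([0.62, 0.55, 0.47, 0.41, 0.36])

# Fit the power law; pcov gives a rough uncertainty estimate for the fit.
popt, pcov = curve_fit(power_law, params, errors, p0=[1.0, 0.1, 0.1], maxfev=10000)
a, b, c = popt

# Extrapolate to a larger (hypothetical) 100M-parameter model.
predicted = power_law(1e8, a, b, c)
print(f"fit: a={a:.3f}, b={b:.3f}, c={c:.3f}; "
      f"predicted error at 100M params: {predicted:.3f}")
```

In practice, as the abstract notes, such fits are only reliable when each small-scale point comes from tuned hyperparameters and is averaged over multiple runs so the uncertainty of the extrapolation can be quantified.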