Paper Title
SimSCOOD: Systematic Analysis of Out-of-Distribution Generalization in Fine-tuned Source Code Models
Paper Authors
Abstract
Large code datasets have become increasingly accessible for pre-training source code models. However, for the fine-tuning phase, obtaining representative training data that fully covers the code distribution of a specific downstream task remains challenging due to the task-specific nature and limited labeling resources. Moreover, fine-tuning pre-trained models can result in forgetting previously acquired pre-training knowledge. These factors lead to out-of-distribution (OOD) generalization issues with unexpected model inference behaviors that have not yet been systematically studied. In this paper, we contribute the first systematic approach that simulates various OOD scenarios along different dimensions of source code data properties and studies the fine-tuned model behaviors in such scenarios. We investigate the behaviors of models under different fine-tuning methodologies, including full fine-tuning and Low-Rank Adaptation (LoRA) fine-tuning. Our comprehensive analysis, conducted on four state-of-the-art pre-trained models and applied to two code generation tasks, exposes multiple failure modes attributed to OOD generalization issues. Additionally, our analysis uncovers that LoRA fine-tuning consistently exhibits significantly better OOD generalization performance than full fine-tuning across various scenarios.
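As background on the two fine-tuning methodologies the abstract contrasts: full fine-tuning updates every pretrained weight, whereas LoRA freezes the pretrained weights and learns only a low-rank additive update. Below is a minimal NumPy sketch of this idea; all names, dimensions, and the rank/scaling values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 16, 4, 8  # hypothetical layer sizes, rank, scaling

W = rng.normal(size=(d_out, d_in))       # pretrained weight, kept frozen
A = rng.normal(size=(r, d_in)) * 0.01    # trainable down-projection factor
B = np.zeros((d_out, r))                 # trainable up-projection, zero-initialized

def lora_forward(x):
    # y = x W^T + (alpha/r) * x (B A)^T  -- only A and B receive gradients,
    # so the effective weight is W + (alpha/r) * B @ A, a rank-r update.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(2, d_in))

# With B initialized to zero, the low-rank delta vanishes, so LoRA starts
# out reproducing the frozen pretrained model exactly.
assert np.allclose(lora_forward(x), x @ W.T)

# LoRA trains far fewer parameters than full fine-tuning of W.
print(r * (d_in + d_out), "trainable vs", d_in * d_out, "full")  # 128 vs 256
```

Because only `A` and `B` move while `W` stays frozen, the fine-tuned model can drift less from the pretrained weights, which is one intuition for why LoRA might retain pre-training knowledge better under distribution shift.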