Paper Title
CowClip: Reducing CTR Prediction Model Training Time from 12 hours to 10 minutes on 1 GPU
Paper Authors
Paper Abstract
The click-through rate (CTR) prediction task is to predict whether a user will click on a recommended item. As mind-boggling amounts of data are produced online daily, accelerating CTR prediction model training is critical to keeping the model up to date and reducing the training cost. One approach to increase the training speed is to apply large batch training. However, as shown in computer vision and natural language processing tasks, training with a large batch easily suffers from a loss of accuracy. Our experiments show that previous scaling rules fail in the training of CTR prediction neural networks. To tackle this problem, we first theoretically show that the differing frequencies of IDs make it challenging to scale hyperparameters when scaling the batch size. To stabilize the training process in a large batch size setting, we develop adaptive Column-wise Clipping (CowClip). It enables an easy and effective scaling rule for the embeddings, which keeps the learning rate unchanged and scales the L2 loss. We conduct extensive experiments with four CTR prediction networks on two real-world datasets and successfully scale the batch size to 128 times the original without accuracy loss. In particular, for training the CTR prediction model DeepFM on the Criteo dataset, our optimization framework enlarges the batch size from 1K to 128K with over 0.1% AUC improvement and reduces training time from 12 hours to 10 minutes on a single V100 GPU. Our code is available at https://github.com/bytedance/LargeBatchCTR.
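The abstract describes CowClip only at a high level: embedding gradients are clipped per ID with an adaptive threshold, while the learning rate stays fixed and the L2 penalty is scaled with the batch size. Below is a minimal PyTorch sketch of one plausible reading of per-ID adaptive clipping, where the threshold is tied to each embedding row's current norm. The function name `column_wise_clip` and the hyperparameters `ratio` and `floor` are illustrative assumptions, not the paper's exact rule; the authors' implementation is in the linked repository.

```python
import torch

def column_wise_clip(grad: torch.Tensor, weight: torch.Tensor,
                     ratio: float = 1.0, floor: float = 1e-5) -> torch.Tensor:
    """Sketch: clip each embedding row's gradient to a threshold tied to
    that row's weight norm. `ratio` and `floor` are illustrative values,
    not the paper's published hyperparameters."""
    # Per-row norms of the incoming gradient and the current embedding weights.
    grad_norm = grad.norm(dim=1, keepdim=True)      # shape: (num_ids, 1)
    weight_norm = weight.norm(dim=1, keepdim=True)  # shape: (num_ids, 1)
    # Adaptive threshold: proportional to the embedding norm, with a floor
    # so freshly initialized (near-zero) rows can still receive updates.
    threshold = ratio * torch.clamp(weight_norm, min=floor)
    # Standard norm clipping: rescale rows whose gradient norm exceeds it.
    scale = torch.clamp(threshold / (grad_norm + 1e-12), max=1.0)
    return grad * scale

# Example: clip the gradient of a 10,000-ID, 16-dimensional embedding table.
table = torch.nn.Embedding(10_000, 16)
grad = torch.randn(10_000, 16)
clipped = column_wise_clip(grad, table.weight.detach())
```

Because the threshold adapts per ID, rarely seen IDs (whose embeddings stay near their small initial norm) are clipped more tightly than frequent ones, which is consistent with the abstract's point that differing ID frequencies are what break uniform hyperparameter scaling at large batch sizes.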