Optimizing Learning Rate Schedules for Iterative Pruning of Deep Neural Networks

Paper Title

Optimizing Learning Rate Schedules for Iterative Pruning of Deep Neural Networks

Authors

Liu, Shiyu; Ghosh, Rohan; Min, John Tan Chong; Motani, Mehul

Abstract

The importance of learning rate (LR) schedules for network pruning has been observed in a few recent works. For example, Frankle and Carbin (2019) highlighted that winning tickets (i.e., accuracy-preserving subnetworks) cannot be found without applying an LR warmup schedule, and Renda, Frankle and Carbin (2020) demonstrated that rewinding the LR to its initial state at the end of each pruning cycle improves performance. In this paper, we go one step further by first providing a theoretical justification for the surprising effect of LR schedules. Next, we propose an LR schedule for network pruning called SILO, which stands for S-shaped Improved Learning rate Optimization. The advantages of SILO over existing state-of-the-art (SOTA) LR schedules are two-fold: (i) SILO has a strong theoretical motivation and dynamically adjusts the LR during pruning to improve generalization. Specifically, SILO increases the LR upper bound (max_lr) in an S-shape. This leads to an improvement of 2% to 4% in extensive experiments with various types of networks (e.g., Vision Transformers, ResNet) on popular datasets such as ImageNet and CIFAR-10/100. (ii) In addition to the strong theoretical motivation, SILO is empirically optimal in the sense of matching an Oracle, which exhaustively searches for the optimal value of max_lr via grid search. We find that SILO is able to precisely adjust the value of max_lr to be within the Oracle-optimized interval, resulting in performance competitive with the Oracle at significantly lower complexity.
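To make the idea of an S-shaped max_lr more concrete, the following is a minimal sketch in Python. It is not the paper's formula: the abstract does not give SILO's exact parametrization, so the logistic curve, the function names (`s_shaped_max_lr`, `lr_within_cycle`, `build_schedule`), the bounds `lr_low` and `lr_high`, the `steepness` value, and the warmup-plus-cosine shape inside each cycle are all assumptions chosen for illustration. The sketch only shows how a per-cycle LR schedule with warmup can be restarted at every pruning cycle while its upper bound grows along an S-shaped curve.

```python
import math


def s_shaped_max_lr(cycle, num_cycles, lr_low=0.01, lr_high=0.04, steepness=10.0):
    """Hypothetical S-shaped (logistic) growth of max_lr over pruning cycles.

    lr_low, lr_high and steepness are illustrative values, not from the paper.
    """
    if num_cycles < 2:
        return lr_low
    # Map the cycle index to [0, 1] and centre the logistic curve at 0.5.
    t = cycle / (num_cycles - 1)
    s = 1.0 / (1.0 + math.exp(-steepness * (t - 0.5)))
    return lr_low + (lr_high - lr_low) * s


def lr_within_cycle(step, steps_per_cycle, max_lr, warmup_steps=10, min_lr=0.0):
    """Linear warmup up to max_lr, then cosine decay (an assumed shape)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, steps_per_cycle - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def build_schedule(num_cycles, steps_per_cycle):
    """Restart the LR schedule at each pruning cycle with a larger max_lr."""
    schedule = []
    for cycle in range(num_cycles):
        max_lr = s_shaped_max_lr(cycle, num_cycles)
        schedule.append(
            [lr_within_cycle(s, steps_per_cycle, max_lr) for s in range(steps_per_cycle)]
        )
    return schedule


if __name__ == "__main__":
    for cycle, lrs in enumerate(build_schedule(num_cycles=8, steps_per_cycle=50)):
        print(f"pruning cycle {cycle}: peak LR = {max(lrs):.4f}")
```

Running the script prints the peak LR of each pruning cycle: it stays close to `lr_low` in the early cycles, rises quickly around the middle, and flattens out near `lr_high` in the last cycles. In a real pruning pipeline the per-step values would be passed to the optimizer between pruning steps.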
