Paper Title
What Makes Convolutional Models Great on Long Sequence Modeling?
Paper Authors
Paper Abstract
Convolutional models have been widely used in multiple domains. However, most existing models use only local convolution, which prevents them from handling long-range dependencies efficiently. Attention overcomes this problem by aggregating global information, but it also makes the computational complexity quadratic in the sequence length. Recently, Gu et al. [2021] proposed a model called S4, inspired by state space models. S4 can be efficiently implemented as a global convolutional model whose kernel size equals the input sequence length. S4 can model much longer sequences than Transformers and achieves significant gains over the state of the art on several long-range tasks. Despite its empirical success, S4 is involved: it requires sophisticated parameterization and initialization schemes, which makes it less intuitive and harder to use. Here we aim to demystify S4 and extract the basic principles that contribute to its success as a global convolutional model. We focus on the structure of the convolution kernel and identify two critical but intuitive principles enjoyed by S4 that are sufficient to build an effective global convolutional model: 1) The parameterization of the convolutional kernel needs to be efficient, in the sense that the number of parameters should scale sub-linearly with the sequence length. 2) The kernel needs to satisfy a decaying structure, in which the weights for convolving with closer neighbors are larger than those for more distant ones. Based on these two principles, we propose a simple yet effective convolutional model called Structured Global Convolution (SGConv). SGConv exhibits strong empirical performance on several tasks: 1) SGConv surpasses S4 on the Long Range Arena and Speech Commands datasets while being faster. 2) When plugged into standard language and vision models, SGConv shows the potential to improve both efficiency and performance.
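To make the two principles concrete, below is a minimal, illustrative sketch (not the paper's exact SGConv parameterization): a small set of short sub-kernels is upsampled to cover progressively longer spans (sub-linear parameter count) and damped by a decay factor (decaying structure), and the resulting full-length kernel is applied with an FFT so the global convolution costs O(L log L) rather than O(L^2). The function names, sub-kernel sizes, upsampling scheme, and decay value are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F


def build_decaying_kernel(sub_kernels, seq_len, decay=0.7):
    """Illustrative global kernel following the two principles in the abstract.

    1) Sub-linear parameterization: only a few short sub-kernels are learned,
       then interpolated to cover longer and longer segments.
    2) Decaying structure: segments farther from position 0 are scaled down.

    The sizes, the doubling scheme, and `decay` are illustrative assumptions,
    not the exact SGConv construction.
    """
    pieces = []
    for i, k in enumerate(sub_kernels):
        scale = 2 ** i  # each sub-kernel covers a segment twice as long as the last
        piece = F.interpolate(k[None, None, :], scale_factor=scale,
                              mode="linear", align_corners=False)[0, 0]
        pieces.append(piece * decay ** i)  # damp more distant segments
    kernel = torch.cat(pieces)[:seq_len]
    # Zero-pad in case the concatenated pieces are shorter than the sequence.
    return F.pad(kernel, (0, seq_len - kernel.shape[-1]))


def global_conv(x, kernel):
    """Full-length convolution via FFT in O(L log L), so the kernel can be as
    long as the input sequence without quadratic cost."""
    L = x.shape[-1]
    x_f = torch.fft.rfft(x, n=2 * L)
    k_f = torch.fft.rfft(kernel, n=2 * L)
    return torch.fft.irfft(x_f * k_f, n=2 * L)[..., :L]


# Usage: 3 sub-kernels of 16 parameters each cover a sequence of length
# 16 + 32 + 64 = 112 with only 48 learned parameters.
sub_kernels = [torch.randn(16) * 0.1 for _ in range(3)]
x = torch.randn(1, 112)                      # (batch, seq_len)
k = build_decaying_kernel(sub_kernels, seq_len=112)
y = global_conv(x, k)
print(y.shape)                               # torch.Size([1, 112])
```

In this sketch, doubling the covered sequence length only requires one additional short sub-kernel, so the parameter count grows logarithmically with the sequence length, while the `decay ** i` factor enforces that weights for nearby positions dominate those for distant ones.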