Paper Title

General-purpose, long-context autoregressive modeling with Perceiver AR

Authors

Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Mateusz Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, Hannah Sheahan, Neil Zeghidour, Jean-Baptiste Alayrac, João Carreira, Jesse Engel

Abstract

Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively expensive to scale to the number of inputs and layers needed to capture this long-range structure. We develop Perceiver AR, an autoregressive, modality-agnostic architecture which uses cross-attention to map long-range inputs to a small number of latents while also maintaining end-to-end causal masking. Perceiver AR can directly attend to over a hundred thousand tokens, enabling practical long-context density estimation without the need for hand-crafted sparsity patterns or memory mechanisms. When trained on images or music, Perceiver AR generates outputs with clear long-term coherence and structure. Our architecture also obtains state-of-the-art likelihood on long-sequence benchmarks, including 64 x 64 ImageNet images and PG-19 books.
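The abstract's core mechanism is a causally masked cross-attention that maps a long input sequence onto a small number of latents, each of which may only attend to input positions up to its own. A minimal single-head NumPy sketch of that idea is below; the projection matrices, latent count, and dimensions are illustrative assumptions, and the real model additionally uses learned multi-head projections and a deep stack of latent self-attention layers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_cross_attention(inputs, num_latents, d):
    """Map a long input sequence to a few latents with causal masking.

    inputs: (seq_len, d) array of token embeddings. The last
    `num_latents` positions act as queries; a query at absolute
    position p may attend only to input positions <= p.
    Illustrative sketch only, not the paper's implementation.
    """
    seq_len, _ = inputs.shape
    rng = np.random.default_rng(0)
    # Hypothetical learned projection matrices (random here).
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    queries = inputs[-num_latents:] @ Wq           # (num_latents, d)
    keys, values = inputs @ Wk, inputs @ Wv        # (seq_len, d)
    scores = queries @ keys.T / np.sqrt(d)         # (num_latents, seq_len)
    # Causal mask: the query at position seq_len - num_latents + i
    # may only see input positions 0 .. seq_len - num_latents + i.
    q_pos = np.arange(seq_len - num_latents, seq_len)[:, None]
    k_pos = np.arange(seq_len)[None, :]
    scores = np.where(k_pos <= q_pos, scores, -np.inf)
    return softmax(scores) @ values                # (num_latents, d)

# A long sequence is reduced to a handful of latent vectors:
latents = causal_cross_attention(np.ones((1024, 8)) * 0.01, num_latents=16, d=8)
print(latents.shape)  # (16, 8)
```

Because the expensive attention over the full input happens only once at the entry point, subsequent self-attention layers operate on the much smaller latent array, which is what makes contexts of over a hundred thousand tokens tractable.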
