Paper title
Protein language models trained on multiple sequence alignments learn phylogenetic relationships
Paper authors
Paper abstract
Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold's EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs, either without or with phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer versus inferred Potts models.
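Below is a minimal sketch, not the paper's exact procedure, of the kind of comparison the abstract describes: computing pairwise Hamming distances between the sequences of an MSA and correlating them with a simple average of MSA Transformer's column attentions. The assumed shape of `column_attentions` (layers, heads, columns, rows, rows) and the helper names are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.stats import pearsonr

def hamming_distance_matrix(msa):
    """Pairwise normalized Hamming distances between aligned sequences.

    msa: list of equal-length strings (the rows of a multiple sequence alignment).
    Returns an (M, M) array of fractions of mismatched columns.
    """
    arr = np.array([list(seq) for seq in msa])          # shape (M, L)
    n_seqs = arr.shape[0]
    dist = np.zeros((n_seqs, n_seqs))
    for i in range(n_seqs):
        for j in range(i + 1, n_seqs):
            d = np.mean(arr[i] != arr[j])               # fraction of differing positions
            dist[i, j] = dist[j, i] = d
    return dist

def correlate_attention_with_hamming(column_attentions, msa):
    """Correlate a simple combination of column attentions with Hamming distances.

    column_attentions: assumed array of shape (layers, heads, L, M, M)
    extracted from an MSA-based language model for the same MSA.
    """
    avg_attn = column_attentions.mean(axis=(0, 1, 2))   # average to one (M, M) matrix
    dist = hamming_distance_matrix(msa)
    iu = np.triu_indices(len(msa), k=1)                 # off-diagonal sequence pairs only
    r, _ = pearsonr(avg_attn[iu], dist[iu])
    return r
```

Here the attentions are averaged uniformly over layers, heads and columns purely for illustration; the paper reports that simple, universal combinations of column attentions of this kind correlate strongly with the Hamming distances.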