Distributed Many-to-Many Protein Sequence Alignment using Sparse Matrices

Paper Title

Distributed Many-to-Many Protein Sequence Alignment using Sparse Matrices

Authors

Oguz Selvitopi, Saliya Ekanayake, Giulia Guidi, Georgios Pavlopoulos, Ariful Azad, Aydin Buluc

Abstract

Identifying similar protein sequences is a core step in many computational biology pipelines such as detection of homologous protein sequences, generation of similarity protein graphs for downstream analysis, functional annotation and gene location. Performance and scalability of protein similarity searches have proven to be a bottleneck in many bioinformatics pipelines due to increases in cheap and abundant sequencing data. This work presents a new distributed-memory software, PASTIS. PASTIS relies on sparse matrix computations for efficient identification of possibly similar proteins. We use distributed sparse matrices for scalability and show that the sparse matrix infrastructure is a great fit for protein similarity searches when coupled with a fully-distributed dictionary of sequences that allows remote sequence requests to be fulfilled. Our algorithm incorporates the unique bias in amino acid sequence substitution in searches without altering the basic sparse matrix model, and in turn, achieves ideal scaling up to millions of protein sequences.
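
The abstract does not spell out the matrix formulation, so the following is only a minimal sketch of one common way to phrase candidate detection as a sparse matrix product. It assumes a 0/1 matrix A with one row per sequence and one column per k-mer; the nonzeros of A·Aᵀ then count the k-mers shared by each pair of sequences. The function names, the k-mer length, and the `min_shared` threshold are hypothetical, and the sketch runs on a single machine with no substitution-aware matching.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs(seqs, k=3, min_shared=2):
    """Count shared k-mers per sequence pair (the nonzeros of A * A^T).

    A is the (sequences x k-mers) 0/1 matrix, stored column-wise here:
    each k-mer maps to the indices of the sequences that contain it.
    Both the layout and the threshold are illustrative assumptions.
    """
    columns = defaultdict(list)
    for idx, seq in enumerate(seqs):
        for kmer in kmers(seq, k):
            columns[kmer].append(idx)

    # Each column contributes 1 to every pair of sequences it connects.
    shared = defaultdict(int)  # (i, j) with i < j -> shared k-mer count
    for members in columns.values():
        for i, j in combinations(sorted(members), 2):
            shared[(i, j)] += 1

    # Keep only pairs with enough shared k-mers to be worth aligning.
    return {pair: n for pair, n in shared.items() if n >= min_shared}


# Toy input: the first two sequences differ by one residue.
seqs = ["MKVLAAGIV", "MKVLAAGLV", "PPGTWQRST"]
pairs = candidate_pairs(seqs, k=3, min_shared=2)
print(pairs)
```

Only the pairs that survive the threshold would go on to a full pairwise alignment, which is the step that keeps the expensive work proportional to the number of plausible matches rather than to all pairs.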
