Locality Sensitive Hashing-based Sequence Alignment Using Deep Bidirectional LSTM Models

Paper Title

Locality Sensitive Hashing-based Sequence Alignment Using Deep Bidirectional LSTM Models

Author

Tavakoli, Neda

Abstract

Bidirectional Long Short-Term Memory (LSTM) is a Recurrent Neural Network (RNN) architecture designed to model sequences and their long-range dependencies more precisely than standard RNNs. This paper proposes using deep bidirectional LSTMs for sequence modeling as an approach to locality-sensitive hashing (LSH)-based sequence alignment. In particular, we use a deep bidirectional LSTM to learn the features of the LSH. The resulting LSH can then be used to perform sequence alignment. We demonstrate the feasibility of modeling sequences with the proposed LSTM-based model by aligning short-read queries against the reference genome. We use the human reference genome as our training dataset, together with a set of short reads generated using Illumina sequencing technology. The ultimate goal is to align query sequences to a reference genome. We first decompose the reference genome into multiple sequences. These sequences are fed into the bidirectional LSTM model and mapped to fixed-length vectors. These vectors are what we call the trained LSH, which can then be used for sequence alignment. The case study shows that, with the introduced LSTM-based model, accuracy improves as the number of training epochs increases.
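
The pipeline described in the abstract (split the reference into sequences, map each one to a fixed-length vector, then align a read by comparing vectors) can be illustrated with a small sketch. This is not the paper's method: in place of the learned bidirectional LSTM encoder it uses classical MinHash over k-mers, which is a standard hand-crafted LSH. All function names, the window length, the stride, the k-mer size, and the signature length below are assumptions chosen for illustration only.

```python
import hashlib
import random

K = 4            # k-mer length (assumed)
NUM_HASHES = 32  # length of the fixed-length signature (assumed)
WINDOW = 20      # length of each reference window (assumed)
STEP = 5         # stride between consecutive windows (assumed)


def kmers(seq, k=K):
    """Return the set of overlapping k-mers contained in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def seeded_hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; the seed selects the hash function."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def signature(seq):
    """MinHash signature: one minimum hash value per seeded hash function.

    This stands in for the learned encoder: it maps a sequence to a
    fixed-length vector such that similar sequences get similar vectors.
    """
    ks = kmers(seq)
    return [min(seeded_hash(km, seed) for km in ks) for seed in range(NUM_HASHES)]


def index_reference(reference):
    """Decompose the reference into overlapping windows and sign each one."""
    return [
        (start, signature(reference[start:start + WINDOW]))
        for start in range(0, len(reference) - WINDOW + 1, STEP)
    ]


def similarity(sig_a, sig_b):
    """Fraction of matching signature positions (estimates Jaccard similarity)."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def align(read, index):
    """Return the start of the reference window whose signature best matches the read."""
    read_sig = signature(read)
    return max(index, key=lambda entry: similarity(read_sig, entry[1]))[0]


# Toy usage: a random 100-base "reference" and a read copied from position 10.
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(100))
index = index_reference(reference)
print(align(reference[10:30], index))
```

In the approach the abstract describes, the `signature` step would instead be a trained deep bidirectional LSTM that outputs the fixed-length vector; the indexing and lookup steps around it would keep the same overall shape.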
