Paper title

New mixture models for decoy-free false discovery rate estimation in mass-spectrometry proteomics

Authors

Yisu Peng, Shantanu Jain, Yong Fuga Li, Michal Gregus, Alexander R. Ivanov, Olga Vitek, Predrag Radivojac

Abstract

Motivation: Accurate estimation of the false discovery rate (FDR) of spectral identification is a central problem in mass spectrometry-based proteomics. Over the past two decades, target-decoy approaches (TDAs) and decoy-free approaches (DFAs) have been widely used to estimate FDR. TDAs use a database of decoy species to faithfully model score distributions of incorrect peptide-spectrum matches (PSMs). DFAs, on the other hand, fit two-component mixture models to learn the parameters of correct and incorrect PSM score distributions. While conceptually straightforward, both approaches lead to problems in practice, particularly in experiments that push instrumentation to the limit and generate spectra with low fragmentation efficiency and low signal-to-noise ratio.

Results: We introduce a new decoy-free framework for FDR estimation that generalizes present DFAs while exploiting more search data in a manner similar to TDAs. Our approach relies on multi-component mixtures, in which score distributions corresponding to the correct PSMs, best incorrect PSMs, and second-best incorrect PSMs are modeled by the skew normal family. We derive EM algorithms to estimate parameters of these distributions from the scores of best and second-best PSMs associated with each experimental spectrum. We evaluate our models on multiple proteomics datasets and a HeLa cell digest case study consisting of more than a million spectra in total. We provide evidence of improved performance over existing DFAs and improved stability and speed over TDAs without any performance degradation. We propose that the new strategy has the potential to extend beyond peptide identification and reduce the need for TDA on all analytical platforms.
