Paper title

Insights into performance evaluation of compound-protein interaction prediction methods

Authors

Yaseen, Adiba; Amin, Imran; Akhter, Naeem; Ben-Hur, Asa; Minhas, Fayyaz

Abstract

Motivation: Machine learning based prediction of compound-protein interactions (CPIs) is important for drug design, screening and repurposing studies, and can improve the efficiency and cost-effectiveness of wet lab assays. Although many research papers reporting CPI predictors have been published in recent years, we have observed a number of fundamental issues in experiment design that lead to over-optimistic estimates of model performance.

Results: In this paper, we analyze the impact of several important factors that affect the generalization performance of CPI predictors and are overlooked in existing work: (1) similarity between training and test examples in cross-validation; (2) the strategy for generating negative examples in the absence of experimentally verified negative examples; and (3) the choice of evaluation protocols and performance metrics and their alignment with the real-world use of CPI predictors in screening large compound libraries. Using both an existing state-of-the-art method (CPI-NN) and a proposed kernel-based approach, we have found that assessing the predictive performance of CPI predictors requires careful control over the similarity between training and test examples. We also show that random pairing for generating synthetic negative examples for training and performance evaluation results in models with better generalization performance than the more sophisticated strategies used in existing studies. Furthermore, we have found that our kernel-based approach, despite its simple design, exceeds the prediction performance of CPI-NN. We have used the proposed model for compound screening of several proteins, including SARS-CoV-2 Spike and Human ACE2 proteins, and found strong evidence in support of its top hits.

Availability: Code and raw experimental results are available at https://github.com/adibayaseen/HKRCPI

Contact: [email protected]
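The random pairing strategy for synthetic negatives mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); the function name `random_negative_pairs`, the tuple-based data layout, and the example identifiers are assumptions made for illustration. The sketch treats every compound-protein pair not listed as a known positive as a candidate negative, which may include true interactions that have not yet been observed.

```python
import random


def random_negative_pairs(positives, n_negatives, seed=0):
    """Sample (compound, protein) pairs that are not known positives.

    Hypothetical helper: `positives` is a list of (compound_id, protein_id)
    tuples. Unlabeled pairs are assumed negative, which is only an
    approximation because some may be undiscovered interactions.
    """
    rng = random.Random(seed)
    known = set(positives)
    compounds = sorted({c for c, _ in positives})
    proteins = sorted({p for _, p in positives})

    # Guard against asking for more negatives than unlabeled pairs exist.
    max_possible = len(compounds) * len(proteins) - len(known)
    if n_negatives > max_possible:
        raise ValueError("not enough unlabeled pairs to sample from")

    negatives = set()
    while len(negatives) < n_negatives:
        pair = (rng.choice(compounds), rng.choice(proteins))
        if pair not in known:  # reject pairs already labeled positive
            negatives.add(pair)
    return sorted(negatives)


# Toy usage with made-up identifiers.
positives = [("aspirin", "P1"), ("ibuprofen", "P2"), ("caffeine", "P3")]
negatives = random_negative_pairs(positives, n_negatives=3)
```

Rejection sampling keeps the sketch short; for large datasets where most pairs are unlabeled it rarely rejects, while the guard prevents an infinite loop when too many negatives are requested.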
