Paper Title
Can Pre-trained Models Really Learn Better Molecular Representations for AI-aided Drug Discovery?
Paper Authors
Paper Abstract
Self-supervised pre-training is gaining increasing popularity in AI-aided drug discovery, leading to more and more pre-trained models with the promise that they can extract better feature representations for molecules. Yet, the quality of the learned representations has not been fully explored. In this work, inspired by the two phenomena of Activity Cliffs (ACs) and Scaffold Hopping (SH) in traditional Quantitative Structure-Activity Relationship (QSAR) analysis, we propose a method named Representation-Property Relationship Analysis (RePRA) to evaluate the quality of the representations extracted by a pre-trained model and to visualize the relationship between the representations and properties. The concepts of ACs and SH are generalized from the structure-activity context to the representation-property context, and the underlying principles of RePRA are analyzed theoretically. Two scores are designed to measure the generalized ACs and SH detected by RePRA, so that the quality of representations can be evaluated. In experiments, representations of molecules from 10 target tasks generated by 7 pre-trained models are analyzed. The results indicate that the state-of-the-art pre-trained models can overcome some shortcomings of canonical Extended-Connectivity FingerPrints (ECFP), while the correlation between the basis of the representation space and specific molecular substructures is not explicit. Thus, some representations could be even worse than the canonical fingerprints. Our method enables researchers to evaluate the quality of molecular representations generated by their proposed self-supervised pre-trained models. Our findings can guide the community to develop better pre-training techniques to regularize the occurrence of ACs and SH.
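
To make the generalized AC/SH idea from the abstract concrete, below is a minimal sketch of how molecule pairs could be flagged in a representation-property context: pairs whose representations are close but whose properties differ sharply (generalized ACs), and pairs whose representations are far apart but whose properties are similar (generalized SH). The function name `flag_ac_sh_pairs` and the quantile thresholds are illustrative assumptions for this sketch only; they are not the two RePRA scores defined in the paper.

```python
# A minimal sketch (not the authors' RePRA scores) of flagging generalized
# Activity Cliffs (ACs) and Scaffold Hopping (SH) pairs in a
# representation-property context. Thresholds and names are assumptions.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def flag_ac_sh_pairs(representations, properties,
                     rep_close=0.1, rep_far=0.9,
                     prop_close=0.1, prop_far=0.9):
    """Return (ac_pairs, sh_pairs), each an array of molecule index pairs.

    - Generalized AC: representations are close, but properties differ a lot.
    - Generalized SH: representations are far apart, but properties are similar.

    Closeness is judged against quantile thresholds over all pairs
    (an assumed heuristic; the paper's scores are defined differently).
    """
    reps = np.asarray(representations, dtype=float)
    props = np.asarray(properties, dtype=float)

    # Pairwise distances in representation space and pairwise property gaps.
    rep_dist = squareform(pdist(reps))
    prop_diff = np.abs(np.subtract.outer(props, props))

    # Consider each unordered pair of molecules once.
    iu = np.triu_indices(len(props), k=1)
    d, p = rep_dist[iu], prop_diff[iu]

    ac_mask = (d <= np.quantile(d, rep_close)) & (p >= np.quantile(p, prop_far))
    sh_mask = (d >= np.quantile(d, rep_far)) & (p <= np.quantile(p, prop_close))

    pairs = np.column_stack(iu)
    return pairs[ac_mask], pairs[sh_mask]
```

In this hedged view, a representation that produces many generalized AC pairs maps dissimilar property values onto nearby points, while many generalized SH pairs suggest redundant spread in the representation space; RePRA's actual scores quantify these effects as described in the paper.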