Paper Title

A Systematic Survey of Chemical Pre-trained Models

Paper Authors

Jun Xia, Yanqiao Zhu, Yuanqi Du, Stan Z. Li

Paper Abstract

Deep learning has achieved remarkable success in learning representations for molecules, which is crucial for various biochemical applications ranging from property prediction to drug design. However, training Deep Neural Networks (DNNs) from scratch often requires abundant labeled molecules, which are expensive to acquire in the real world. To alleviate this issue, tremendous efforts have been devoted to Chemical Pre-trained Models (CPMs), where DNNs are pre-trained using large-scale unlabeled molecular databases and then fine-tuned on specific downstream tasks. Despite this rapid progress, a systematic review of this fast-growing field is still lacking. In this paper, we present the first survey that summarizes the current progress of CPMs. We first highlight the limitations of training molecular representation models from scratch to motivate CPM studies. Next, we systematically review recent advances on this topic from several key perspectives, including molecular descriptors, encoder architectures, pre-training strategies, and applications. We also discuss the challenges and promising avenues for future research, providing a useful resource for both the machine learning and scientific communities.
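
To make the pre-train/fine-tune paradigm described in the abstract concrete, the following is a minimal sketch in PyTorch. It is not code from the paper or from any surveyed model: the toy AtomEncoder, the masked-atom pre-training objective, and the random tensors standing in for molecular data are all illustrative assumptions; real CPMs typically use graph neural network or Transformer encoders over full molecular graphs and pre-train on millions of unlabeled molecules.

import torch
import torch.nn as nn

NUM_ATOM_TYPES = 119   # assumption: atomic numbers as the token vocabulary, index 0 reserved for [MASK]
EMB_DIM = 64

class AtomEncoder(nn.Module):
    """Toy stand-in for a molecular encoder (e.g., a GNN or Transformer)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_ATOM_TYPES, EMB_DIM)
        self.mlp = nn.Sequential(nn.Linear(EMB_DIM, EMB_DIM), nn.ReLU())

    def forward(self, atom_ids):               # atom_ids: (batch, num_atoms)
        return self.mlp(self.embed(atom_ids))  # (batch, num_atoms, EMB_DIM)

encoder = AtomEncoder()

# --- Pre-training: predict masked atom types from "unlabeled" molecules ---
mask_head = nn.Linear(EMB_DIM, NUM_ATOM_TYPES)
optim_pt = torch.optim.Adam(list(encoder.parameters()) + list(mask_head.parameters()))
atom_ids = torch.randint(1, NUM_ATOM_TYPES, (32, 20))   # fake unlabeled batch of 32 "molecules"
mask = torch.rand(atom_ids.shape) < 0.15                 # mask ~15% of atoms
masked_ids = atom_ids.clone()
masked_ids[mask] = 0                                     # 0 acts as the [MASK] token
logits = mask_head(encoder(masked_ids))
pt_loss = nn.functional.cross_entropy(logits[mask], atom_ids[mask])
optim_pt.zero_grad()
pt_loss.backward()
optim_pt.step()

# --- Fine-tuning: reuse the pre-trained encoder for property prediction ---
prop_head = nn.Linear(EMB_DIM, 1)
optim_ft = torch.optim.Adam(list(encoder.parameters()) + list(prop_head.parameters()), lr=1e-4)
labels = torch.rand(32, 1)                               # fake labeled property values
pred = prop_head(encoder(atom_ids).mean(dim=1))          # mean-pool atom embeddings per molecule
ft_loss = nn.functional.mse_loss(pred, labels)
optim_ft.zero_grad()
ft_loss.backward()
optim_ft.step()

The point of the sketch is the reuse of parameters: the same encoder is first trained with a self-supervised objective on unlabeled molecules and then updated, together with a small task-specific head, on a much smaller labeled dataset.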
