Molecule Identification with Rotational Spectroscopy and Probabilistic Deep Learning

Paper Title

Molecule Identification with Rotational Spectroscopy and Probabilistic Deep Learning

Authors

McCarthy, Michael C.; Lee, Kin Long Kelvin

Abstract

We present a proof-of-concept framework for identifying molecules of unknown elemental composition and structure using experimental rotational data and probabilistic deep learning. Using a minimal set of experimentally determined input data, we describe four neural network architectures that yield information to assist in identifying an unknown molecule. The first architecture translates spectroscopic parameters into Coulomb matrix eigenspectra, as a way of recovering the chemical and structural information encoded in the rotational spectrum. Three further deep learning networks then use the eigenspectrum to constrain the range of stoichiometries, generate SMILES strings, and predict the most likely functional groups present in the molecule. In each model, we use dropout layers as an approximation to Bayesian sampling, which yields probabilistic predictions from otherwise deterministic models. The models are trained on a modestly sized theoretical dataset of ${\sim}$83,000 unique organic molecules (between 18 and 180 amu) optimized at the $\omega$B97X-D/6-31+G(d) level of theory, where the theoretical uncertainty of the spectroscopic constants is well understood and is used to further augment training. Because chemical and structural properties depend strongly on molecular composition, we divided the dataset into four groups: pure hydrocarbons, oxygen-bearing species, nitrogen-bearing species, and species bearing both oxygen and nitrogen. Each type of network is trained on one of these categories, creating "experts" within each domain of molecules. We demonstrate how these models can be used for practical inference on four molecules, and discuss the strengths and shortcomings of our approach and the future directions these architectures can take.
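Two ingredients named in the abstract can be illustrated with a short sketch. The code below is not the authors' implementation. It assumes the standard Coulomb matrix definition (diagonal $0.5\,Z_i^{2.4}$, off-diagonal $Z_i Z_j / |R_i - R_j|$), an approximate water geometry in bohr, and a toy two-layer network with random weights standing in for a trained model. The function names `coulomb_eigenspectrum` and `mc_dropout_predict`, the padding length, the layer sizes, and the dropout rate are all hypothetical choices made for illustration.

```python
import numpy as np


def coulomb_eigenspectrum(charges, coords, size):
    """Coulomb-matrix eigenvalues sorted by magnitude, zero-padded to `size`.

    Assumes `size` is at least the number of atoms.
    """
    z = np.asarray(charges, dtype=float)
    r = np.asarray(coords, dtype=float)
    # Pairwise interatomic distances.
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)  # placeholder; the diagonal is overwritten below
    m = np.outer(z, z) / dist
    np.fill_diagonal(m, 0.5 * z ** 2.4)
    eig = np.linalg.eigvalsh(m)  # the matrix is real and symmetric
    eig = eig[np.argsort(-np.abs(eig))]
    padded = np.zeros(size)
    padded[: eig.size] = eig
    return padded


def mc_dropout_predict(x, w1, w2, rate, n_samples, rng):
    """Monte Carlo dropout: keep dropout active at inference and sample."""
    draws = []
    for _ in range(n_samples):
        hidden = np.maximum(x @ w1, 0.0)  # ReLU hidden layer
        mask = rng.random(hidden.shape) >= rate  # drop units with probability `rate`
        hidden = hidden * mask / (1.0 - rate)  # inverted-dropout rescaling
        draws.append(hidden @ w2)
    draws = np.stack(draws)
    # The sample mean is the prediction; the spread is an uncertainty estimate.
    return draws.mean(axis=0), draws.std(axis=0)


# Approximate water geometry in bohr (illustrative values, not from the paper).
charges = [8, 1, 1]
coords = [[0.0, 0.0, 0.0], [0.0, 1.43, 1.11], [0.0, -1.43, 1.11]]
spectrum = coulomb_eigenspectrum(charges, coords, size=5)

# Random weights stand in for a trained network (assumption).
rng = np.random.default_rng(0)
w1 = rng.normal(size=(5, 16))
w2 = rng.normal(size=(16, 3))
mean, std = mc_dropout_predict(spectrum, w1, w2, rate=0.2, n_samples=200, rng=rng)
```

Each forward pass draws a different dropout mask, so repeated passes through the same weights give a distribution of outputs rather than a single value. This is the sense in which dropout turns a deterministic model into a probabilistic one.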
