Paper title
Information-theoretical measures identify accurate low-resolution representations of protein configurational space
Paper authors
Paper abstract
Steadily growing computational power is being used to perform molecular dynamics simulations of biological macromolecules, which represents at once an immense opportunity and a formidable challenge. Large amounts of data are produced, from which useful, synthetic, and intelligible information has to be extracted to make the crucial step from knowing to understanding. Here we tackled the problem of coarsening the conformational space sampled by proteins in the course of molecular dynamics simulations. We applied different schemes to cluster the frames of a dataset of protein simulations; we then employed an information-theoretical framework, based on the notions of resolution and relevance, to gauge how well the various clustering methods accomplish this simplification of the configurational space. Our approach allowed us to identify the level of resolution that optimally balances simplicity and informativeness; furthermore, we found that the most physically accurate clustering procedures are those that induce an ultrametric structure of the low-resolution space, consistent with the hypothesis that the protein conformational landscape has a self-similar organisation. The proposed strategy is general and its applicability extends beyond computational biophysics, making it a valuable tool for extracting useful information from large datasets.
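The abstract does not spell out how resolution and relevance are computed. As a minimal sketch, assuming the definitions commonly used in the resolution-relevance literature, both can be obtained from the cluster labels alone: resolution as the entropy of the cluster populations, H[s] = -Σ_s (k_s/N) ln(k_s/N), and relevance as the entropy of the cluster-size distribution, H[k] = -Σ_k (k m_k/N) ln(k m_k/N), where k_s is the number of frames in cluster s and m_k is the number of clusters containing exactly k frames. The function name `resolution_relevance` and the toy labels are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def resolution_relevance(labels):
    """Return (resolution, relevance) in nats for one clustering of N frames.

    Assumed definitions (not stated in the abstract):
      resolution H[s] = -sum_s (k_s / N) * ln(k_s / N)
      relevance  H[k] = -sum_k (k * m_k / N) * ln(k * m_k / N)
    """
    n_frames = len(labels)
    if n_frames == 0:
        raise ValueError("labels must not be empty")

    # k_s: number of frames assigned to each cluster s
    cluster_sizes = Counter(labels)
    # m_k: number of clusters that contain exactly k frames
    size_multiplicity = Counter(cluster_sizes.values())

    resolution = -sum(
        (k / n_frames) * math.log(k / n_frames)
        for k in cluster_sizes.values()
    )
    relevance = -sum(
        (k * m / n_frames) * math.log(k * m / n_frames)
        for k, m in size_multiplicity.items()
    )
    return resolution, relevance


if __name__ == "__main__":
    # Hypothetical cluster assignment for six simulation frames.
    toy_labels = [0, 0, 0, 1, 1, 2]
    h_s, h_k = resolution_relevance(toy_labels)
    print(f"resolution H[s] = {h_s:.3f} nats, relevance H[k] = {h_k:.3f} nats")
```

Evaluating this pair of values for clusterings obtained at different numbers of clusters traces a curve in the resolution-relevance plane, which is one way to compare clustering schemes at matched resolution. Under these definitions, relevance vanishes both when every frame is its own cluster and when all clusters have the same size.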