Deep generative models in DataSHIELD

Paper title

Deep generative models in DataSHIELD

Authors

Stefan Lenz, Harald Binder

Abstract

The best way to calculate statistics from medical data is to use the data of individual patients. In some settings, this data is difficult to obtain due to privacy restrictions. In Germany, for example, it is not possible to pool routine data from different hospitals for research purposes without the consent of the patients. The DataSHIELD software provides an infrastructure and a set of statistical methods for joint analyses of distributed data. The contained algorithms are reformulated to work with aggregated data from the participating sites instead of the individual data. If a desired algorithm is not implemented in DataSHIELD or cannot be reformulated in such a way, using artificial data is an alternative. We present a methodology together with a software implementation that builds on DataSHIELD to create artificial data that preserve complex patterns from distributed individual patient data. Such data sets of artificial patients, which are not linked to real patients, can then be used for joint analyses. We use deep Boltzmann machines (DBMs) as generative models for capturing the distribution of data. For the implementation, we employ the package "BoltzmannMachines" from the Julia programming language and wrap it for use with DataSHIELD, which is based on R. As an exemplary application, we conduct a distributed analysis with DBMs on a synthetic data set, which simulates genetic variant data. Patterns from the original data can be recovered in the artificial data using hierarchical clustering of the virtual patients, demonstrating the feasibility of the approach. Our implementation adds to DataSHIELD the ability to generate artificial data that can be used for various analyses, e.g., for pattern recognition with deep learning. This also demonstrates more generally how DataSHIELD can be flexibly extended with advanced algorithms from languages other than R.
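
To make the idea of working with aggregated data instead of individual data concrete, the following is a minimal sketch in Python. It is not DataSHIELD code and does not use the DataSHIELD API. The names `SiteSummary`, `summarize`, and `pooled_mean_and_variance` are hypothetical and chosen only for illustration. Each site shares just a count, a sum, and a sum of squares, and the pooled mean and sample variance are computed from those aggregates alone.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class SiteSummary:
    """Aggregates a site shares (hypothetical structure, not a DataSHIELD type)."""
    n: int            # number of patients at the site
    total: float      # sum of the measured values
    total_sq: float   # sum of the squared values


def summarize(values: Sequence[float]) -> SiteSummary:
    """Runs locally at a site; only the three aggregates leave the site."""
    return SiteSummary(
        n=len(values),
        total=sum(values),
        total_sq=sum(v * v for v in values),
    )


def pooled_mean_and_variance(summaries: List[SiteSummary]) -> Tuple[float, float]:
    """Combine site-level aggregates into the pooled mean and sample variance."""
    n = sum(s.n for s in summaries)
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    # Sample variance via the identity sum((x - mean)^2) = sum(x^2) - n * mean^2.
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


if __name__ == "__main__":
    site_a = summarize([1.0, 2.0, 3.0])   # individual values stay at site A
    site_b = summarize([4.0, 5.0])        # individual values stay at site B
    print(pooled_mean_and_variance([site_a, site_b]))
```

Simple statistics such as these decompose cleanly into per-site aggregates. An algorithm that cannot be reformulated in this way is the case in which, as the abstract notes, artificial data generated by a model such as a DBM becomes an alternative.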
