Paper Title

A set of efficient methods to generate high-dimensional binary data with specified correlation structures

Authors

Wei Jiang, Shuang Song, Lin Hou, Hongyu Zhao

Abstract

High-dimensional correlated binary data arise in many areas, such as the genetic variations observed in biomedical research. Data simulation helps researchers evaluate the efficiency and explore the properties of different computational and statistical methods, and some statistical methods, such as Monte Carlo methods, rely on it directly. Lunn and Davies (1998) proposed linear-time methods to generate correlated binary variables with three common correlation structures, but their methods cannot accommodate unequal probabilities. In this manuscript, we introduce several computationally efficient algorithms that generate high-dimensional binary data with specified correlation structures and unequal probabilities. Our algorithms have linear time complexity with respect to the dimension for three commonly studied correlation structures: exchangeable, decaying-product, and K-dependent. In addition, we extend our algorithms to generate binary data with general non-negative correlation matrices in quadratic time. We provide an R package, CorBin, that implements our simulation methods. Compared with existing packages for binary data generation, the time to generate a 100-dimensional binary vector can be reduced by up to $10^5$-fold for the common correlation structures and by up to $10^3$-fold for general correlation matrices, and the efficiency gain grows further as the dimension increases. The R package CorBin is available on CRAN at https://cran.r-project.org/.
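
To give a feel for why linear time is attainable in the exchangeable case, the following is a minimal Python sketch of the equal-probability mixture construction in the style of Lunn and Davies (1998). Each coordinate copies one shared Bernoulli(p) draw with probability sqrt(rho) and otherwise takes its own independent Bernoulli(p) draw, so any two coordinates share the common draw with probability rho and their correlation is rho. This is not the CorBin implementation and it does not handle unequal probabilities; the function names `exchangeable_binary` and `sample_correlation` are hypothetical and chosen only for this illustration.

```python
import math
import random


def exchangeable_binary(dim, p, rho, rng=random):
    """Draw one binary vector with common mean p and pairwise correlation rho.

    Illustrative equal-probability construction only (an assumption of this
    sketch, not the algorithm shipped in CorBin).
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= rho <= 1.0):
        raise ValueError("p and rho must lie in [0, 1]")
    copy_prob = math.sqrt(rho)        # chance that a coordinate copies the shared draw
    shared = int(rng.random() < p)    # one shared Bernoulli(p) draw per vector
    out = []
    for _ in range(dim):              # single pass over coordinates: O(dim)
        if rng.random() < copy_prob:
            out.append(shared)
        else:
            out.append(int(rng.random() < p))  # independent Bernoulli(p) draw
    return out


def sample_correlation(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    rng = random.Random(2024)
    draws = [exchangeable_binary(5, p=0.3, rho=0.25, rng=rng) for _ in range(20000)]
    first = [d[0] for d in draws]
    second = [d[1] for d in draws]
    print("empirical mean:", sum(first) / len(first))
    print("empirical correlation:", sample_correlation(first, second))
```

Because each coordinate needs only a constant number of uniform draws, the cost of one vector grows linearly with the dimension. Supporting unequal marginal probabilities, which is the contribution described above, requires a different construction that this sketch does not attempt.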
