Flexible Basis Representations for Modeling Large Non-Gaussian Spatial Data

Paper title

Flexible Basis Representations for Modeling Large Non-Gaussian Spatial Data

Authors

MacDonald, Remy; Lee, Benjamin Seiyon

Abstract

Nonstationary and non-Gaussian spatial data are common in many fields, including ecology (e.g., counts of animal species), epidemiology (e.g., disease incidence counts in susceptible regions), and environmental science (e.g., remotely sensed satellite imagery). Owing to modern data collection methods, the size of these datasets has grown considerably. Spatial generalized linear mixed models (SGLMMs) are a flexible class of models for nonstationary and non-Gaussian datasets. Despite their utility, SGLMMs can be computationally prohibitive for even moderately large datasets (e.g., 5,000 to 100,000 observed locations). To circumvent this issue, past studies have embedded nested radial basis functions into the SGLMM. However, two crucial specifications that directly affect model performance, knot placement and bandwidth parameters, are typically fixed before model fitting. We propose a novel approach to modeling large nonstationary and non-Gaussian spatial datasets using adaptive radial basis functions. Our approach: (1) partitions the spatial domain into subregions; (2) employs reversible-jump Markov chain Monte Carlo (RJMCMC) to infer the number and locations of the knots within each partition; and (3) models the latent spatial surface using partition-varying and adaptive basis functions. Through an extensive simulation study, we show that our approach provides more accurate predictions than competing methods while preserving computational efficiency. We demonstrate our approach on two environmental datasets: incidences of plant species and counts of bird species in the United States.
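To illustrate why knot placement and bandwidth matter, the following is a minimal sketch of a radial basis representation of a latent spatial surface. It is not the authors' implementation: the Gaussian kernel form, the function names `gaussian_rbf_basis` and `latent_surface`, and the toy locations, knots, and weights are all assumptions made for illustration, and the partitioning and RJMCMC steps are omitted entirely.

```python
import math


def gaussian_rbf_basis(locations, knots, bandwidth):
    """Build the n x k basis matrix for n locations and k knots.

    Assumed kernel (not taken from the paper):
        phi_j(s) = exp(-||s - u_j||^2 / bandwidth^2)
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    basis = []
    for s in locations:
        row = []
        for u in knots:
            # Squared Euclidean distance between location s and knot u.
            sq_dist = sum((a - b) ** 2 for a, b in zip(s, u))
            row.append(math.exp(-sq_dist / bandwidth ** 2))
        basis.append(row)
    return basis


def latent_surface(basis, weights):
    """Evaluate the surface as a weighted sum of basis functions at each location."""
    return [sum(p * w for p, w in zip(row, weights)) for row in basis]


# Hypothetical toy inputs: three locations and two knots on the unit square.
locations = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
knots = [(0.0, 0.0), (1.0, 1.0)]
weights = [1.0, -1.0]

# The same knots with two different bandwidths give very different bases.
narrow = gaussian_rbf_basis(locations, knots, bandwidth=0.25)
wide = gaussian_rbf_basis(locations, knots, bandwidth=2.0)
surface = latent_surface(wide, weights)
```

With the narrow bandwidth, each basis function is effectively zero away from its own knot, so the surface can only vary locally; with the wide bandwidth, both knots influence every location. Fixing these choices before fitting therefore constrains what the model can represent, which is the limitation the abstract says the adaptive approach is meant to address.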
