Paper Title

Balanced Subsampling for Big Data with Categorical Covariates

Author

Wang, Lin

Abstract

Supervised learning under measurement constraints is a common challenge in statistical and machine learning. In many applications, despite extensive design points, acquiring responses for all points is often impractical due to resource limitations. Subsampling algorithms offer a solution by selecting a subset from the design points for observing the response. Existing subsampling methods primarily assume numerical predictors, neglecting the prevalent occurrence of big data with categorical predictors across various disciplines. This paper proposes a novel balanced subsampling approach tailored for data with categorical predictors. A balanced subsample significantly reduces the cost of observing the response and possesses three desired merits. First, it is nonsingular and, therefore, allows linear regression with all dummy variables encoded from categorical predictors. Second, it offers optimal parameter estimation by minimizing the generalized variance of the estimated parameters. Third, it allows robust prediction in the sense of minimizing the worst-case prediction error. We demonstrate the superiority of balanced subsampling over existing methods through extensive simulation studies and a real-world application.
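To make "balanced" and "nonsingular" concrete, here is a minimal Python sketch. It is not the paper's algorithm, which the abstract does not spell out. It is a simple greedy heuristic that keeps the level counts of each categorical covariate as even as possible, followed by a rank check on the dummy-coded design matrix. The function names `greedy_balanced_subsample` and `dummy_design`, the toy data, and the scoring rule are all assumptions made for illustration.

```python
import numpy as np

def greedy_balanced_subsample(X, n, levels, seed=0):
    """Illustrative heuristic (an assumption, not the paper's method):
    pick n distinct rows of the integer-coded matrix X so that, within each
    covariate, the level counts in the subsample stay as even as possible."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    counts = [np.zeros(q, dtype=int) for q in levels]
    available = np.ones(N, dtype=bool)
    chosen = []
    for _ in range(n):
        # Score of a row: how often its levels already occur in the subsample.
        score = np.zeros(N)
        for j in range(p):
            score += counts[j][X[:, j]]
        score[~available] = np.inf
        # Take a row whose levels are least represented; break ties at random.
        best = np.flatnonzero(score == score.min())
        i = int(rng.choice(best))
        chosen.append(i)
        available[i] = False
        for j in range(p):
            counts[j][X[i, j]] += 1
    return np.array(chosen)

def dummy_design(X, levels):
    """Intercept plus dummy variables, dropping the first level of each covariate."""
    cols = [np.ones((X.shape[0], 1))]
    for j, q in enumerate(levels):
        cols.append(np.eye(q)[X[:, j]][:, 1:])
    return np.hstack(cols)

# Toy full data: two covariates with 3 and 4 levels, heavily skewed towards
# the combination (0, 0). Every combination occurs at least 30 times.
levels = [3, 4]
grid = np.array([(a, b) for a in range(3) for b in range(4)])
reps = np.where((grid[:, 0] == 0) & (grid[:, 1] == 0), 2000, 30)
X = np.repeat(grid, reps, axis=0)

idx = greedy_balanced_subsample(X, n=24, levels=levels)
D = dummy_design(X[idx], levels)

print("level counts, covariate 1:", np.bincount(X[idx, 0], minlength=3))
print("level counts, covariate 2:", np.bincount(X[idx, 1], minlength=4))
print("full column rank:", np.linalg.matrix_rank(D) == D.shape[1])
```

In this toy setting every level combination stays available throughout, so the greedy rule ends with equal level counts for each covariate, and the dummy-coded design matrix has full column rank. A simple random sample of the same size would consist mostly of the dominant combination and could miss rare levels entirely, which is what makes the design matrix singular. The greedy score looks only at one covariate at a time, so it says nothing about the optimality properties claimed in the abstract.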
