Paper Title
Using Machine Learning to Evaluate Real Estate Prices Using Location Big Data
Paper Authors
Paper Abstract
With so many people trying to enter the real estate market, knowing the proper valuation of residential and commercial properties has become crucial. Past research has predicted property prices from static real estate data (e.g., number of bedrooms, number of bathrooms, square footage), sometimes combined with demographic information. In this investigation, we attempted to improve on that work by determining whether mobile location data can increase the predictive power of popular regression and tree-based models. To prepare the data, we attached the mobility data to individual properties in the real estate data by aggregating the users observed within 500 meters of each property for each day of the week. We removed people who lived within 500 meters of each property, so each property's aggregated mobility data contained only non-resident census features. On top of these dynamic census features, we also included static census features: the number of people in the area, the average proportion of people commuting, and the number of residents in the area. Finally, we tested multiple models for predicting real estate prices. Our proposed model consists of two random forest modules stacked under a ridge regression that uses the random forest outputs as predictors. The first random forest uses static features only, and the second uses dynamic features only. Comparing the model with and without the dynamic mobile location features shows that the version with those features achieves a 3% lower mean squared error than the same model without them.
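The mobility aggregation described in the abstract can be illustrated with a minimal sketch. It assumes each mobility record is a `(user_id, lat, lon, date)` tuple, that a home location is known for some users, and that "aggregating users" means counting distinct users per weekday. The abstract specifies none of these details, so the record layout, the names `haversine_m` and `weekday_visitor_counts`, and the sample coordinates are all hypothetical.

```python
import math
from collections import defaultdict
from datetime import date

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters
RADIUS_M = 500.0              # aggregation radius stated in the abstract


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def weekday_visitor_counts(prop, pings, homes, radius_m=RADIUS_M):
    """Count distinct non-resident users near a property for each weekday.

    prop:  (lat, lon) of the property
    pings: iterable of (user_id, lat, lon, date) observations (assumed layout)
    homes: dict mapping user_id -> (lat, lon) home location (assumed input)
    Returns a list of 7 counts, Monday first.
    """
    plat, plon = prop
    # Users whose home lies within the radius are residents and are excluded.
    residents = {
        user for user, (hlat, hlon) in homes.items()
        if haversine_m(plat, plon, hlat, hlon) <= radius_m
    }
    seen = defaultdict(set)  # weekday index -> set of distinct user ids
    for user, lat, lon, day in pings:
        if user in residents:
            continue
        if haversine_m(plat, plon, lat, lon) <= radius_m:
            seen[day.weekday()].add(user)
    return [len(seen[d]) for d in range(7)]


# Hypothetical sample data: 0.001 degrees of latitude is roughly 111 m.
prop = (40.0, -75.0)
homes = {
    "a": (40.1, -75.0),     # lives far away
    "b": (40.0005, -75.0),  # lives about 56 m away, so treated as a resident
    "c": (40.2, -75.0),     # lives far away
}
pings = [
    ("a", 40.001, -75.0, date(2024, 1, 1)),  # Monday, about 111 m away
    ("a", 40.001, -75.0, date(2024, 1, 8)),  # Monday again, same user
    ("b", 40.000, -75.0, date(2024, 1, 1)),  # resident, excluded
    ("c", 40.010, -75.0, date(2024, 1, 2)),  # Tuesday, about 1.1 km away
    ("a", 40.001, -75.0, date(2024, 1, 3)),  # Wednesday
    ("d", 40.002, -75.0, date(2024, 1, 3)),  # Wednesday, home unknown
]

counts = weekday_visitor_counts(prop, pings, homes)
print(counts)  # [1, 0, 2, 0, 0, 0, 0]
```

The seven counts per property would then serve as the dynamic features fed to the second random forest. Counting distinct users rather than raw observations is a design choice made here for illustration; the abstract does not say which was used.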