Paper title
Similarity Classification of Public Transit Stations
Paper authors
Paper abstract
We study the following problem: given two public transit station identifiers A and B, each with a label and a geographic coordinate, decide whether A and B describe the same station. For example, for "St Pancras International" at (51.5306, -0.1253) and "London St Pancras" at (51.5319, -0.1269), the answer would be "Yes". This problem arises frequently wherever public transit data is used, for example in geographic information systems, schedule merging, route planning, or map matching. We consider several baseline methods based on geographic distance and simple string similarity measures. We also experiment with more elaborate string similarity measures and manually created normalization rules. Our experiments show that these baseline methods produce good, but not fully satisfactory, results. We therefore develop an approach based on a random forest classifier that is trained on the matching trigrams between two stations, their distance, and their position on an interwoven grid. All approaches are evaluated on extensive ground truth datasets that we generated from OpenStreetMap (OSM) data: (1) the union of Great Britain and Ireland and (2) the union of Germany, Switzerland, and Austria. On all datasets, our learning-based approach achieves an F1 score of over 99%, while even the most elaborate baseline approach (based on TF-IDF scores and geographic distance) achieves an F1 score of at most 94%, and a naive approach using only a geographic distance threshold achieves an F1 score of only 75%. Both our training and testing datasets are publicly available.
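To make the baseline idea concrete, the following is a minimal sketch of a classifier that combines a geographic distance threshold with a simple trigram-based string similarity. It is an illustration only, not the paper's implementation: the function names, the padding scheme, the use of Jaccard similarity, and both threshold values (250 m and 0.25) are assumptions chosen for this example.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 coordinates."""
    r = 6_371_000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


def trigrams(label):
    """Set of character trigrams of a lowercased, space-padded label.

    The padding (two leading spaces, one trailing) is an assumption.
    """
    padded = f"  {label.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_jaccard(label_a, label_b):
    """Jaccard similarity of the two labels' trigram sets, in [0, 1]."""
    ta, tb = trigrams(label_a), trigrams(label_b)
    return len(ta & tb) / len(ta | tb)


def same_station_baseline(a, b, max_dist_m=250.0, min_sim=0.25):
    """Hypothetical baseline: 'same station' if close AND similarly named.

    a and b are (label, lat, lon) tuples; both thresholds are
    illustrative values, not taken from the paper.
    """
    dist = haversine_m(a[1], a[2], b[1], b[2])
    sim = trigram_jaccard(a[0], b[0])
    return dist <= max_dist_m and sim >= min_sim


if __name__ == "__main__":
    a = ("St Pancras International", 51.5306, -0.1253)
    b = ("London St Pancras", 51.5319, -0.1269)
    print(round(haversine_m(a[1], a[2], b[1], b[2])), "m apart")
    print(round(trigram_jaccard(a[0], b[0]), 3), "trigram similarity")
    print(same_station_baseline(a, b))
```

For the St Pancras example, the two labels share only about a third of their trigrams, so the decision depends heavily on where the similarity threshold is set. Fixed thresholds of this kind are the sort of hand-tuned rule that the learning-based approach is meant to replace.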