Similarity Classification of Public Transit Stations

12/30/2020
by Hannah Bast, et al.

We study the following problem: given two public transit station identifiers A and B, each with a label and a geographic coordinate, decide whether A and B describe the same station. For example, for "St Pancras International" at (51.5306, -0.1253) and "London St Pancras" at (51.5319, -0.1269), the answer would be "Yes". This problem frequently arises in areas where public transit data is used, for example in geographic information systems, schedule merging, route planning, or map matching. We consider several baseline methods based on geographic distance and simple string similarity measures. We also experiment with more elaborate string similarity measures and manually created normalization rules. Our experiments show that these baseline methods produce good, but not fully satisfactory results. We therefore develop an approach based on a random forest classifier which is trained on matching trigrams between two stations, their distance, and their position on an interwoven grid. All approaches are evaluated on extensive ground truth datasets we generated from OpenStreetMap (OSM) data: (1) the union of Great Britain and Ireland and (2) the union of Germany, Switzerland, and Austria. On all datasets, our learning-based approach achieves an F1 score of over 99%, while the most elaborate baseline approach (based on TF-IDF scores and the geographic distance) achieves an F1 score of at most 94%, and a simple geographical distance threshold achieves an F1 score of only 75%. Our training and testing datasets are publicly available.
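
To make the baseline idea concrete, the following is a minimal sketch of a classifier that combines a geographic distance threshold with a simple string similarity on the labels. It is not the authors' implementation: the choice of Jaccard similarity over character trigrams, the function names, and the thresholds `max_dist_m` and `min_sim` are all assumptions made for illustration.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def trigrams(label):
    """Set of character trigrams of the lowercased label."""
    s = label.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_jaccard(a, b):
    """Jaccard similarity of the trigram sets of two labels (in [0, 1])."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def similar(station_a, station_b, max_dist_m=250.0, min_sim=0.25):
    """Hypothetical baseline: 'same station' if close AND labels overlap.

    Both thresholds are illustrative assumptions, not values from the paper.
    Each station is a tuple (label, lat, lon).
    """
    (name_a, lat_a, lon_a), (name_b, lat_b, lon_b) = station_a, station_b
    close = haversine_m(lat_a, lon_a, lat_b, lon_b) <= max_dist_m
    alike = trigram_jaccard(name_a, name_b) >= min_sim
    return close and alike


a = ("St Pancras International", 51.5306, -0.1253)
b = ("London St Pancras", 51.5319, -0.1269)
print(similar(a, b))
```

Fixed thresholds like these are exactly what a learned classifier replaces: a random forest can instead take the matching trigrams, the distance, and a position feature as inputs and learn the decision boundary from labeled station pairs.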


