Zooming in on NYC taxi data with Portal
In this paper we develop a methodology for analyzing transportation data at different levels of temporal and geographic granularity, and apply it to the TLC Trip Record Dataset, made publicly available by the NYC Taxi & Limousine Commission. This data is naturally represented as a set of trajectories, annotated with time and with additional information such as passenger count and cost. We analyze TLC data to identify hotspots, which point to a lack of convenient public transportation options, and popular routes, which motivate ride-sharing solutions or the addition of a bus route. Our methodology is based on a system called Portal, which implements efficient representations and principled analysis methods for evolving graphs. Portal is implemented on top of Apache Spark, a popular distributed data processing system, is interoperable with other Spark libraries such as SparkSQL, and efficiently supports sophisticated analysis of evolving graphs. Portal is currently under development in the Data, Responsibly Lab at Drexel. We plan to release Portal as open source in Fall 2017.
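To make the idea of zooming concrete, the following is a minimal sketch in plain Python, not Portal's API. It assumes each trip is reduced to a pickup time, a pickup coordinate, and a dropoff coordinate, and that locations are snapped to a square grid. The names `cell`, `aggregate`, and `hotspots`, the parameters `cell_deg` and `window_hours`, and the sample coordinates are all hypothetical and chosen only for illustration. Each aggregated entry can be read as an edge of an evolving graph: a source cell, a destination cell, and the time window in which the trips occurred.

```python
from collections import Counter
from datetime import datetime
import math


def cell(lat, lon, cell_deg):
    """Snap a coordinate to a grid cell whose side is cell_deg degrees."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def aggregate(trips, cell_deg, window_hours):
    """Count trips per (time window, pickup cell, dropoff cell).

    Assumes window_hours divides 24, so windows never span two days.
    """
    edges = Counter()
    for pickup_time, (plat, plon), (dlat, dlon) in trips:
        window = (pickup_time.date(), pickup_time.hour // window_hours)
        src = cell(plat, plon, cell_deg)
        dst = cell(dlat, dlon, cell_deg)
        edges[(window, src, dst)] += 1
    return edges


def hotspots(edges, top=3):
    """Rank (time window, pickup cell) pairs by number of departing trips."""
    pickups = Counter()
    for (window, src, _dst), count in edges.items():
        pickups[(window, src)] += count
    return pickups.most_common(top)


# Three made-up trips (illustrative values, not taken from the TLC data).
trips = [
    (datetime(2016, 1, 1, 8, 15), (40.755, -73.985), (40.645, -73.785)),
    (datetime(2016, 1, 1, 8, 45), (40.756, -73.986), (40.645, -73.785)),
    (datetime(2016, 1, 1, 9, 30), (40.715, -73.995), (40.645, -73.785)),
]

# Zoomed in: small cells and one-hour windows.
fine = aggregate(trips, cell_deg=0.01, window_hours=1)
# Zoomed out: larger cells and four-hour windows.
coarse = aggregate(trips, cell_deg=0.1, window_hours=4)

print(hotspots(fine, top=1))
print(hotspots(coarse, top=1))
```

Changing `cell_deg` and `window_hours` changes the granularity without touching the underlying trips: at the coarser setting, nearby pickups that fall in separate cells and hours are merged into a single hotspot. Popular routes can be read off the same aggregated counts by ranking the (pickup cell, dropoff cell) pairs instead of the pickup cells alone.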