Step 0: Obtain data

The data used in this tutorial is taxi data from New York. It contains the pick-up and drop-off locations and times, as well as additional trip information such as the trip distance and the passenger count.

A collection of CSV files with this data can be found on the website of the New York City Taxi and Limousine Commission. For this tutorial, downloading a single CSV file is sufficient.

The CSV files on that website contain some invalid records, for example an unrealistically high passenger count (unless you can really fit a few thousand people into a single taxi). For convenience, we prepared a cleaned-up file which you can download here.
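
If you prefer to start from one of the original files, the kind of filtering mentioned above can be sketched in a few lines of Python. This is only an illustration and not necessarily how the cleaned-up file was produced: the column name `passenger_count` and the upper bound of 6 passengers are assumptions, so adjust both to the file you downloaded.

```python
import csv
import io

# Assumed plausibility bound; the tutorial does not state which limit was used.
MAX_PASSENGERS = 6


def is_plausible(row, column="passenger_count", max_passengers=MAX_PASSENGERS):
    """Return True if the passenger count parses and lies in 1..max_passengers."""
    try:
        count = int(float(row[column]))
    except (KeyError, TypeError, ValueError, OverflowError):
        # Missing column, empty field, or a value that is not a finite number.
        return False
    return 1 <= count <= max_passengers


def clean_csv(source, target, column="passenger_count"):
    """Copy plausible rows from source to target and return how many were kept."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(
        target, fieldnames=reader.fieldnames, extrasaction="ignore"
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        if is_plausible(row, column):
            writer.writerow(row)
            kept += 1
    return kept


# Small in-memory example; real files would be opened with open(..., newline="").
sample = io.StringIO("passenger_count,trip_distance\n1,2.5\n2000,1.0\n")
cleaned = io.StringIO()
kept_rows = clean_csv(sample, cleaned)
```

Filtering row by row keeps memory use constant, which matters because a single monthly CSV file can be large.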

The CSV file contains only the identifiers of the pick-up and drop-off locations. The actual locations are available as a gzipped GeoJSON file which you can download here. (Note: this GeoJSON file was converted from the original SHP file, which you can find here.)
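
To check how the identifiers in the CSV file relate to the locations, you can inspect the gzipped GeoJSON file without unpacking it first. The following is a minimal sketch using only the Python standard library. It assumes the file holds a standard GeoJSON `FeatureCollection`; the property name `location_id` is hypothetical, so look at the `properties` of one feature to find the real name.

```python
import gzip
import json


def load_geojson(path):
    """Read a gzip-compressed GeoJSON file and return it as a dictionary."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def index_by_id(collection, id_property="location_id"):
    """Map each feature's identifier to its geometry.

    id_property is a hypothetical name; use the one found in your file.
    """
    return {
        feature["properties"][id_property]: feature["geometry"]
        for feature in collection.get("features", [])
    }
```

With such a lookup, a location identifier taken from a CSV record can be resolved to its geometry through a plain dictionary access.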

While the data downloads, you can already continue with the next part, where you will create the new data set.

Next part

Go to the next part: Step 1: Create data set