Step 2: Define the structure of the data

At this point, you have created a new, empty data set. It does not contain any data yet, and the platform knows nothing about the structure and properties of the data.

First, you will define the structure of your .csv files:

  • The platform must know in which columns the location is stored (longitude, latitude and optional height).

  • The platform must know in which column the timestamp is stored.

  • If you want to analyze any additional properties, you also need to indicate in which columns those properties are stored and whether they are numbers, categorical properties (enumerations), or free strings.

Step 2.1: Navigate to the Configure Data Properties UI

Click the Configure Data Properties button in the navigation bar.

Figure 1. The navigation bar after clicking on the Configure Data Properties button.

Step 2.2: Gather information about your data set

Let us first understand the structure of the data that we will upload.

The AIS data that you are using in this tutorial is described here. The relevant information for us is:

  • The different columns in the CSV file: MMSI, BaseDateTime, LAT, LON, SOG, COG, Heading, VesselName, IMO, CallSign, VesselType, Status, Length, Width, Draft, Cargo, TransceiverClass

  • The EPSG reference system: EPSG:4269. This tells us how we should interpret the location coordinates.
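The wizard asks for column positions rather than column names, so it helps to translate the header into numbers first. The following is a minimal sketch, assuming the wizard counts columns starting from 1 (as the rest of this tutorial does) and that the header row matches the column list above. The helper name `column_indices` is made up for this illustration and is not part of the platform.

```python
import csv
import io

# Header row as listed in the AIS data description (assumed to match your files).
HEADER = (
    "MMSI,BaseDateTime,LAT,LON,SOG,COG,Heading,VesselName,IMO,"
    "CallSign,VesselType,Status,Length,Width,Draft,Cargo,TransceiverClass"
)


def column_indices(header_line):
    """Map each column name to its 1-based position in the header."""
    names = next(csv.reader(io.StringIO(header_line)))
    return {name.strip(): position for position, name in enumerate(names, start=1)}


indices = column_indices(HEADER)
for name in ("MMSI", "BaseDateTime", "LAT", "LON", "VesselName", "VesselType", "Length"):
    print(f"{name}: column {indices[name]}")
```

For your own files, you could pass the first line of the .csv file to `column_indices` instead of the hard-coded `HEADER`.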

There is also an image that gives an overview of the content and data type of each column:

Figure 2. Overview of the different columns in the AIS data (source).

Step 2.3: Use the wizard to configure the required properties

Now that you have all the information about the data set, you can provide this info to the platform. This tutorial uses the wizard for this.

Right now, the wizard contains the required properties:

  • The identifier column: each row in the .csv files represents a recorded location of a ship. The platform can associate that location with a specific ship by looking at the unique ship identifier that is present in each row.

  • The location columns: each row contains a longitude and latitude coordinate. The platform needs to know in which columns these are stored.

  • The timestamp column: each location is recorded at a specific time. The platform needs to know which column contains that timestamp.

Figure 3. The wizard with all the required properties that still need to be configured.
The wizard has context-sensitive help messages

The info box on the right-hand side of the wizard contains some additional information.

This information updates based on the property you are currently editing.

Identifier

The identifier in our data set is the MMSI (Maritime Mobile Service Identity) value. As listed in our data schema, this value is stored in the first column as a string.

Click on the Identifier button in the wizard and add that information to the correct fields:

Figure 4. The wizard with all the information for the Identifier property completed.

Location

Follow the same steps for the location.

  • The Latitude is stored in column 3 as a double.

  • The Longitude is stored in column 4, also as a double.

  • The reference system is EPSG:4269.

It is possible that the wizard will show an error stating that the column index is already used for another column.

Figure 5. The wizard with all information about the Location, but showing an error.

This happens because you have not configured the Timestamp property yet, and it currently points to column 4 as well. You can ignore the error for now, as it disappears once you configure the timestamp.
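The rule behind this error is simply that each column index can belong to only one property. The sketch below illustrates that rule; it is not the platform's own validation code, and the function name and the dictionary layout are assumptions made for this example.

```python
from collections import defaultdict


def find_column_conflicts(config):
    """Return {column index: [property names]} for indices used more than once."""
    by_column = defaultdict(list)
    for prop, column in config.items():
        by_column[column].append(prop)
    return {col: sorted(props) for col, props in by_column.items() if len(props) > 1}


# State while editing: the Timestamp property still points at column 4.
in_progress = {"Identifier": 1, "Latitude": 3, "Longitude": 4, "Timestamp": 4}
print(find_column_conflicts(in_progress))  # {4: ['Longitude', 'Timestamp']}

# Moving the timestamp to column 2 resolves the conflict.
done = dict(in_progress, Timestamp=2)
print(find_column_conflicts(done))  # {}
```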

Timestamp

Repeat once more for the timestamp, which is stored in column 2. As soon as you indicate that the times are stored in column 2, the aforementioned error disappears.

Figure 6. The wizard with the timestamp configured, and the error fixed.

The times are stored using a standard string representation (YYYY-MM-DD:HH-MM-SS), so there is no need to define a custom pattern.

Step 2.4: Use the wizard to configure additional properties

If you stop here, you will only have access to the locations and timestamps during analysis. Most likely, you will want to make some additional properties available.

For this tutorial, you can add, for example:

  • The name of the vessel: available in column 8 as a string.

  • The vessel type: available in column 11. It is stored as a number, but those numbers represent an enumeration. For example, 30 means a fishing vessel and 36 a sailing vessel (see here for an overview).

  • The length of the vessel: stored in column 13 as a float.
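To see why the vessel type is better treated as an enumeration than as a number, consider the sketch below. It only contains the two codes mentioned above; the label table and function names are illustrative assumptions, not a complete list and not a platform API.

```python
from collections import Counter

# Only the two codes named in this tutorial; the real code list is longer.
VESSEL_TYPE_LABELS = {30: "fishing vessel", 36: "sailing vessel"}


def label_vessel_type(code):
    """Translate a numeric vessel type code into a category label."""
    return VESSEL_TYPE_LABELS.get(code, f"unknown ({code})")


def count_by_type(codes):
    """Count how often each category occurs, which is what makes sense for an enum."""
    return Counter(label_vessel_type(code) for code in codes)


counts = count_by_type([30, 30, 36, 99])
print(dict(counts))
```

Counting per category is meaningful here, whereas an average of the codes (for example, the mean of 30 and 36) would not describe any vessel type.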

More details about the column type are available

If you are unsure what the data type for your column is, consult the help on the right-hand side of the wizard.

We also have an additional article with more guidance available.

The name

Press the ADD PROPERTY button at the bottom of the wizard, and fill in the details for this new property.

Figure 7. The wizard with the name column configured.

The vessel type

Press the ADD PROPERTY button again and repeat the steps. This time, choose enum as the data type for the column.

Figure 8. The wizard with the vessel type column configured.

The length

Press the ADD PROPERTY button one last time, and this time fill in the details for the length property.

Figure 9. The wizard with the length column configured.

As this is a numeric property, you also need to define an aggregation interval, as shown in the screenshot.

The analytics engine maps numeric properties to bins or intervals. For example, if you define the aggregation interval for this length property to be 2, the analytics engine will report statistics on this length in intervals of 2 meters.

You will then be able to tell whether a vessel in your original data had a length between 0 and 2 meters, between 2 and 4 meters, and so on. You will not be able to distinguish between vessels with a length of 4 and 5 meters, because both their lengths are mapped to the "between 4 and 6 meters" interval.

Figure 10. Illustration of how an aggregation interval of 2 meters affects the reported lengths.
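The mapping from a length to its interval can be sketched in a few lines. This is an illustration of the idea only, not the analytics engine's implementation. It assumes that the lower bound of each interval is inclusive and the upper bound exclusive, which matches the example above where a length of 4 meters falls in the "between 4 and 6 meters" interval.

```python
import math


def aggregation_bin(value, interval):
    """Return the (lower, upper) bounds of the interval that contains value.

    Assumption: the lower bound is inclusive and the upper bound is exclusive.
    """
    lower = math.floor(value / interval) * interval
    return (lower, lower + interval)


for length in (1.5, 4, 5, 6):
    print(length, aggregation_bin(length, 2))
# 1.5 (0, 2)
# 4 (4, 6)
# 5 (4, 6)
# 6 (6, 8)
```

Lengths of 4 and 5 meters produce the same result, so after aggregation they can no longer be told apart.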

A good choice for the aggregation interval depends on two things:

  • The accuracy you want to have available during the analysis.

  • The accuracy that is available in the data: if the vessel length in the source data is only reported with an accuracy of 10 meters, there is no point in choosing a smaller aggregation interval.

Using larger intervals is beneficial for the performance and response times of the platform. For smaller data sets, however, the impact of choosing a small interval is negligible.

Step 2.5: Save the configuration

Once you have filled in all the properties you want to have available for analysis, press the save button at the bottom of the wizard to save the configuration.

After you have saved the configuration, a table showing the properties of your data will appear underneath the wizard:

Figure 11. The properties of your source data, displayed underneath the wizard.

There are some things to note here:

  • The data properties overview shows more properties than you have configured. This is because you have indicated that the length property is contained in column 13. Now the platform knows your input data has at least 13 columns. The columns which you didn’t configure are present in the overview, but will be ignored.

  • At this point, it is still possible to change the properties, for example to correct a mistake. Once you start uploading data, it is no longer possible to make changes to the data structure.
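The first note can be made concrete with a small sketch that derives the overview from the configured columns. The dictionary below mirrors the choices made in this tutorial; the function name and the placeholder text for unconfigured columns are assumptions for illustration and do not reflect how the platform renders its table.

```python
# Column index -> property, as configured in this tutorial.
configured = {
    1: "Identifier",
    2: "Timestamp",
    3: "Latitude",
    4: "Longitude",
    8: "VesselName",
    11: "VesselType",
    13: "Length",
}


def overview(configured_columns):
    """List every column up to the highest configured index, marking the gaps."""
    highest = max(configured_columns)
    return [
        (column, configured_columns.get(column, "(not configured, ignored)"))
        for column in range(1, highest + 1)
    ]


rows = overview(configured)
ignored = [column for column, label in rows if column not in configured]
print(len(rows))  # 13
print(ignored)    # [5, 6, 7, 9, 10, 12]
```

Because the highest configured index is 13, the overview has 13 rows even though only 7 columns were configured.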

Other ways of defining your data

In this tutorial, we use the wizard to define the structure of the data. You can also define the structure in a separate file (in .csv format) and upload it, avoiding the use of the wizard. Alternatively, you can reuse a previously defined configuration.

This is explained in more detail here.
