Processing GPS Trajectory (Crumbs) on Roads inside any Database

In this post, we will show how easy it is to work with GPS trajectories using our technology. GPS devices are becoming commonplace and are now deployed on many different commercial and non-commercial vehicles. A company operating a fleet of taxi usually has a GPS installed in all of their vehicles. This gives the operator an exact idea of the location of their vehicle, which is very useful. For instance, when a customer requests a ride, the  taxi operator will use the current location of a vehicle to send the nearest vehicle to the customer but this is the most obvious customer facing application of GPS. There are more complex analysis that an operator will want to perform from historical (i.e., a day/week/month/year) worth of GPS information collected from vehicles which can shed light several aspects of their businesses.

An operator with historical GPS record can ask many interesting questions from the dataset.  For instance, a taxi operator can ask the following questions:

  • What is the average taxi ride?
  • How far on average does a taxi has to travel before it gets a ride?
  • How much on average does a taxi travel per day (or month)?
  • Are my drivers taking circuitous routes to run up the meter?
  • Which taxi in my fleet travelled the most distance?
  • Is there a taxi with a broken GPS device?

In the following blog post, we show how these queries can be written in a few minutes using our technology. All the operator has to do is to store the dataset in a database (actually any database but we will use Postgres) and can then write all these queries in a few minutes.


Our taxi Dataset is from obtained from San Francisco Yellow Can which provide the GPS information for 500 of their cabs using a publicly available website (  We use the dataset collected by these good folks at Dartmouth ( They have collected one month worth of dataset for 500 taxis and made it available online for download.

We obtained this dataset and made a few small changes to it so that we can load it into Postgresql database. The format of each data point is as follows:

  • id => crumb_id (unique key, we added)
  • taxiid => each taxi has a unique id
  • tripid => Globally unique trip id (unique id for trip which we added)
  • lat => latitude in degrees
  • lon => longitude in degrees
  • occupancy => shows if a cab has a fare (1 = occupied, 0 = free)
  • ts => time is in UNIX epoch format when this tuple was recorded

An example tuple is as follows: [id, taxiid, tripid, lat, lon, occupancy, time], e.g.: [112133, 1, 422, 37.75134, -122.39488, 0, 1213084687]. A taxi periodically stores a GPS point on the map. One assumes that the taxi took the shortest path from one GPS crumb the next. So reconstructing the trip would involve ordering the points by their timestamp (ts) (or equivalently by their id since we assign id by increasing timestamp), obtaining the distance between successive points and adding up the distance.

We first create a taxi table with the following schema and load the dataset.

  id        bigint,
  taxiid    int,
  tripid    bigint,
  lat       float,
  lon       float,
  occupancy int,
  ts        bigint

Next, we convert the latitude longitude to a code by adding an additional attribute called “code” by applying the Z2 function.

    alter table taxi add column code bigint;
    update taxi set code = z2(lat, lon);

We create a few standard B-tree indices on the table

     CREATE INDEX taxi_taxiid on taxi(taxiid);
     CREATE INDEX taxi_tripid on taxi(tripid);
     CREATE INDEX taxi_latlon on taxi(lat,lon);
     CREATE INDEX taxi_occupancy on taxi (taxiid, tripid) where occupancy = 1;
     CREATE INDEX taxi_ts ON taxi (ts);

Function to Compute Trip Distance

Trip distance is the distance of a trip taken by a customer. It involves a few steps:

  1. Extract all trajectory points of a given trajectory denoted by tripid
  2. Create an ordering of the points
  3. Compute road network distance between consecutive points
  4. Add the distances to produce the network distance of the trajectory

We now explain the SQL queries for each of these steps and final provide the final function that takes a tripid as input and produces the network distance of the corresponding trajectory.

We create a simple function called trip to extract all the points corresponding to a given tripid. It is a simple select on the taxi table.

CREATE OR REPLACE FUNCTION trip (which_trip_id bigint)
    id bigint,
    taxiid int,
    tripid bigint,
    lat float,
    lon float,
    occupancy int,
    ts bigint,
    code bigint
         RETURN QUERY SELECT, taxi.taxiid, taxi.tripid,, taxi.lon,
                             taxi.occupancy, taxi.ts, taxi.code
         FROM taxi
         WHERE taxi.tripid = which_trip_id;
$$ LANGUAGE plpgsql;

Next, we order the points such that point “id” is followed by “id +1” (one can do this without the aid of the id field by simply ordering points by increasing timestamp)

   SELECT, t1.code,, t2.code, dist(t1.code, t2.code)
     FROM trip(500) AS t1, trip(500) AS t2
   WHERE = - 1;

Then we compute the length of a trip by simply adding the network distance for each segment of the trajectory using the dist function.

           SELECT SUM(dist(t1.code, t2.code))
             FROM trip(500) AS t1, trip(500) AS t2
           WHERE = - 1;

This whole process can be packaged as a tripdist function that takes a tripid as input and produces the network distance of the trip.

CREATE OR REPLACE FUNCTION tripdist (which_trip_id bigint)
         SELECT sum(dist(t1.code, t2.code))
           FROM trip(which_trip_id) AS t1, trip(which_trip_id) AS t2
         WHERE = - 1;
  $$ LANGUAGE plpgsql;

To create a useful function that given a tripid produces an ordered set (taxiid, tripid, occupancy, tripdist). This way a whole trajectory of all the points is collapsed to taxiid, tripid, occupancy and the network distance (tripdist).

CREATE OR REPLACE FUNCTION tripinfo (which_trip_id bigint)
      taxiid int,
      tripid bigint,
      occupancy int,
      tripdist float
          SELECT taxi.taxiid, taxi.tripid, taxi.occupancy, tripdist(taxi.tripid)
               FROM taxi
          WHERE taxi.tripid = which_trip_id
          GROUP BY taxi.taxiid, taxi.tripid, taxi.occupancy;
$$ LANGUAGE plpgsql;


List of all trips made by a taxi

   SELECT tripinfo(tripid) FROM taxi where taxiid = 1;

How many KMs did each of the cars travel with passengers?

Change occupancy to 0 for the distance taxi’s travel without passengers. This query takes less than a second (about 700 ms) for taxiid = 1.

     SELECT sum(foo.tripdist)/1000 as distance_in_km FROM  -- Divide by 1000 to get KMs
        (SELECT (tripinfo(tripid)).tripdist as tripdist
          FROM taxi WHERE taxiid = 1 and occupancy = 1
          GROUP BY tripid) as foo;

Average distance they travel without passengers?

Average trip length is 7250 meters and it takes 197 seconds.

  SELECT avg(tripdist) FROM
     (SELECT tripdist(tripid) as tripdist FROM
           (SELECT tripid from taxi
              WHERE occupancy = 1
              GROUP BY tripid) as foo
     ) as foo1;

Order all of the 500 taxis the distance in Kms they covered in the month

This query computes the network road distance for all the trajectories for all the 500 taxis and then orders them in a decreasing order of distance. This query takes 295 seconds.

   SELECT taxiid, sum(tripdist)/1000 as distance_in_km FROM (
      SELECT taxiid, (tripinfo(tripid)).tripdist as tripdist
        FROM taxi
        WHERE occupancy = 1
        GROUP BY taxiid, tripid
        ) AS foo
   GROUP BY taxiid
   ORDER BY distance_in_km desc;

Looking at the result immediate reveals problems with two of the Taxis. While all the 498k taxis travelled less than 10k kms in a month, the top two taxi 518 and 494 had over 92k and 46k kms which seems incorrect.
In order to further investigate this we issue the following query.

SELECT (tp).tripid, (tp).tripdist FROM
 (SELECT tripinfo(tripid) AS tp FROM taxi WHERE taxiid = 518) AS foo
  ORDER BY (tp).tripdist DESC;

There are couple of issues with this taxi. There are a bunch of tripids with tripdist as NULL. We found two reasons for this. The GPS points were being put deep inside the pacific ocean or the trajectory only contained just one point and hence we could not compute the distance.

One trip id is interesting tripid = 894359 has a network distance of 1579301.4 meters. This is huge. We found that the reason for that is the the GPS keeps putting points that bounces between palo alto and near golden gate bridge, about 50 kms apart. The query we used is given here.

         SELECT t1.*, t2.*, dist(t1.code, t2.code) AS segdist
           FROM trip(894359) AS t1, trip(894359) AS t2
         WHERE = - 1 ORDER BY segdist desc;

This example shows how our technology can be used to get rid of erroneous GPS measurements before analytics can be performed on the trajectory.

Did my cabie take a huge detour?

To see if the cabie has been taking detours, we need to first figure out how to compute the direct distance between the first and last points in the trajectory. We define the following function, tripdist_srcdst.

CREATE OR REPLACE FUNCTION tripdist_srcdst (which_trip_id bigint)
     tripdist float
          SELECT dist(t1.code, t2.code) FROM
              (SELECT * FROM trip(which_trip_id) as t order by (t).id LIMIT 1) as t1,
              (SELECT * FROM trip(which_trip_id) as t order by (t).id DESC LIMIT 1) as t2;
$$ LANGUAGE plpgsql;

Now we can compute the difference between direct from source to dest vs. driver’s routing. For instance, we can compute the detour for tripid = 500. In this case, the direct distance 2104.6 vs. the cabbie’s route is 2325.1 meters. In other words, driver drove 200 meters more.

    SELECT tripdist_srcdst(500), tripdist(500);

This begs the question, are there some cabbies take a lot of circuitous routes. In particular, we are not interested in 200 meter detours but rather routes that are more than 5 kms longer than the direct path from source to destination. We use the following query to obtain the top 100 routes with the maximum detour. We limit the trips to those that are 50 kms.

 SELECT taxiid, tripid, tpsd, tp, (tp - tpsd) as diff FROM
        (SELECT tripdist(tripid) as tp, tripdist_srcdst(tripid) as tpsd, taxiid, tripid
           FROM taxi
           WHERE occupancy = 1
           GROUP BY taxiid, tripid
         ) as foo
     WHERE tp is not null AND tpsd is not null AND tp < 50 * 1000 AND tpsd < 50 * 1000
  LIMIT 100;

We immediately realize by looking at the output of this query is that cabbies make a of
loops (the taxi makes a huge loop and comes back
to the same point. Just like airport shuttles). So which are the taxis that detour too much, what is the average detour?

  SELECT taxiid, count(*) as num_trips, avg(tp-tpsd) as average_detour FROM
      (SELECT tripdist(tripid) as tp, tripdist_srcdst(tripid) as tpsd, taxiid, tripid
         FROM taxi
         WHERE occupancy = 1
         GROUP BY taxiid, tripid
       ) as foo
   WHERE tp is not null and tpsd is not null and tp < 50 *1000 and tpsd < 50 * 1000 AND tp >= tpsd
   GROUP BY taxiid
   ORDER BY average_detour desc;

The result of this query is interesting. Taxi 518 had 1082 trips and had a average detour of 14 Kms. But we know that this is the taxi with a buggy GPS. Now Taxi 19 made 594 trips 2.978 kms detour, which is roughly 1800 kms more than direct routes. May be this cabbie does a lot of carpool rides. Something to examine more.

Note that such a free form exploration of spatial datasets on road network is only possible since it is so simple to express queries using SQL inside a database. Imagine if for each of the query you had to program quite a bit to get answers.

Where are the nearest taxi?

Finding a taxi involves finding a taxi that is finding the closest in terms of location and also in time. In a live system the time is the current epoch timestamp. The following query finds the nearest taxi with hard coded location (37.75, -122.4) for the customer and time when the customer requested (1211840888). This query takes 145 ms.

SELECT * from (
        SELECT taxiid, last(occupancy) as taxi_occupancy,
               last(lat) as taxi_lat, last(lon) as taxi_lon,
               last(code) as taxi_code, dist(last(code), z2(37.75, -122.4)) as taxi_distance
          FROM (
               SELECT * FROM taxi WHERE
                   ts BETWEEN 1211840888 - 10 * 60 and 1211840888 -- Extract taxi's position with 10 minute tim window
               ORDER BY TS) as foo
            GROUP BY taxiid) as foo1
   WHERE foo1.taxi_occupancy = 0 -- Choose that are currently available
   ORDER BY taxi_distance -- Order by distance to where location is made
   LIMIT 10;

Want to try this demo?

Try our article here for the oracle and the datasets.