# Exact K-Nearest Neighbors

In this post we show how easy it is to find the exact K-nearest neighbors for a large number of points on a road network. In this example we use a GiST spatial in Postgres. This example can also be done using without a GiSTindex. Our dataset consists of 49,573 fast food restaurants obtained from http://fastfoodmaps.com/data.html.

## Dataset

The schema of the relation and the index on it are as follows:

```CREATE TABLE restz4
(
id integer,
lat double precision,
lon double precision,
code bigint
);

UPDATE restz4 set code = Z2(lat, lon);
CREATE INDEX id_index_key ON public.restz4 USING btree (id);
CREATE INDEX code_index_key ON public.restz4 USING btree (code);
CREATE INDEX lat_lon_index_key ON public.restz4 USING btree (lat, lon);

-- Create GiST Index
ALTER TABLE restz4 ADD COLUMN gid serial PRIMARY KEY;
ALTER TABLE restz4 ADD COLUMN geom geometry(POINT,4326);
UPDATE restz4 SET geom = ST_SetSRID(ST_MakePoint(lon,lat),4326);
CREATE INDEX idx_restz4_geom ON restz4 USING GIST(geom);
CREATE INDEX idx_rest4_geom ON public.restz4 USING gist (geom);
```

## How to compute exact neighbors?

The strategy to computing the exact K-nearest neighbors is as follows:

• Use the GiST spatial index index to find the K-spatial nearest neighbors. Spatial distance here refers to geodesic or “crow-flying” distance.
• For each point, look at the kth spatial neighbor and compute its network distance. Let this distance be denoted by “dist”
• Obtain all points that lie within a distance of “dist” around the point. A quick dirty way of doing is by performing a window query around each point based on a range on the latitude and longitude. Since one 1 degree of latitude/longitude corresponds to 110kms, we simply divide dist by 110,000 and use it as the side length of a box around each point
• Compute the actual network distance of all points within this box and choose the K closest.
• This method is correct since the spatial distance is a lower-bound of the network distance. We are guaranteed to find the K actual network neighbors within a distance of dist.

## SQL Query

This query finds the 10 neighbors for each of the roughly 50k restaurants among nearby restaurants.

```SELECT foo1.id as id1, restz4.id as id2,
dist(foo1.code, restz4.code) as networkdist
FROM (
SELECT * FROM (
SELECT y.id as id, y.lat as lat, y.lon as lon, y.code as code, (
SELECT dist(x.code, y.code) / 110000 as dist -- 110km is roughly 1 degree lat/lon
FROM restz4 x
WHERE x.gid != y.gid
ORDER BY x.geom &lt;-&gt; st_setsrid(y.geom, 4326)
OFFSET 10   -- Value of K: Find the 10 spatial neighbor
LIMIT 1
)
FROM restz4 y
) AS FOO
) AS FOO1, restz4
WHERE
restz4.lat between FOO1.lat - dist and FOO1.lat + dist AND
restz4.lon between FOO1.lon - dist and FOO1.lon + dist AND
foo1.id != restz4.id;
```

## Performance

When k = 10, it takes 166 seconds to compute 10 neighbors for each of the nearly 50k restaurants. It makes 1.02 million network distance computations on the road network in the process of computing the nearest neighbors.

# Processing GPS Trajectory (Crumbs) on Roads inside any Database

In this post, we will show how easy it is to work with GPS trajectories using our technology. GPS devices are becoming commonplace and are now deployed on many different commercial and non-commercial vehicles. A company operating a fleet of taxi usually has a GPS installed in all of their vehicles. This gives the operator an exact idea of the location of their vehicle, which is very useful. For instance, when a customer requests a ride, the  taxi operator will use the current location of a vehicle to send the nearest vehicle to the customer but this is the most obvious customer facing application of GPS. There are more complex analysis that an operator will want to perform from historical (i.e., a day/week/month/year) worth of GPS information collected from vehicles which can shed light several aspects of their businesses.

An operator with historical GPS record can ask many interesting questions from the dataset.  For instance, a taxi operator can ask the following questions:

• What is the average taxi ride?
• How far on average does a taxi has to travel before it gets a ride?
• How much on average does a taxi travel per day (or month)?
• Are my drivers taking circuitous routes to run up the meter?
• Which taxi in my fleet travelled the most distance?
• Is there a taxi with a broken GPS device?

In the following blog post, we show how these queries can be written in a few minutes using our technology. All the operator has to do is to store the dataset in a database (actually any database but we will use Postgres) and can then write all these queries in a few minutes.

# Dataset

Our taxi Dataset is from obtained from San Francisco Yellow Can which provide the GPS information for 500 of their cabs using a publicly available website (http://cabspotting.org/api).  We use the dataset collected by these good folks at Dartmouth (http://crawdad.cs.dartmouth.edu/~crawdad/epfl/mobility/). They have collected one month worth of dataset for 500 taxis and made it available online for download.

We obtained this dataset and made a few small changes to it so that we can load it into Postgresql database. The format of each data point is as follows:

• id => crumb_id (unique key, we added)
• taxiid => each taxi has a unique id
• tripid => Globally unique trip id (unique id for trip which we added)
• lat => latitude in degrees
• lon => longitude in degrees
• occupancy => shows if a cab has a fare (1 = occupied, 0 = free)
• ts => time is in UNIX epoch format when this tuple was recorded

An example tuple is as follows: [id, taxiid, tripid, lat, lon, occupancy, time], e.g.: [112133, 1, 422, 37.75134, -122.39488, 0, 1213084687]. A taxi periodically stores a GPS point on the map. One assumes that the taxi took the shortest path from one GPS crumb the next. So reconstructing the trip would involve ordering the points by their timestamp (ts) (or equivalently by their id since we assign id by increasing timestamp), obtaining the distance between successive points and adding up the distance.

We first create a taxi table with the following schema and load the dataset.

```CREATE TABLE taxi (
id        bigint,
taxiid    int,
tripid    bigint,
lat       float,
lon       float,
occupancy int,
ts        bigint
);
```

Next, we convert the latitude longitude to a code by adding an additional attribute called “code” by applying the Z2 function.

```    alter table taxi add column code bigint;
update taxi set code = z2(lat, lon);
```

We create a few standard B-tree indices on the table

```     ALTER TABLE taxi ADD PRIMARY KEY (id);
CREATE INDEX taxi_taxiid on taxi(taxiid);
CREATE INDEX taxi_tripid on taxi(tripid);
CREATE INDEX taxi_latlon on taxi(lat,lon);
CREATE INDEX taxi_occupancy on taxi (taxiid, tripid) where occupancy = 1;
CREATE INDEX taxi_ts ON taxi (ts);
```

# Function to Compute Trip Distance

Trip distance is the distance of a trip taken by a customer. It involves a few steps:

1. Extract all trajectory points of a given trajectory denoted by tripid
2. Create an ordering of the points
3. Compute road network distance between consecutive points
4. Add the distances to produce the network distance of the trajectory

We now explain the SQL queries for each of these steps and final provide the final function that takes a tripid as input and produces the network distance of the corresponding trajectory.

We create a simple function called trip to extract all the points corresponding to a given tripid. It is a simple select on the taxi table.

```CREATE OR REPLACE FUNCTION trip (which_trip_id bigint)
RETURNS TABLE(
id bigint,
taxiid int,
tripid bigint,
lat float,
lon float,
occupancy int,
ts bigint,
code bigint
) IMMUTABLE AS
\$\$
BEGIN
RETURN QUERY SELECT taxi.id, taxi.taxiid, taxi.tripid, taxi.lat, taxi.lon,
taxi.occupancy, taxi.ts, taxi.code
FROM taxi
WHERE taxi.tripid = which_trip_id;
END;
\$\$ LANGUAGE plpgsql;
```

Next, we order the points such that point “id” is followed by “id +1” (one can do this without the aid of the id field by simply ordering points by increasing timestamp)

```   SELECT t1.id, t1.code, t2.id, t2.code, dist(t1.code, t2.code)
FROM trip(500) AS t1, trip(500) AS t2
WHERE t1.id = t2.id - 1;
```

Then we compute the length of a trip by simply adding the network distance for each segment of the trajectory using the dist function.

```           SELECT SUM(dist(t1.code, t2.code))
FROM trip(500) AS t1, trip(500) AS t2
WHERE t1.id = t2.id - 1;
```

This whole process can be packaged as a tripdist function that takes a tripid as input and produces the network distance of the trip.

```CREATE OR REPLACE FUNCTION tripdist (which_trip_id bigint)
RETURNS TABLE(tripdist float) IMMUTABLE AS
\$\$
BEGIN
RETURN QUERY
SELECT sum(dist(t1.code, t2.code))
FROM trip(which_trip_id) AS t1, trip(which_trip_id) AS t2
WHERE t1.id = t2.id - 1;
END;
\$\$ LANGUAGE plpgsql;
```

To create a useful function that given a tripid produces an ordered set (taxiid, tripid, occupancy, tripdist). This way a whole trajectory of all the points is collapsed to taxiid, tripid, occupancy and the network distance (tripdist).

```CREATE OR REPLACE FUNCTION tripinfo (which_trip_id bigint)
RETURNS TABLE(
taxiid int,
tripid bigint,
occupancy int,
tripdist float
) IMMUTABLE AS
\$\$
BEGIN
RETURN QUERY
SELECT taxi.taxiid, taxi.tripid, taxi.occupancy, tripdist(taxi.tripid)
FROM taxi
WHERE taxi.tripid = which_trip_id
GROUP BY taxi.taxiid, taxi.tripid, taxi.occupancy;
END;
\$\$ LANGUAGE plpgsql;
```

# Queries

## List of all trips made by a taxi

```   SELECT tripinfo(tripid) FROM taxi where taxiid = 1;
```

## How many KMs did each of the cars travel with passengers?

Change occupancy to 0 for the distance taxi’s travel without passengers. This query takes less than a second (about 700 ms) for taxiid = 1.

```     SELECT sum(foo.tripdist)/1000 as distance_in_km FROM  -- Divide by 1000 to get KMs
(SELECT (tripinfo(tripid)).tripdist as tripdist
FROM taxi WHERE taxiid = 1 and occupancy = 1
GROUP BY tripid) as foo;
```

## Average distance they travel without passengers?

Average trip length is 7250 meters and it takes 197 seconds.

```  SELECT avg(tripdist) FROM
(SELECT tripdist(tripid) as tripdist FROM
(SELECT tripid from taxi
WHERE occupancy = 1
GROUP BY tripid) as foo
) as foo1;
```

## Order all of the 500 taxis the distance in Kms they covered in the month

This query computes the network road distance for all the trajectories for all the 500 taxis and then orders them in a decreasing order of distance. This query takes 295 seconds.

```   SELECT taxiid, sum(tripdist)/1000 as distance_in_km FROM (
SELECT taxiid, (tripinfo(tripid)).tripdist as tripdist
FROM taxi
WHERE occupancy = 1
GROUP BY taxiid, tripid
) AS foo
GROUP BY taxiid
ORDER BY distance_in_km desc;
```

Looking at the result immediate reveals problems with two of the Taxis. While all the 498k taxis travelled less than 10k kms in a month, the top two taxi 518 and 494 had over 92k and 46k kms which seems incorrect.
In order to further investigate this we issue the following query.

```SELECT (tp).tripid, (tp).tripdist FROM
(SELECT tripinfo(tripid) AS tp FROM taxi WHERE taxiid = 518) AS foo
ORDER BY (tp).tripdist DESC;
```

There are couple of issues with this taxi. There are a bunch of tripids with tripdist as NULL. We found two reasons for this. The GPS points were being put deep inside the pacific ocean or the trajectory only contained just one point and hence we could not compute the distance.

One trip id is interesting tripid = 894359 has a network distance of 1579301.4 meters. This is huge. We found that the reason for that is the the GPS keeps putting points that bounces between palo alto and near golden gate bridge, about 50 kms apart. The query we used is given here.

```         SELECT t1.*, t2.*, dist(t1.code, t2.code) AS segdist
FROM trip(894359) AS t1, trip(894359) AS t2
WHERE t1.id = t2.id - 1 ORDER BY segdist desc;
```

This example shows how our technology can be used to get rid of erroneous GPS measurements before analytics can be performed on the trajectory.

## Did my cabie take a huge detour?

To see if the cabie has been taking detours, we need to first figure out how to compute the direct distance between the first and last points in the trajectory. We define the following function, tripdist_srcdst.

```CREATE OR REPLACE FUNCTION tripdist_srcdst (which_trip_id bigint)
RETURNS TABLE(
tripdist float
) IMMUTABLE AS
\$\$
BEGIN
RETURN QUERY
SELECT dist(t1.code, t2.code) FROM
(SELECT * FROM trip(which_trip_id) as t order by (t).id LIMIT 1) as t1,
(SELECT * FROM trip(which_trip_id) as t order by (t).id DESC LIMIT 1) as t2;
END;
\$\$ LANGUAGE plpgsql;
```

Now we can compute the difference between direct from source to dest vs. driver’s routing. For instance, we can compute the detour for tripid = 500. In this case, the direct distance 2104.6 vs. the cabbie’s route is 2325.1 meters. In other words, driver drove 200 meters more.

```    SELECT tripdist_srcdst(500), tripdist(500);
```

This begs the question, are there some cabbies take a lot of circuitous routes. In particular, we are not interested in 200 meter detours but rather routes that are more than 5 kms longer than the direct path from source to destination. We use the following query to obtain the top 100 routes with the maximum detour. We limit the trips to those that are 50 kms.

``` SELECT taxiid, tripid, tpsd, tp, (tp - tpsd) as diff FROM
(SELECT tripdist(tripid) as tp, tripdist_srcdst(tripid) as tpsd, taxiid, tripid
FROM taxi
WHERE occupancy = 1
GROUP BY taxiid, tripid
) as foo
WHERE tp is not null AND tpsd is not null AND tp &lt; 50 * 1000 AND tpsd &lt; 50 * 1000
ORDER BY diff DESC
LIMIT 100;
```

We immediately realize by looking at the output of this query is that cabbies make a of
loops (the taxi makes a huge loop and comes back
to the same point. Just like airport shuttles). So which are the taxis that detour too much, what is the average detour?

```  SELECT taxiid, count(*) as num_trips, avg(tp-tpsd) as average_detour FROM
(SELECT tripdist(tripid) as tp, tripdist_srcdst(tripid) as tpsd, taxiid, tripid
FROM taxi
WHERE occupancy = 1
GROUP BY taxiid, tripid
) as foo
WHERE tp is not null and tpsd is not null and tp &lt; 50 *1000 and tpsd &lt; 50 * 1000 AND tp &gt;= tpsd
GROUP BY taxiid
ORDER BY average_detour desc;
```

The result of this query is interesting. Taxi 518 had 1082 trips and had a average detour of 14 Kms. But we know that this is the taxi with a buggy GPS. Now Taxi 19 made 594 trips 2.978 kms detour, which is roughly 1800 kms more than direct routes. May be this cabbie does a lot of carpool rides. Something to examine more.

Note that such a free form exploration of spatial datasets on road network is only possible since it is so simple to express queries using SQL inside a database. Imagine if for each of the query you had to program quite a bit to get answers.

## Where are the nearest taxi?

Finding a taxi involves finding a taxi that is finding the closest in terms of location and also in time. In a live system the time is the current epoch timestamp. The following query finds the nearest taxi with hard coded location (37.75, -122.4) for the customer and time when the customer requested (1211840888). This query takes 145 ms.

```SELECT * from (
SELECT taxiid, last(occupancy) as taxi_occupancy,
last(lat) as taxi_lat, last(lon) as taxi_lon,
last(code) as taxi_code, dist(last(code), z2(37.75, -122.4)) as taxi_distance
FROM (
SELECT * FROM taxi WHERE
ts BETWEEN 1211840888 - 10 * 60 and 1211840888 -- Extract taxi's position with 10 minute tim window
ORDER BY TS) as foo
GROUP BY taxiid) as foo1
WHERE foo1.taxi_occupancy = 0 -- Choose that are currently available
ORDER BY taxi_distance -- Order by distance to where location is made
LIMIT 10;
```