Exact K-Nearest Neighbors

In this post we show how easy it is to find the exact K-nearest neighbors for a large number of points on a road network. In this example we use a GiST spatial in Postgres. This example can also be done using without a GiSTindex. Our dataset consists of 49,573 fast food restaurants obtained from http://fastfoodmaps.com/data.html.

Dataset

The schema of the relation and the index on it are as follows:

CREATE TABLE restz4
(
  id integer,
  lat double precision,
  lon double precision,
  code bigint
);

UPDATE restz4 set code = Z2(lat, lon);
CREATE INDEX id_index_key ON public.restz4 USING btree (id);
CREATE INDEX code_index_key ON public.restz4 USING btree (code);
CREATE INDEX lat_lon_index_key ON public.restz4 USING btree (lat, lon); 

-- Create GiST Index
ALTER TABLE restz4 ADD COLUMN gid serial PRIMARY KEY;
ALTER TABLE restz4 ADD COLUMN geom geometry(POINT,4326);
UPDATE restz4 SET geom = ST_SetSRID(ST_MakePoint(lon,lat),4326);
CREATE INDEX idx_restz4_geom ON restz4 USING GIST(geom);
CREATE INDEX idx_rest4_geom ON public.restz4 USING gist (geom);

How to compute exact neighbors?

The strategy to computing the exact K-nearest neighbors is as follows:

  • Use the GiST spatial index index to find the K-spatial nearest neighbors. Spatial distance here refers to geodesic or “crow-flying” distance.
  • For each point, look at the kth spatial neighbor and compute its network distance. Let this distance be denoted by “dist”
  • Obtain all points that lie within a distance of “dist” around the point. A quick dirty way of doing is by performing a window query around each point based on a range on the latitude and longitude. Since one 1 degree of latitude/longitude corresponds to 110kms, we simply divide dist by 110,000 and use it as the side length of a box around each point
  • Compute the actual network distance of all points within this box and choose the K closest.
  • This method is correct since the spatial distance is a lower-bound of the network distance. We are guaranteed to find the K actual network neighbors within a distance of dist.

SQL Query

This query finds the 10 neighbors for each of the roughly 50k restaurants among nearby restaurants.

SELECT foo1.id as id1, restz4.id as id2,
       dist(foo1.code, restz4.code) as networkdist
   FROM (
       SELECT * FROM (
            SELECT y.id as id, y.lat as lat, y.lon as lon, y.code as code, (
                SELECT dist(x.code, y.code) / 110000 as dist -- 110km is roughly 1 degree lat/lon
                   FROM restz4 x
                   WHERE x.gid != y.gid
                   ORDER BY x.geom <-> st_setsrid(y.geom, 4326)
                   OFFSET 10   -- Value of K: Find the 10 spatial neighbor
                   LIMIT 1
                )
            FROM restz4 y
      ) AS FOO
    ) AS FOO1, restz4
    WHERE
        restz4.lat between FOO1.lat - dist and FOO1.lat + dist AND
        restz4.lon between FOO1.lon - dist and FOO1.lon + dist AND
        foo1.id != restz4.id;

Performance

When k = 10, it takes 166 seconds to compute 10 neighbors for each of the nearly 50k restaurants. It makes 1.02 million network distance computations on the road network in the process of computing the nearest neighbors.