To set up our simple benchmark of 1 million network distances, we choose 1000 hospitals and 1000 parks at random and then take their Cartesian product (i.e., cross product) to create 1 million unique source and destination pairs. We store these 1 million tuples in a table called performance_dist.
CREATE TABLE performance_dist AS
SELECT foo1.code AS code1, foo2.code AS code2, z4(foo1.code, foo2.code) AS codez4
FROM (SELECT * FROM hospitals LIMIT 1000) AS foo1,
     (SELECT * FROM parks LIMIT 1000) AS foo2;
Now that we have created the 1 million source and destination pairs, the first variant simply applies the dist() function using a sequential scan of performance_dist. This query takes 32 seconds, which translates to roughly 31k network distances/second.
SELECT dist(code1, code2) from performance_dist;
The second variant uses a B-tree index on (code1, code2) and takes 29 seconds, which translates to roughly 34k distances/second. This is a little better than a sequential scan, but only marginally.
CREATE INDEX performance_code1_code2 ON performance_dist(code1, code2);
SELECT dist(code1, code2) FROM performance_dist ORDER BY code1, code2;
The third variant is aided by an index scan on codez4 and takes only 25 seconds, which is about 40k network distances/second.
CREATE INDEX performance_codez4 ON performance_dist(codez4);
SELECT dist(codez4) FROM performance_dist ORDER BY codez4;
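The throughput figures above are simply the 1 million pairs divided by the elapsed wall-clock time; a quick check of the arithmetic:

```python
# Throughput = pairs / elapsed seconds, for the three variants above.
pairs = 1_000_000
timings = [("sequential scan", 32), ("B-tree on (code1, code2)", 29), ("index on codez4", 25)]
for label, seconds in timings:
    print(f"{label}: {pairs / seconds:,.0f} distances/second")
# sequential scan: 31,250 distances/second
# B-tree on (code1, code2): 34,483 distances/second
# index on codez4: 40,000 distances/second
```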
How can we improve this further? (a) Since PostgreSQL executes each query in a single backend process, we can run several of these queries concurrently to exploit multiple cores. (b) We ran these benchmarks on an xlarge RDS instance on Amazon AWS; an in-house PostgreSQL installation should be considerably faster. Please let us know what kinds of throughput you see on your in-house machine.
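One way to sketch idea (a): split the 1 million rows into ranges that N concurrent clients each query independently. The helper below is hypothetical (not part of the benchmark); it only generates the per-worker SQL strings, reusing the performance_dist table and dist()/codez4 names from above. Actually executing them concurrently (e.g., one psql session or psycopg2 connection per worker) is left to the reader.

```python
def chunk_queries(total_rows, n_workers):
    """Generate one LIMIT/OFFSET query per worker, together covering all rows.

    Assumes the ORDER BY codez4 index scan from the third variant, so each
    worker reads a contiguous slice of the index.
    """
    base = total_rows // n_workers
    queries = []
    offset = 0
    for i in range(n_workers):
        # Give any remainder rows to the last worker.
        limit = base if i < n_workers - 1 else total_rows - offset
        queries.append(
            "SELECT dist(codez4) FROM performance_dist "
            f"ORDER BY codez4 LIMIT {limit} OFFSET {offset};"
        )
        offset += limit
    return queries

# Four workers, 250k pairs each:
for q in chunk_queries(1_000_000, 4):
    print(q)
```

Note that large OFFSETs still walk the index from the start, so for many workers a WHERE range on codez4 values would scale better; the LIMIT/OFFSET form is just the simplest way to partition the work.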