City Distance Oracles (CDO)

We presented  “CDO: Extremely High-Throughput Road Distance Computations on City Road Networks (Best Demo Award)” in SIGSPATIAL 2016. We demonstrate a solution, termed City Distance Oracles (CDO), using our previously developed -distance oracle to achieve as many as 7 million shortest distance computations per second per commodity machine on a city road network, i.e., 10K × 10K origin-distance (OD) matrix can be finished in 14 seconds.

The reaction of a number of companies that make use of such spatial analytic queries was that typical queries are concentrated in a small local region rather than the whole continental region, termed the spatial concentration property. As an example of such a use-case is a delivery company that needs to plan the delivery route for each truck every day, where the route of each truck must be restricted into a local region, i.e., the region near to the package warehouse. In particular, each such warehouse handles 1, 000 to 10, 000 packages per day, and each truck can deliver a maximum of 150 packages per route. In order to efficiently assign the packages to trucks and plan routes, the delivery company computes a distance matrix that captures the distance between every pair of destination locations of the packages, This is a common spatial analytic query which makes between 1 million and 100 million distance computations. Here, the spatial concentration property means that in the general case, all destination locations of packages must be in proximity to the warehouse, say within 100KM.

Job Opportunities:

application.png

We use the LEHD dataset to obtain the job locations around the Bay Area. The workload shows the nearby job opportunities (e.g., within 10 kms) for each census block in the Bay Area, requiring 120 million distance computations, where CDO finished it within 18 seconds.

 

Google Maps Does Not Do Real Proximity Search!

Our technology can perform millions of nearest road distance neighbor queries on a road network. People often counter our claim by telling us that “Google Maps already does this!”. The short answer, is: No, Google does not do it!  In this post, we dissect what Google Maps can and cannot do when it comes to finding the nearest object to a query point. You may be surprised to learn that Google Maps uses crow-flying distance (i.e., it assumes that you can fly over buildings and lakes). What is even worse is that Google Maps quietly buries this rather obvious deficiency in their service.

Why is this a big deal? Imagine that you are driving and want to look for gas stations. Don’t you want to know which is the closest one? The fundamental operation in location-based services involves proximity determination. And, when you are using Google Maps on your phone, correct ordering of query results is very important. If you still don’t believe me, take out your phone and start a Google Maps App. Search for “Gas Station”. Do you see an option to sort by distance on your Google Maps App?  We are using Google Maps 9.12.1 and we do not find that option!

If you are wonderiing why Google would miss providing a capability as fundamental as proximity search, the only answer that we think of  is that it is because it is HARD! Computing nearest neighbors in a road network  at scale is very challenging  without a technology like ours. Here is our take down of  local search with Google Maps.  The screenshot below represents the result of a simple query to find the Moroccan restaurants near Broadway and Grand Streets in Bayonne, NJ circa 2010

fig-google-maps-spatial-distance

Google Maps (Circa 2010. As an aside, W Grant St no longer exists in Bayonne, NJ in Google Maps). The red boxes are the distance labels provided by Google Maps for each of the restaurants that it retrieved which is clearly the crow-flying distance. The white boxes are the actual road network distances that we computed for each of the resultin restaurants. The green boxes correspond to the actual order of the restaurants when we sorted them by the road network distance. As you can easily see,  the order is off.

Until recently. Google Maps used to look like the above screenshot. One could search for the nearest restaurants near an address. In this case, we searched for “Moroccan restaurants” from an address in Bayonne, NJ which is a small city that lies across the Hudson river,  a stone throw distance away from NYC. The  Google Maps search  would provide a result where its constituent objects are ordered by the crow-flying distance from the query location. The earlier version of Google Maps was explicit (using the direction labels NE, E, SE etc.) in admitting that the restaurants were sorted by the crow-flying distance. The red boxes are the distance values  as displayed by Google (which we zoomed in and enlarged so that one can read them clearly) . It is not surprising that ordering the restaurants by the crow-flying l distance provides an incorrect ordering when we found their actual network distances and reordered them. In the screenshot above, we annotate each restaurant by its real position in the ordering in terms its actual road network distance using the white boxes, and its position in this ordering using the green boxes.  We can see that the discrepancy between using the different distances is really high.  For example, (A is the closest using a crow-flying distance ordering while it is 5th in the road network ordering.

Recently, Google maps became a bit “sneaky” about how they order the restaurants.   We say sneaky because they go to great lengths to hide the fact that they do not (cannot do) real road distance ordering.  You might ask why? Of course, it is expensive to calculate so many distance values.  First they dropped the sort-by-distance feature from the user interface (UI). Presently, you can order the restaurants by their rating as well as by which ones are still open.  As of  June 8th, 2016, the UI looks as like this.You should also try this query on the Google Maps App that runs on your smartphone.

GoogleMaps-June-8th-2016

While the option to sort restaurants by distance disappeared from the UI, the Google Place API still has this option. We found it rather strange that  distance is a less important measure than price, rating, and hours open. Here is the relevant section corresponding to sorting results by distance. https://developers.google.com/places/web-service/search#PlaceSearchRequests . Note that the description is vague in terms of what kind of distance is.used.   Is it crow-flying or or is it actual road network distance?

rankby — Specifies the order in which results are listed. Possible values are:

  • prominence (default). This option sorts results based on their importance. Ranking will favor prominent places within the specified area. Prominence can be affected by a place’s ranking in Google’s index, global popularity, and other factors.
  • distance. This option biases search results in ascending order by their distance from the specified location. When distance is specified, one or more of keyword, name, ortype is required.

We thought that maybe if we looked at the API result, it would shed more light on this matter. Hence we used the API function to obtain the nearest neighbor using the following command:

wget ‘https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=40.745554,-74.024788&type=restaurant&name=Moroccan&rankby=distance&key=API_KEY‘ -O – > moroccan.txt

The result of this query is a JSON object and and the attributes of the first result are given below. One item worhty of noting here is that the distance field is conveniently missing so there is no way to really tell how the restaurants are ordered! In other words, Google  gives the result as an ordered set of restaurants but they dol not tell us what is the distance to the restaurant. How strange and WHY?

{
         "geometry" : {
            "location" : {
               "lat" : 40.7382731,
               "lng" : -74.0094302
            }
         },
         "icon" : "https://maps.gstatic.com/mapfiles/place_api/icons/restaurant-71.png",
         "id" : "2101251f9292829b759665d48357b41f929daabb",
         "name" : "Cafe Gitane",
         "opening_hours" : {
            "open_now" : true,
            "weekday_text" : []
         },
         "photos" : [
            {
               "height" : 3024,
               "html_attributions" : [
                  "\u003ca href=\"https://maps.google.com/maps/contrib/108258590706343440244/photos\"\u003eEnrico Zanardo\u003c/a\u003e"
               ],
               "photo_reference" : "CoQBdwAAABnNJSzjdBdNz5MLXV3gszFAmRr5f1LckNgtPkl5vjO8vL5E0hbvnNXrAexdecYmfjkYboawO-m4fv9tp9LWrbkKT8_RBKFp8cnosLM1A1EB-Zghpi5E8Tcf1qSKoV20bWQPbnQTI13UXvi5WrAenMWzE5_Mx3fShT0WcHwnQdoREhDTXOvArQC2uUoH1BrFuA8GGhRKfpGuPJJ1D8F7bACVfIFvcy2dXw",
               "width" : 4032
            }
         ],
         "place_id" : "ChIJ7eAGYupZwokRzdxbaFHhXJM",
         "price_level" : 2,
         "rating" : 4.2,
         "reference" : "CmRfAAAAyrYukNSqOB98TSfaRxrKGsdLht8horRYBUJrn7_BORQ-PoN9LXidlWGRqj3Eid4CqlNkX-vFUrzj3Xf8zgwAaEjWgkPsXhsEKQjPUYE8HEi_NVMvRvtHCVUpHcmLVkuEEhBkSnbEJeG2rnba9nJVY_G1GhTQveenlP24MwmVtPMQTAjeSyjHxA",
         "scope" : "GOOGLE",
         "types" : [ "cafe", "restaurant", "food", "point_of_interest", "establishment" ],
         "vicinity" : "113 Jane Street, New York"
      },

 

We are now in a real dilemma.  Looking at the output we have no idea what ordering Google used since they don’t provide a distance field. How can we prove or disprove anything now by simply looking at the output? To give you an idea, can we look at the result and deduce what ordering they use? We started with two candidates sort orders that we could think of, actual road network distance (as computed by Google) and crow-flying distance (computed with a standard closed form equation using the query point and the restaurant location). To examine what ordering Google used, we deployed the following methodology.

  • Make a query for Moroccan restaurants from a random query point near and around Jersey City, NJ
  • Obtain the nearest restaurants using the Google API, using the rankby distance feature
  • For each neighbor obtain the actual network distance from Google Maps and compute the crow-flying distance
  • Check if the result is a monotonically increasing ordering of the restaurants by their road network distances
  • Or check if the result is an ordering of the restaurants by their crow-flying distances.

We made 100 random queries around Jersey City, NJ. Here are some results.

  1. All 100 results were found to be sorted by the crow-flying distance. We are now confident that Google Maps indeed uses crow-flying distance.
  2. Only 12 of these 100 were sorted by the actual road network distance, while 88 of the results did not conform to the road network distance. In other words, there was a result i and i+1  such the road network distance of i+1 was less than that of i.  The 12 queries that provided correct road network distance ordering simply corresponded to the cases where the crow-flying distance ordering was the same as the road network distance ordering.

Our analysis has shown without any doubt that Google Maps search does not use actual road network distance.  Instead it simply provides a crow-flying distance ordering. However, there are now more unanswered questions than answers.

  1. Why is Google not able to provide the actual road network distance result but resorts to crow-flying distance? A company with Google’s resources should be able to provide road network distance or trip time. Are you not surprised by that? One would expect that they would even take current traffic into account.
  2. Why does this feel like Google Maps is silently pushing this whole deficiency under the rug here? Why not be more forthright about what sort ordering they use?

So there you have it. Google does not give you restaurants sorted by road network distance.

The code for running the test is here

 

#! /usr/bin/perl

use Geo::Distance::Google;
use JSON::XS;
use Data::Dumper qw/Dumper/;

my $geo = Geo::Distance::Google->new; 

# Start somewhere in Jersey City, NJ
my $lat = 40.745554;
my $lon = -74.024788;
my $num_tests = 100;
my $key = "YOUR_API_KEY_GOES_HERE";

my $tot_roads = 0;
my $tot_spatial = 0;
my $pi = atan2(1,1) * 4;

# Do a num_tests rounds and check if the crow-flying or road distance ordering holds?
for (my $i = 0; $i < $num_tests; ++$i) {
  my $road_failed = 0;
  my $spatial_failed = 0;

  # Jiggle the point a bit and try the query
  my $sign_lat = 1;
  $sign_lat = -1 if (rand() < 0.5);
  my $sign_lon = 1;
  $sign_lon = -1 if (rand() < 0.5);   $lat += $sign_lat * rand() / 30;    $lon += $sign_lon * rand() / 30;    # Easier than using a API..   my $text = `wget "https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=$lat,$lon&type=restaurant&name=Moroccan&rankby=distance&key=$key" -O -`;   my $json = JSON::XS->new->decode($text) or die $!;
  # print Dumper($json);

  my $count = 0;
  my $road_dist = undef;
  my $spatial_dist = undef;
  my $oroad_dist;
  my $ospatial_dist;

  # Process the JSON output
  foreach my $result (@{$json->{results}}) {
      $count++;
      my $rest_lat = $result->{geometry}->{location}->{lat};
      my $rest_lon = $result->{geometry}->{location}->{lng};

      my $distance = $geo->distance(
         origins      => "$lat, $lon",
         destinations => "$rest_lat, $rest_lon",
      );

      my $road_dist = $distance->[0]->{destinations}->[0]->{distance}->{text};
      my $spatial_dist = distance($lat, $lon, $rest_lat, $rest_lon, "K");

      # See if the result follows a monotically increasing order of network distance
      if ($count > 1 && $oroad_dist > $road_dist) {
           $road_failed = 1;
      }

      # See if the result follows a monotically increasing order of crow-flying distance
      if ($count > 1 && $ospatial_dist > $spatial_dist) {
          $spatial_failed = 1;
      }

      $oroad_dist = $road_dist;
      $ospatial_dist = $spatial_dist;

      # Print out all the details of the restaurants and the two distances
      print $count, " ", $rest_lat, " ", $rest_lon, " ", $road_dist, " ", $spatial_dist."KM", "\n";
  }

  print "Did Road Ordering Fail? ", $road_failed, "\n";
  print "Did Crow-flying ordering Fail?", $spatial_failed, "\n";

  $tot_roads += $road_failed;
  $tot_spatial += $spatial_failed;
}

print "Total Tests = ", $num_tests, "\n";
print "Total Roads failed = ", $tot_roads, "\n";
print "Total Spatial failed = ", $tot_spatial, "\n";

# Here is a bit of code borrowed from the following website to compute
# crow-flying distance.

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
#:::                                                                         :::
#:::  This routine calculates the distance between two points (given the     :::
#:::  latitude/longitude of those points). It is being used to calculate     :::
#:::  the distance between two locations using GeoDataSource(TM) products    :::
#:::                                                                         :::
#:::  Definitions:                                                           :::
#:::    South latitudes are negative, east longitudes are positive           :::
#:::                                                                         :::
#:::  Passed to function:                                                    :::
#:::    lat1, lon1 = Latitude and Longitude of point 1 (in decimal degrees)  :::
#:::    lat2, lon2 = Latitude and Longitude of point 2 (in decimal degrees)  :::
#:::    unit = the unit you desire for results                               :::
#:::           where: 'M' is statute miles (default)                         :::
#:::                  'K' is kilometers                                      :::
#:::                  'N' is nautical miles                                  :::
#:::                                                                         :::
#:::  Worldwide cities and other features databases with latitude longitude  :::
#:::  are available at http://www.geodatasource.com	                     :::
#:::                                                                         :::
#:::  For enquiries, please contact sales@geodatasource.com                  :::
#:::                                                                         :::
#:::  Official Web site: http://www.geodatasource.com                        :::
#:::                                                                         :::
#:::            GeoDataSource.com (C) All Rights Reserved 2015               :::
#:::                                                                         :::
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

sub distance {
	my ($lat1, $lon1, $lat2, $lon2, $unit) = @_;
	my $theta = $lon1 - $lon2;
	my $dist = sin(deg2rad($lat1)) * sin(deg2rad($lat2)) + cos(deg2rad($lat1)) * cos(deg2rad($lat2)) * cos(deg2rad($theta));
  $dist  = acos($dist);
  $dist = rad2deg($dist);
  $dist = $dist * 60 * 1.1515;
  if ($unit eq "K") {
  	$dist = $dist * 1.609344;
  } elsif ($unit eq "N") {
  	$dist = $dist * 0.8684;
		}
	return ($dist);
}

#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
#:::  This function get the arccos function using arctan function   :::
#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
sub acos {
	my ($rad) = @_;
	my $ret = atan2(sqrt(1 - $rad**2), $rad);
	return $ret;
}

#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
#:::  This function converts decimal degrees to radians             :::
#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
sub deg2rad {
	my ($deg) = @_;
	return ($deg * $pi / 180);
}

#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
#:::  This function converts radians to decimal degrees             :::
#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
sub rad2deg {
	my ($rad) = @_;
	return ($rad * 180 / $pi);
}

Use Case in Local Advertisement

One common use-case one encounters in location-based advertisement is that there is a set of businesses that want to place ads to customers that are nearby. These ads, that are placed on the mobile phones of customers, should first be relevant. The customers receiving the ads should be proximal to the business. The hope here is that the advertisement (that may contain a coupon) customers will incentivise them to go from from their present location to the place where the business is situated. In traditional web-based advertisement world, Click-Through Rate (CTR) is considered a metric of success which roughly corresponds to how many people click on an advertisement. In this local advertisement space, success is measured by Store Visitation Lift (SVL), which is a measure of many people actually end up going to the business and redeeming the coupon they received over the wire. Hence, local advertisement is much more directly connected to the bottomline of a business.

When sending a coupon to a customer, one has to be careful not to send a coupon for a business that a customer cannot really reach. Most existing advertisement companies use crow-flying distance as a measure of proximity of a customer from a business, but this can be very erroneous. For instance, there may be a bay or a lake between the customer’s current location and the business. In this case, even though the customer is proximal when it comes to crow-flying distance, the actual road distance could be very large. There could possibly be a toll road between the person’s current location and the business, which almost guarantees that the person would never end up using the coupon. If one is not careful, the SVL metric could be very low.

For this use-case one needs a method that can accurately compute the actual road distance or the trip time distance between customers and businesses. Another requirement is that as these ads need to be delivered at a very high rate. Hence, one needs a method that can compute road distances or trip time at a very high throughput.

In this article, we develop a simple use-case using two tables, customers and business. In this example, we will show how one can use our technology to find all customers within a fixed distance of a business.The customer table records the current position of the customer, while the business table records the positions of the businesses.

We first create a simple customer table with a id, latitude and longitude information and load synthetic data from a CSV file.

drop table customers;
create table customers (id int, lat double precision, lon double precision);
\copy customers from customers.txt with DELIMITER E'\t';

Create the necessary indices for the customer table. Note that we do not need any kind of special indices.

alter table customers add column code bigint;
update customers set code = Z2(lat, lon);
create index customer_idx on customers (id);
create index customer_code on customers(code);

Create the business table and necessary indices similar to the customer table above.

drop table business;
create table business (id int, lat double precision, lon double precision);
\copy business from business.txt with DELIMITER E'\t';
alter table business add column code bigint;
update business set code = Z2(lat, lon);
create index business_idx on business (id);
create index business_code on business(code);

We first write a simple function that takes an id of a business and a distance value. The MinMaxCode function takes these as inputs and produces two codes corresponding to the morton code of the top-left and bottom-right corner of the bounding box denoting the search region. All the customer.code values that are contained within these values will be found by scanning between these two values.

CREATE OR REPLACE FUNCTION MinMaxCode(INT, DOUBLE PRECISION)
   RETURNS TABLE (minCode BIGINT, maxCode BIGINT,
                  lat double precision, lon double precision)
    LANGUAGE sql IMMUTABLE AS
          'SELECT Z2(business .lat - $2 /110.0,
           business.lon - $2 * COS(RADIANS(business.lat))/110.0)
           as code1,
           NYZ2(business.lat + $2/110.0,
           business.lon + $2 * COS(RADIANS(business.lat))/110.0)
           as code2, lat, lon FROM business
    WHERE id = $1;';

This final function findNearest actually performs the region search around the selected business and is aided by the Btree indices on the tables.

CREATE OR REPLACE FUNCTION findNearest(INT, DOUBLE PRECISION)
   RETURNS SETOF RECORD LANGUAGE sql IMMUTABLE AS
     'SELECT * FROM (
                    SELECT id, dist(customers.lat,
                           customers.lon, (foo.minmaxcode).lat,
                          (foo.minmaxcode).lon) as dist
                    FROM customers,
                        (SELECT MinMaxCode($1, $2/1000)) as foo
     WHERE code >= (foo.minmaxcode).minCode
           AND code <= (foo.minmaxcode).maxCode) AS FOO1
     WHERE dist <= $2;';

We can now quickly search for all customers within 100 meters of business id 321 in a fraction of a second. This will return a customer id and a distance in meters.

SELECT findNearest(321, 100);

One can also use nearest neighbor searches to match businesses with customers. Please look at our use-case for computing the nearest neighbors between two sets of points at a high-throughput, which is directly applicable to this use-case.

Patents

Our technology has one issued patent and a number of applications under submission. Please refer to the following patent on details on how we are able to provide a high throughput distance computation.

Patent Number: US8744770B2

One embodiment of the invention is directed to a method including constructing a path-distance oracle that provides both an intermediate vertex of a shortest path between two vertices in a spatial network and an approximate distance between the two vertices. The constructing comprises decomposing the spatial network into a set of path-coherent pairs (PCPs) that satisfy at least one predefined property.


Inventor
Hanan Samet
Jagan Sankaranarayanan
Grant date
2014-06-03

https://patents.google.com/patent/US20120136575A1

Real Estate Use Case

Buying a house is a long-term investment in the sense that a family stays in a house for an average of 13 years. The quality of the neighborhood in terms of parks, shopping malls, bookstores are important factors to consider when purchasing a home. But the most important factor is taking the commute times into account which can really make a big impact on the quality of life. From our discussions with a number of real estate companies, there is quite a bit of interest in providing potential homeowners the ability to restrict homes based on the commute distance for the potential homeowner, spouse and children. Using our technology we can easily minimize the sum of the distance from each home to office location, office location spouse and nearest school.

 

Suppose that there is a table called homes of all the available homes, office1 and office2 are the two workplace locations and schools is a table of all schools in the country.
An unoptimized version of the query would be something like this, that orders each house in the market by the sum of the commute distance for yourself, spouse and child. And this query can be performed really fast in a matter of a few milli-seconds with little effort. This query took us less than 2 minutes to write which shows how quickly such a feature can be added. Furthermore, given that the query is written in SQL the database takes care of the bulk of optimization.

SELECT home_id, sum(d1 + d2 + s1) as total_distance AS
(SELECT h.home_id,
dist(h.lat, h.lon, office1_lat, office1_lon) as d1,
dist(h.lat, h.lon, office2_lat, office2_lon) as d2,
(SELECT dist(h.lat, h.lon, schools.lat, schools.lon) as s1
FROM schools ORDER BY s1 LIMIT 1) as s1
FROM homes h) as foo
ORDER BY total_distance
LIMIT 10

Solutions for Taxi Services

taxuFare Estimator:

Many taxi companies like Uber offer fare finders where the user enters starting and ending address and the service returns an approximate fare range. Now one can do that using a single line of SQL.

-- c1 and c2 are distance, time multipliers to compute fare
SELECT dist(lat1, lon1, lat2, lon2) * c1 + 
       triptime(lat1, lon1, lat2, lon2) * c2;
-- This produces fare in dollars. Estimate could be +/- 20% 

Where are the nearest taxis?

A common feature in many taxi app is to provide a map of the nearest taxi. This can be done using just a few lines of SQL.

-- lat, lon is where a passenger is present
-- taxi(id, lat, lon, available) current taxi information
SELECT taxi.id, dist(lat, lon, taxi.lat, taxi.lon) as distance
 FROM taxi 
 WHERE taxi.available = true AND
    distance &amp;amp;amp;amp;lt;= 10 * 1000  --- 10 kms
 ORDER BY distance;

Where is my assigned taxi?

Once passenger is assigned a taxi, one wants to provide an accurate time of arrival in seconds for the customer. This also can be done using a few lines of SQL.

-- lat, lon is where a passenger is present
-- taxi(id, lat, lon, available) current taxi information
-- taxi_id is the assigned taxi
SELECT taxi.id, triptime(lat, lon, taxi.lat, taxi.lon) as seconds_to_arrival
 FROM taxi 
 WHERE taxi.id = tax_id;

Car-pooling Service:

Suppose a taxi company wants to offer a car-pooling service where the taxi picks as many passengers as possible with minimal detours. This means that frequently one needs to compute available passengers and the order in which to pick up them.

The first thing one needs to compute is a the distance matrix between all the waiting passengers. For example, if a driver has to pick up 10 passengers, we need to compute the network distance between every pair of delivery address which is then fed into an optimization algorithm. Now if one super scales the problem to 100 taxis and 1000 passengers, we can handle this query easily using your existing commodity hardware.

We can compute 1000 x 1000 distance matrix (1 million distance computations) is less than 30 seconds on a commodity server running PostgreSQL. A 100 x 100 distance matrix is just a fraction of a second.

-- waiting_passengers(id, lat, lon) is a table of passengers 
-- yet to be assigned to taxis.
SELECT A.id as id1, B.id as id2, dist(A.lat, A.lon, B.lat, B.lon) as dist
   FROM waiting_passengers as A, waiting_passengers as B;
-- Result is a distance matrix

Trajectory Analysis:

Taxi services collect the GPS trajectories of their taxis which becomes a rich source of analysis for optimization efficiency. Look at our detailed post on what can be done with large-scale trajectory archives.

Fare Scams by Drivers:

If you own a fleet of taxis and want to find if a driver is overcharging your passengers, you can compute the difference between the distance along the trajectories vs. the distance from start to destination. Look at our detailed post on what can be done with large-scale trajectory archives.

 

 

One Million Network Distances

To setup our simple benchmark of 1 million network distances, we choose 1000 hospitals and 1000 parks at random and then take their cartesian product (i.e., cross product) to create 1 million unique source and destination pairs. We store these 1 million tuples in a table called performance_dist.

CREATE TABLE performance_dist as
  SELECT foo1.code as code1, foo2.code as code2, z4(foo1.code, foo2.code) as codez4 FROM
  (SELECT * from hospitals limit 1000) as foo1,
  (SELECT * from parks limit 1000) as foo2;

Now that we have created the 1 million sources and destinations, the first variant simply applies the dist() function using a sequential scan of performance_dist.  This query takes 32 seconds which roughly translates to about 33k network distances/second

SELECT dist(code1, code2) from performance_dist;

Second variant uses a B-tree index on code1, code2 and takes 29 seconds, which roughly translates to about 34k distances/second. This is a little better than a sequential scan but not so much better.

CREATE INDEX performance_code1_code2 on performance_dist(code1, code2);
SELECT dist(code1, code2) from performance_dist order by code1, code2;

Third variant is aided by an index scan on codez4 and takes only 25 seconds which is about 40k network distances/second.

CREATE INDEX performance_codez4 on performance_dist(codez4);
SELECT dist(codez4) from performance_dist order by codez4;

How can we improve this further? (a) Really we can be running several concurrent queries per second since PostgreSQL uses one thread for each of these queries (b) We use an xlarge RDS machine in Amazon AWS but should be lot faster on a in-house PostgreSQL installation. Please let us know what kinds of throughput you see on your in-house machine.