City Distance Oracles (CDO)

We presented  “CDO: Extremely High-Throughput Road Distance Computations on City Road Networks (Best Demo Award)” in SIGSPATIAL 2016. We demonstrate a solution, termed City Distance Oracles (CDO), using our previously developed -distance oracle to achieve as many as 7 million shortest distance computations per second per commodity machine on a city road network, i.e., 10K × 10K origin-distance (OD) matrix can be finished in 14 seconds.

The reaction of a number of companies that make use of such spatial analytic queries was that typical queries are concentrated in a small local region rather than the whole continental region, termed the spatial concentration property. As an example of such a use-case is a delivery company that needs to plan the delivery route for each truck every day, where the route of each truck must be restricted into a local region, i.e., the region near to the package warehouse. In particular, each such warehouse handles 1, 000 to 10, 000 packages per day, and each truck can deliver a maximum of 150 packages per route. In order to efficiently assign the packages to trucks and plan routes, the delivery company computes a distance matrix that captures the distance between every pair of destination locations of the packages, This is a common spatial analytic query which makes between 1 million and 100 million distance computations. Here, the spatial concentration property means that in the general case, all destination locations of packages must be in proximity to the warehouse, say within 100KM.

Job Opportunities:

application.png

We use the LEHD dataset to obtain the job locations around the Bay Area. The workload shows the nearby job opportunities (e.g., within 10 kms) for each census block in the Bay Area, requiring 120 million distance computations, where CDO finished it within 18 seconds.

 

Google Maps Does Not Do Real Proximity Search!

Our technology can perform millions of nearest road distance neighbor queries on a road network. People often counter our claim by telling us that “Google Maps already does this!”. The short answer, is: No, Google does not do it!  In this post, we dissect what Google Maps can and cannot do when it comes to finding the nearest object to a query point. You may be surprised to learn that Google Maps uses crow-flying distance (i.e., it assumes that you can fly over buildings and lakes). What is even worse is that Google Maps quietly buries this rather obvious deficiency in their service.

Why is this a big deal? Imagine that you are driving and want to look for gas stations. Don’t you want to know which is the closest one? The fundamental operation in location-based services involves proximity determination. And, when you are using Google Maps on your phone, correct ordering of query results is very important. If you still don’t believe me, take out your phone and start a Google Maps App. Search for “Gas Station”. Do you see an option to sort by distance on your Google Maps App?  We are using Google Maps 9.12.1 and we do not find that option!

If you are wonderiing why Google would miss providing a capability as fundamental as proximity search, the only answer that we think of  is that it is because it is HARD! Computing nearest neighbors in a road network  at scale is very challenging  without a technology like ours. Here is our take down of  local search with Google Maps.  The screenshot below represents the result of a simple query to find the Moroccan restaurants near Broadway and Grand Streets in Bayonne, NJ circa 2010

fig-google-maps-spatial-distance

Google Maps (Circa 2010. As an aside, W Grant St no longer exists in Bayonne, NJ in Google Maps). The red boxes are the distance labels provided by Google Maps for each of the restaurants that it retrieved which is clearly the crow-flying distance. The white boxes are the actual road network distances that we computed for each of the resultin restaurants. The green boxes correspond to the actual order of the restaurants when we sorted them by the road network distance. As you can easily see,  the order is off.

Until recently. Google Maps used to look like the above screenshot. One could search for the nearest restaurants near an address. In this case, we searched for “Moroccan restaurants” from an address in Bayonne, NJ which is a small city that lies across the Hudson river,  a stone throw distance away from NYC. The  Google Maps search  would provide a result where its constituent objects are ordered by the crow-flying distance from the query location. The earlier version of Google Maps was explicit (using the direction labels NE, E, SE etc.) in admitting that the restaurants were sorted by the crow-flying distance. The red boxes are the distance values  as displayed by Google (which we zoomed in and enlarged so that one can read them clearly) . It is not surprising that ordering the restaurants by the crow-flying l distance provides an incorrect ordering when we found their actual network distances and reordered them. In the screenshot above, we annotate each restaurant by its real position in the ordering in terms its actual road network distance using the white boxes, and its position in this ordering using the green boxes.  We can see that the discrepancy between using the different distances is really high.  For example, (A is the closest using a crow-flying distance ordering while it is 5th in the road network ordering.

Recently, Google maps became a bit “sneaky” about how they order the restaurants.   We say sneaky because they go to great lengths to hide the fact that they do not (cannot do) real road distance ordering.  You might ask why? Of course, it is expensive to calculate so many distance values.  First they dropped the sort-by-distance feature from the user interface (UI). Presently, you can order the restaurants by their rating as well as by which ones are still open.  As of  June 8th, 2016, the UI looks as like this.You should also try this query on the Google Maps App that runs on your smartphone.

GoogleMaps-June-8th-2016

While the option to sort restaurants by distance disappeared from the UI, the Google Place API still has this option. We found it rather strange that  distance is a less important measure than price, rating, and hours open. Here is the relevant section corresponding to sorting results by distance. https://developers.google.com/places/web-service/search#PlaceSearchRequests . Note that the description is vague in terms of what kind of distance is.used.   Is it crow-flying or or is it actual road network distance?

rankby — Specifies the order in which results are listed. Possible values are:

  • prominence (default). This option sorts results based on their importance. Ranking will favor prominent places within the specified area. Prominence can be affected by a place’s ranking in Google’s index, global popularity, and other factors.
  • distance. This option biases search results in ascending order by their distance from the specified location. When distance is specified, one or more of keyword, name, ortype is required.

We thought that maybe if we looked at the API result, it would shed more light on this matter. Hence we used the API function to obtain the nearest neighbor using the following command:

wget ‘https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=40.745554,-74.024788&type=restaurant&name=Moroccan&rankby=distance&key=API_KEY‘ -O – > moroccan.txt

The result of this query is a JSON object and and the attributes of the first result are given below. One item worhty of noting here is that the distance field is conveniently missing so there is no way to really tell how the restaurants are ordered! In other words, Google  gives the result as an ordered set of restaurants but they dol not tell us what is the distance to the restaurant. How strange and WHY?

{
         "geometry" : {
            "location" : {
               "lat" : 40.7382731,
               "lng" : -74.0094302
            }
         },
         "icon" : "https://maps.gstatic.com/mapfiles/place_api/icons/restaurant-71.png",
         "id" : "2101251f9292829b759665d48357b41f929daabb",
         "name" : "Cafe Gitane",
         "opening_hours" : {
            "open_now" : true,
            "weekday_text" : []
         },
         "photos" : [
            {
               "height" : 3024,
               "html_attributions" : [
                  "\u003ca href=\"https://maps.google.com/maps/contrib/108258590706343440244/photos\"\u003eEnrico Zanardo\u003c/a\u003e"
               ],
               "photo_reference" : "CoQBdwAAABnNJSzjdBdNz5MLXV3gszFAmRr5f1LckNgtPkl5vjO8vL5E0hbvnNXrAexdecYmfjkYboawO-m4fv9tp9LWrbkKT8_RBKFp8cnosLM1A1EB-Zghpi5E8Tcf1qSKoV20bWQPbnQTI13UXvi5WrAenMWzE5_Mx3fShT0WcHwnQdoREhDTXOvArQC2uUoH1BrFuA8GGhRKfpGuPJJ1D8F7bACVfIFvcy2dXw",
               "width" : 4032
            }
         ],
         "place_id" : "ChIJ7eAGYupZwokRzdxbaFHhXJM",
         "price_level" : 2,
         "rating" : 4.2,
         "reference" : "CmRfAAAAyrYukNSqOB98TSfaRxrKGsdLht8horRYBUJrn7_BORQ-PoN9LXidlWGRqj3Eid4CqlNkX-vFUrzj3Xf8zgwAaEjWgkPsXhsEKQjPUYE8HEi_NVMvRvtHCVUpHcmLVkuEEhBkSnbEJeG2rnba9nJVY_G1GhTQveenlP24MwmVtPMQTAjeSyjHxA",
         "scope" : "GOOGLE",
         "types" : [ "cafe", "restaurant", "food", "point_of_interest", "establishment" ],
         "vicinity" : "113 Jane Street, New York"
      },

 

We are now in a real dilemma.  Looking at the output we have no idea what ordering Google used since they don’t provide a distance field. How can we prove or disprove anything now by simply looking at the output? To give you an idea, can we look at the result and deduce what ordering they use? We started with two candidates sort orders that we could think of, actual road network distance (as computed by Google) and crow-flying distance (computed with a standard closed form equation using the query point and the restaurant location). To examine what ordering Google used, we deployed the following methodology.

  • Make a query for Moroccan restaurants from a random query point near and around Jersey City, NJ
  • Obtain the nearest restaurants using the Google API, using the rankby distance feature
  • For each neighbor obtain the actual network distance from Google Maps and compute the crow-flying distance
  • Check if the result is a monotonically increasing ordering of the restaurants by their road network distances
  • Or check if the result is an ordering of the restaurants by their crow-flying distances.

We made 100 random queries around Jersey City, NJ. Here are some results.

  1. All 100 results were found to be sorted by the crow-flying distance. We are now confident that Google Maps indeed uses crow-flying distance.
  2. Only 12 of these 100 were sorted by the actual road network distance, while 88 of the results did not conform to the road network distance. In other words, there was a result i and i+1  such the road network distance of i+1 was less than that of i.  The 12 queries that provided correct road network distance ordering simply corresponded to the cases where the crow-flying distance ordering was the same as the road network distance ordering.

Our analysis has shown without any doubt that Google Maps search does not use actual road network distance.  Instead it simply provides a crow-flying distance ordering. However, there are now more unanswered questions than answers.

  1. Why is Google not able to provide the actual road network distance result but resorts to crow-flying distance? A company with Google’s resources should be able to provide road network distance or trip time. Are you not surprised by that? One would expect that they would even take current traffic into account.
  2. Why does this feel like Google Maps is silently pushing this whole deficiency under the rug here? Why not be more forthright about what sort ordering they use?

So there you have it. Google does not give you restaurants sorted by road network distance.

The code for running the test is here

 

#! /usr/bin/perl

use Geo::Distance::Google;
use JSON::XS;
use Data::Dumper qw/Dumper/;

my $geo = Geo::Distance::Google->new; 

# Start somewhere in Jersey City, NJ
my $lat = 40.745554;
my $lon = -74.024788;
my $num_tests = 100;
my $key = "YOUR_API_KEY_GOES_HERE";

my $tot_roads = 0;
my $tot_spatial = 0;
my $pi = atan2(1,1) * 4;

# Do a num_tests rounds and check if the crow-flying or road distance ordering holds?
for (my $i = 0; $i < $num_tests; ++$i) {
  my $road_failed = 0;
  my $spatial_failed = 0;

  # Jiggle the point a bit and try the query
  my $sign_lat = 1;
  $sign_lat = -1 if (rand() < 0.5);
  my $sign_lon = 1;
  $sign_lon = -1 if (rand() < 0.5);   $lat += $sign_lat * rand() / 30;    $lon += $sign_lon * rand() / 30;    # Easier than using a API..   my $text = `wget "https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=$lat,$lon&type=restaurant&name=Moroccan&rankby=distance&key=$key" -O -`;   my $json = JSON::XS->new->decode($text) or die $!;
  # print Dumper($json);

  my $count = 0;
  my $road_dist = undef;
  my $spatial_dist = undef;
  my $oroad_dist;
  my $ospatial_dist;

  # Process the JSON output
  foreach my $result (@{$json->{results}}) {
      $count++;
      my $rest_lat = $result->{geometry}->{location}->{lat};
      my $rest_lon = $result->{geometry}->{location}->{lng};

      my $distance = $geo->distance(
         origins      => "$lat, $lon",
         destinations => "$rest_lat, $rest_lon",
      );

      my $road_dist = $distance->[0]->{destinations}->[0]->{distance}->{text};
      my $spatial_dist = distance($lat, $lon, $rest_lat, $rest_lon, "K");

      # See if the result follows a monotically increasing order of network distance
      if ($count > 1 && $oroad_dist > $road_dist) {
           $road_failed = 1;
      }

      # See if the result follows a monotically increasing order of crow-flying distance
      if ($count > 1 && $ospatial_dist > $spatial_dist) {
          $spatial_failed = 1;
      }

      $oroad_dist = $road_dist;
      $ospatial_dist = $spatial_dist;

      # Print out all the details of the restaurants and the two distances
      print $count, " ", $rest_lat, " ", $rest_lon, " ", $road_dist, " ", $spatial_dist."KM", "\n";
  }

  print "Did Road Ordering Fail? ", $road_failed, "\n";
  print "Did Crow-flying ordering Fail?", $spatial_failed, "\n";

  $tot_roads += $road_failed;
  $tot_spatial += $spatial_failed;
}

print "Total Tests = ", $num_tests, "\n";
print "Total Roads failed = ", $tot_roads, "\n";
print "Total Spatial failed = ", $tot_spatial, "\n";

# Here is a bit of code borrowed from the following website to compute
# crow-flying distance.

#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
#:::                                                                         :::
#:::  This routine calculates the distance between two points (given the     :::
#:::  latitude/longitude of those points). It is being used to calculate     :::
#:::  the distance between two locations using GeoDataSource(TM) products    :::
#:::                                                                         :::
#:::  Definitions:                                                           :::
#:::    South latitudes are negative, east longitudes are positive           :::
#:::                                                                         :::
#:::  Passed to function:                                                    :::
#:::    lat1, lon1 = Latitude and Longitude of point 1 (in decimal degrees)  :::
#:::    lat2, lon2 = Latitude and Longitude of point 2 (in decimal degrees)  :::
#:::    unit = the unit you desire for results                               :::
#:::           where: 'M' is statute miles (default)                         :::
#:::                  'K' is kilometers                                      :::
#:::                  'N' is nautical miles                                  :::
#:::                                                                         :::
#:::  Worldwide cities and other features databases with latitude longitude  :::
#:::  are available at http://www.geodatasource.com	                     :::
#:::                                                                         :::
#:::  For enquiries, please contact sales@geodatasource.com                  :::
#:::                                                                         :::
#:::  Official Web site: http://www.geodatasource.com                        :::
#:::                                                                         :::
#:::            GeoDataSource.com (C) All Rights Reserved 2015               :::
#:::                                                                         :::
#:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

sub distance {
	my ($lat1, $lon1, $lat2, $lon2, $unit) = @_;
	my $theta = $lon1 - $lon2;
	my $dist = sin(deg2rad($lat1)) * sin(deg2rad($lat2)) + cos(deg2rad($lat1)) * cos(deg2rad($lat2)) * cos(deg2rad($theta));
  $dist  = acos($dist);
  $dist = rad2deg($dist);
  $dist = $dist * 60 * 1.1515;
  if ($unit eq "K") {
  	$dist = $dist * 1.609344;
  } elsif ($unit eq "N") {
  	$dist = $dist * 0.8684;
		}
	return ($dist);
}

#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
#:::  This function get the arccos function using arctan function   :::
#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
sub acos {
	my ($rad) = @_;
	my $ret = atan2(sqrt(1 - $rad**2), $rad);
	return $ret;
}

#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
#:::  This function converts decimal degrees to radians             :::
#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
sub deg2rad {
	my ($deg) = @_;
	return ($deg * $pi / 180);
}

#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
#:::  This function converts radians to decimal degrees             :::
#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
sub rad2deg {
	my ($rad) = @_;
	return ($rad * 180 / $pi);
}

Use Case in Local Advertisement

One common use-case one encounters in location-based advertisement is that there is a set of businesses that want to place ads to customers that are nearby. These ads, that are placed on the mobile phones of customers, should first be relevant. The customers receiving the ads should be proximal to the business. The hope here is that the advertisement (that may contain a coupon) customers will incentivise them to go from from their present location to the place where the business is situated. In traditional web-based advertisement world, Click-Through Rate (CTR) is considered a metric of success which roughly corresponds to how many people click on an advertisement. In this local advertisement space, success is measured by Store Visitation Lift (SVL), which is a measure of many people actually end up going to the business and redeeming the coupon they received over the wire. Hence, local advertisement is much more directly connected to the bottomline of a business.

When sending a coupon to a customer, one has to be careful not to send a coupon for a business that a customer cannot really reach. Most existing advertisement companies use crow-flying distance as a measure of proximity of a customer from a business, but this can be very erroneous. For instance, there may be a bay or a lake between the customer’s current location and the business. In this case, even though the customer is proximal when it comes to crow-flying distance, the actual road distance could be very large. There could possibly be a toll road between the person’s current location and the business, which almost guarantees that the person would never end up using the coupon. If one is not careful, the SVL metric could be very low.

For this use-case one needs a method that can accurately compute the actual road distance or the trip time distance between customers and businesses. Another requirement is that as these ads need to be delivered at a very high rate. Hence, one needs a method that can compute road distances or trip time at a very high throughput.

In this article, we develop a simple use-case using two tables, customers and business. In this example, we will show how one can use our technology to find all customers within a fixed distance of a business.The customer table records the current position of the customer, while the business table records the positions of the businesses.

We first create a simple customer table with a id, latitude and longitude information and load synthetic data from a CSV file.

drop table customers;
create table customers (id int, lat double precision, lon double precision);
\copy customers from customers.txt with DELIMITER E'\t';

Create the necessary indices for the customer table. Note that we do not need any kind of special indices.

alter table customers add column code bigint;
update customers set code = Z2(lat, lon);
create index customer_idx on customers (id);
create index customer_code on customers(code);

Create the business table and necessary indices similar to the customer table above.

drop table business;
create table business (id int, lat double precision, lon double precision);
\copy business from business.txt with DELIMITER E'\t';
alter table business add column code bigint;
update business set code = Z2(lat, lon);
create index business_idx on business (id);
create index business_code on business(code);

We first write a simple function that takes an id of a business and a distance value. The MinMaxCode function takes these as inputs and produces two codes corresponding to the morton code of the top-left and bottom-right corner of the bounding box denoting the search region. All the customer.code values that are contained within these values will be found by scanning between these two values.

CREATE OR REPLACE FUNCTION MinMaxCode(INT, DOUBLE PRECISION)
   RETURNS TABLE (minCode BIGINT, maxCode BIGINT,
                  lat double precision, lon double precision)
    LANGUAGE sql IMMUTABLE AS
          'SELECT Z2(business .lat - $2 /110.0,
           business.lon - $2 * COS(RADIANS(business.lat))/110.0)
           as code1,
           NYZ2(business.lat + $2/110.0,
           business.lon + $2 * COS(RADIANS(business.lat))/110.0)
           as code2, lat, lon FROM business
    WHERE id = $1;';

This final function findNearest actually performs the region search around the selected business and is aided by the Btree indices on the tables.

CREATE OR REPLACE FUNCTION findNearest(INT, DOUBLE PRECISION)
   RETURNS SETOF RECORD LANGUAGE sql IMMUTABLE AS
     'SELECT * FROM (
                    SELECT id, dist(customers.lat,
                           customers.lon, (foo.minmaxcode).lat,
                          (foo.minmaxcode).lon) as dist
                    FROM customers,
                        (SELECT MinMaxCode($1, $2/1000)) as foo
     WHERE code >= (foo.minmaxcode).minCode
           AND code <= (foo.minmaxcode).maxCode) AS FOO1
     WHERE dist <= $2;';

We can now quickly search for all customers within 100 meters of business id 321 in a fraction of a second. This will return a customer id and a distance in meters.

SELECT findNearest(321, 100);

One can also use nearest neighbor searches to match businesses with customers. Please look at our use-case for computing the nearest neighbors between two sets of points at a high-throughput, which is directly applicable to this use-case.

Patents

Our technology has one issued patent and a number of applications under submission. Please refer to the following patent on details on how we are able to provide a high throughput distance computation.

Patent Number: US8744770B2

One embodiment of the invention is directed to a method including constructing a path-distance oracle that provides both an intermediate vertex of a shortest path between two vertices in a spatial network and an approximate distance between the two vertices. The constructing comprises decomposing the spatial network into a set of path-coherent pairs (PCPs) that satisfy at least one predefined property.


Inventor
Hanan Samet
Jagan Sankaranarayanan
Grant date
2014-06-03

https://patents.google.com/patent/US20120136575A1

Real Estate Use Case

Buying a house is a long-term investment in the sense that a family stays in a house for an average of 13 years. The quality of the neighborhood in terms of parks, shopping malls, bookstores are important factors to consider when purchasing a home. But the most important factor is taking the commute times into account which can really make a big impact on the quality of life. From our discussions with a number of real estate companies, there is quite a bit of interest in providing potential homeowners the ability to restrict homes based on the commute distance for the potential homeowner, spouse and children. Using our technology we can easily minimize the sum of the distance from each home to office location, office location spouse and nearest school.

 

Suppose that there is a table called homes of all the available homes, office1 and office2 are the two workplace locations and schools is a table of all schools in the country.
An unoptimized version of the query would be something like this, that orders each house in the market by the sum of the commute distance for yourself, spouse and child. And this query can be performed really fast in a matter of a few milli-seconds with little effort. This query took us less than 2 minutes to write which shows how quickly such a feature can be added. Furthermore, given that the query is written in SQL the database takes care of the bulk of optimization.

SELECT home_id, sum(d1 + d2 + s1) as total_distance AS
(SELECT h.home_id,
dist(h.lat, h.lon, office1_lat, office1_lon) as d1,
dist(h.lat, h.lon, office2_lat, office2_lon) as d2,
(SELECT dist(h.lat, h.lon, schools.lat, schools.lon) as s1
FROM schools ORDER BY s1 LIMIT 1) as s1
FROM homes h) as foo
ORDER BY total_distance
LIMIT 10

Solutions for Taxi Services

taxuFare Estimator:

Many taxi companies like Uber offer fare finders where the user enters starting and ending address and the service returns an approximate fare range. Now one can do that using a single line of SQL.

-- c1 and c2 are distance, time multipliers to compute fare
SELECT dist(lat1, lon1, lat2, lon2) * c1 + 
       triptime(lat1, lon1, lat2, lon2) * c2;
-- This produces fare in dollars. Estimate could be +/- 20% 

Where are the nearest taxis?

A common feature in many taxi app is to provide a map of the nearest taxi. This can be done using just a few lines of SQL.

-- lat, lon is where a passenger is present
-- taxi(id, lat, lon, available) current taxi information
SELECT taxi.id, dist(lat, lon, taxi.lat, taxi.lon) as distance
 FROM taxi 
 WHERE taxi.available = true AND
    distance &amp;amp;amp;amp;lt;= 10 * 1000  --- 10 kms
 ORDER BY distance;

Where is my assigned taxi?

Once passenger is assigned a taxi, one wants to provide an accurate time of arrival in seconds for the customer. This also can be done using a few lines of SQL.

-- lat, lon is where a passenger is present
-- taxi(id, lat, lon, available) current taxi information
-- taxi_id is the assigned taxi
SELECT taxi.id, triptime(lat, lon, taxi.lat, taxi.lon) as seconds_to_arrival
 FROM taxi 
 WHERE taxi.id = tax_id;

Car-pooling Service:

Suppose a taxi company wants to offer a car-pooling service where the taxi picks as many passengers as possible with minimal detours. This means that frequently one needs to compute available passengers and the order in which to pick up them.

The first thing one needs to compute is a the distance matrix between all the waiting passengers. For example, if a driver has to pick up 10 passengers, we need to compute the network distance between every pair of delivery address which is then fed into an optimization algorithm. Now if one super scales the problem to 100 taxis and 1000 passengers, we can handle this query easily using your existing commodity hardware.

We can compute 1000 x 1000 distance matrix (1 million distance computations) is less than 30 seconds on a commodity server running PostgreSQL. A 100 x 100 distance matrix is just a fraction of a second.

-- waiting_passengers(id, lat, lon) is a table of passengers 
-- yet to be assigned to taxis.
SELECT A.id as id1, B.id as id2, dist(A.lat, A.lon, B.lat, B.lon) as dist
   FROM waiting_passengers as A, waiting_passengers as B;
-- Result is a distance matrix

Trajectory Analysis:

Taxi services collect the GPS trajectories of their taxis which becomes a rich source of analysis for optimization efficiency. Look at our detailed post on what can be done with large-scale trajectory archives.

Fare Scams by Drivers:

If you own a fleet of taxis and want to find if a driver is overcharging your passengers, you can compute the difference between the distance along the trajectories vs. the distance from start to destination. Look at our detailed post on what can be done with large-scale trajectory archives.

 

 

One Million Network Distances

To setup our simple benchmark of 1 million network distances, we choose 1000 hospitals and 1000 parks at random and then take their cartesian product (i.e., cross product) to create 1 million unique source and destination pairs. We store these 1 million tuples in a table called performance_dist.

CREATE TABLE performance_dist as
  SELECT foo1.code as code1, foo2.code as code2, z4(foo1.code, foo2.code) as codez4 FROM
  (SELECT * from hospitals limit 1000) as foo1,
  (SELECT * from parks limit 1000) as foo2;

Now that we have created the 1 million sources and destinations, the first variant simply applies the dist() function using a sequential scan of performance_dist.  This query takes 32 seconds which roughly translates to about 33k network distances/second

SELECT dist(code1, code2) from performance_dist;

Second variant uses a B-tree index on code1, code2 and takes 29 seconds, which roughly translates to about 34k distances/second. This is a little better than a sequential scan but not so much better.

CREATE INDEX performance_code1_code2 on performance_dist(code1, code2);
SELECT dist(code1, code2) from performance_dist order by code1, code2;

Third variant is aided by an index scan on codez4 and takes only 25 seconds which is about 40k network distances/second.

CREATE INDEX performance_codez4 on performance_dist(codez4);
SELECT dist(codez4) from performance_dist order by codez4;

How can we improve this further? (a) Really we can be running several concurrent queries per second since PostgreSQL uses one thread for each of these queries (b) We use an xlarge RDS machine in Amazon AWS but should be lot faster on a in-house PostgreSQL installation. Please let us know what kinds of throughput you see on your in-house machine.

Exact K-Nearest Neighbors

In this post we show how easy it is to find the exact K-nearest neighbors for a large number of points on a road network. In this example we use a GiST spatial in Postgres. This example can also be done using without a GiSTindex. Our dataset consists of 49,573 fast food restaurants obtained from http://fastfoodmaps.com/data.html.

Dataset

The schema of the relation and the index on it are as follows:

CREATE TABLE restz4
(
  id integer,
  lat double precision,
  lon double precision,
  code bigint
);

UPDATE restz4 set code = Z2(lat, lon);
CREATE INDEX id_index_key ON public.restz4 USING btree (id);
CREATE INDEX code_index_key ON public.restz4 USING btree (code);
CREATE INDEX lat_lon_index_key ON public.restz4 USING btree (lat, lon); 

-- Create GiST Index
ALTER TABLE restz4 ADD COLUMN gid serial PRIMARY KEY;
ALTER TABLE restz4 ADD COLUMN geom geometry(POINT,4326);
UPDATE restz4 SET geom = ST_SetSRID(ST_MakePoint(lon,lat),4326);
CREATE INDEX idx_restz4_geom ON restz4 USING GIST(geom);
CREATE INDEX idx_rest4_geom ON public.restz4 USING gist (geom);

How to compute exact neighbors?

The strategy to computing the exact K-nearest neighbors is as follows:

  • Use the GiST spatial index index to find the K-spatial nearest neighbors. Spatial distance here refers to geodesic or “crow-flying” distance.
  • For each point, look at the kth spatial neighbor and compute its network distance. Let this distance be denoted by “dist”
  • Obtain all points that lie within a distance of “dist” around the point. A quick dirty way of doing is by performing a window query around each point based on a range on the latitude and longitude. Since one 1 degree of latitude/longitude corresponds to 110kms, we simply divide dist by 110,000 and use it as the side length of a box around each point
  • Compute the actual network distance of all points within this box and choose the K closest.
  • This method is correct since the spatial distance is a lower-bound of the network distance. We are guaranteed to find the K actual network neighbors within a distance of dist.

SQL Query

This query finds the 10 neighbors for each of the roughly 50k restaurants among nearby restaurants.

SELECT foo1.id as id1, restz4.id as id2,
       dist(foo1.code, restz4.code) as networkdist
   FROM (
       SELECT * FROM (
            SELECT y.id as id, y.lat as lat, y.lon as lon, y.code as code, (
                SELECT dist(x.code, y.code) / 110000 as dist -- 110km is roughly 1 degree lat/lon
                   FROM restz4 x
                   WHERE x.gid != y.gid
                   ORDER BY x.geom &lt;-&gt; st_setsrid(y.geom, 4326)
                   OFFSET 10   -- Value of K: Find the 10 spatial neighbor
                   LIMIT 1
                )
            FROM restz4 y
      ) AS FOO
    ) AS FOO1, restz4
    WHERE
        restz4.lat between FOO1.lat - dist and FOO1.lat + dist AND
        restz4.lon between FOO1.lon - dist and FOO1.lon + dist AND
        foo1.id != restz4.id;

Performance

When k = 10, it takes 166 seconds to compute 10 neighbors for each of the nearly 50k restaurants. It makes 1.02 million network distance computations on the road network in the process of computing the nearest neighbors.

Processing GPS Trajectory (Crumbs) on Roads inside any Database

In this post, we will show how easy it is to work with GPS trajectories using our technology. GPS devices are becoming commonplace and are now deployed on many different commercial and non-commercial vehicles. A company operating a fleet of taxi usually has a GPS installed in all of their vehicles. This gives the operator an exact idea of the location of their vehicle, which is very useful. For instance, when a customer requests a ride, the  taxi operator will use the current location of a vehicle to send the nearest vehicle to the customer but this is the most obvious customer facing application of GPS. There are more complex analysis that an operator will want to perform from historical (i.e., a day/week/month/year) worth of GPS information collected from vehicles which can shed light several aspects of their businesses.

An operator with historical GPS record can ask many interesting questions from the dataset.  For instance, a taxi operator can ask the following questions:

  • What is the average taxi ride?
  • How far on average does a taxi has to travel before it gets a ride?
  • How much on average does a taxi travel per day (or month)?
  • Are my drivers taking circuitous routes to run up the meter?
  • Which taxi in my fleet travelled the most distance?
  • Is there a taxi with a broken GPS device?

In the following blog post, we show how these queries can be written in a few minutes using our technology. All the operator has to do is to store the dataset in a database (actually any database but we will use Postgres) and can then write all these queries in a few minutes.

Dataset

Our taxi Dataset is from obtained from San Francisco Yellow Can which provide the GPS information for 500 of their cabs using a publicly available website (http://cabspotting.org/api).  We use the dataset collected by these good folks at Dartmouth (http://crawdad.cs.dartmouth.edu/~crawdad/epfl/mobility/). They have collected one month worth of dataset for 500 taxis and made it available online for download.

We obtained this dataset and made a few small changes to it so that we can load it into Postgresql database. The format of each data point is as follows:

  • id => crumb_id (unique key, we added)
  • taxiid => each taxi has a unique id
  • tripid => Globally unique trip id (unique id for trip which we added)
  • lat => latitude in degrees
  • lon => longitude in degrees
  • occupancy => shows if a cab has a fare (1 = occupied, 0 = free)
  • ts => time is in UNIX epoch format when this tuple was recorded

An example tuple is as follows: [id, taxiid, tripid, lat, lon, occupancy, time], e.g.: [112133, 1, 422, 37.75134, -122.39488, 0, 1213084687]. A taxi periodically stores a GPS point on the map. One assumes that the taxi took the shortest path from one GPS crumb the next. So reconstructing the trip would involve ordering the points by their timestamp (ts) (or equivalently by their id since we assign id by increasing timestamp), obtaining the distance between successive points and adding up the distance.

We first create a taxi table with the following schema and load the dataset.

CREATE TABLE taxi (
  id        bigint,
  taxiid    int,
  tripid    bigint,
  lat       float,
  lon       float,
  occupancy int,
  ts        bigint
);

Next, we convert the latitude longitude to a code by adding an additional attribute called “code” by applying the Z2 function.

    alter table taxi add column code bigint;
    update taxi set code = z2(lat, lon);

We create a few standard B-tree indices on the table

     ALTER TABLE taxi ADD PRIMARY KEY (id);
     CREATE INDEX taxi_taxiid on taxi(taxiid);
     CREATE INDEX taxi_tripid on taxi(tripid);
     CREATE INDEX taxi_latlon on taxi(lat,lon);
     CREATE INDEX taxi_occupancy on taxi (taxiid, tripid) where occupancy = 1;
     CREATE INDEX taxi_ts ON taxi (ts);

Function to Compute Trip Distance

Trip distance is the distance of a trip taken by a customer. It involves a few steps:

  1. Extract all trajectory points of a given trajectory denoted by tripid
  2. Create an ordering of the points
  3. Compute road network distance between consecutive points
  4. Add the distances to produce the network distance of the trajectory

We now explain the SQL queries for each of these steps and final provide the final function that takes a tripid as input and produces the network distance of the corresponding trajectory.

We create a simple function called trip to extract all the points corresponding to a given tripid. It is a simple select on the taxi table.

CREATE OR REPLACE FUNCTION trip (which_trip_id bigint)
RETURNS TABLE(
    id bigint,
    taxiid int,
    tripid bigint,
    lat float,
    lon float,
    occupancy int,
    ts bigint,
    code bigint
   ) IMMUTABLE AS
$$
BEGIN
         RETURN QUERY SELECT taxi.id, taxi.taxiid, taxi.tripid, taxi.lat, taxi.lon,
                             taxi.occupancy, taxi.ts, taxi.code
         FROM taxi
         WHERE taxi.tripid = which_trip_id;
END;
$$ LANGUAGE plpgsql;

Next, we order the points such that point “id” is followed by “id +1” (one can do this without the aid of the id field by simply ordering points by increasing timestamp)

   SELECT t1.id, t1.code, t2.id, t2.code, dist(t1.code, t2.code)
     FROM trip(500) AS t1, trip(500) AS t2
   WHERE t1.id = t2.id - 1;

Then we compute the length of a trip by simply adding the network distance for each segment of the trajectory using the dist function.

           SELECT SUM(dist(t1.code, t2.code))
             FROM trip(500) AS t1, trip(500) AS t2
           WHERE t1.id = t2.id - 1;

This whole process can be packaged as a tripdist function that takes a tripid as input and produces the network distance of the trip.

CREATE OR REPLACE FUNCTION tripdist (which_trip_id bigint)
   RETURNS TABLE(tripdist float) IMMUTABLE AS
  $$
   BEGIN
     RETURN QUERY
         SELECT sum(dist(t1.code, t2.code))
           FROM trip(which_trip_id) AS t1, trip(which_trip_id) AS t2
         WHERE t1.id = t2.id - 1;
   END;
  $$ LANGUAGE plpgsql;

To create a useful function that given a tripid produces an ordered set (taxiid, tripid, occupancy, tripdist). This way a whole trajectory of all the points is collapsed to taxiid, tripid, occupancy and the network distance (tripdist).

CREATE OR REPLACE FUNCTION tripinfo (which_trip_id bigint)
   RETURNS TABLE(
      taxiid int,
      tripid bigint,
      occupancy int,
      tripdist float
   ) IMMUTABLE AS
  $$
   BEGIN
     RETURN QUERY
          SELECT taxi.taxiid, taxi.tripid, taxi.occupancy, tripdist(taxi.tripid)
               FROM taxi
          WHERE taxi.tripid = which_trip_id
          GROUP BY taxi.taxiid, taxi.tripid, taxi.occupancy;
   END;
$$ LANGUAGE plpgsql;

Queries

List of all trips made by a taxi

   SELECT tripinfo(tripid) FROM taxi where taxiid = 1;

How many KMs did each of the cars travel with passengers?

Change occupancy to 0 for the distance taxi’s travel without passengers. This query takes less than a second (about 700 ms) for taxiid = 1.

     SELECT sum(foo.tripdist)/1000 as distance_in_km FROM  -- Divide by 1000 to get KMs
        (SELECT (tripinfo(tripid)).tripdist as tripdist
          FROM taxi WHERE taxiid = 1 and occupancy = 1
          GROUP BY tripid) as foo;

Average distance they travel without passengers?

Average trip length is 7250 meters and it takes 197 seconds.

  SELECT avg(tripdist) FROM
     (SELECT tripdist(tripid) as tripdist FROM
           (SELECT tripid from taxi
              WHERE occupancy = 1
              GROUP BY tripid) as foo
     ) as foo1;

Order all of the 500 taxis the distance in Kms they covered in the month

This query computes the network road distance for all the trajectories for all the 500 taxis and then orders them in a decreasing order of distance. This query takes 295 seconds.

   SELECT taxiid, sum(tripdist)/1000 as distance_in_km FROM (
      SELECT taxiid, (tripinfo(tripid)).tripdist as tripdist
        FROM taxi
        WHERE occupancy = 1
        GROUP BY taxiid, tripid
        ) AS foo
   GROUP BY taxiid
   ORDER BY distance_in_km desc;

Looking at the result immediate reveals problems with two of the Taxis. While all the 498k taxis travelled less than 10k kms in a month, the top two taxi 518 and 494 had over 92k and 46k kms which seems incorrect.
In order to further investigate this we issue the following query.

SELECT (tp).tripid, (tp).tripdist FROM
 (SELECT tripinfo(tripid) AS tp FROM taxi WHERE taxiid = 518) AS foo
  ORDER BY (tp).tripdist DESC;

There are couple of issues with this taxi. There are a bunch of tripids with tripdist as NULL. We found two reasons for this. The GPS points were being put deep inside the pacific ocean or the trajectory only contained just one point and hence we could not compute the distance.

One trip id is interesting tripid = 894359 has a network distance of 1579301.4 meters. This is huge. We found that the reason for that is the the GPS keeps putting points that bounces between palo alto and near golden gate bridge, about 50 kms apart. The query we used is given here.

         SELECT t1.*, t2.*, dist(t1.code, t2.code) AS segdist
           FROM trip(894359) AS t1, trip(894359) AS t2
         WHERE t1.id = t2.id - 1 ORDER BY segdist desc;

This example shows how our technology can be used to get rid of erroneous GPS measurements before analytics can be performed on the trajectory.

Did my cabie take a huge detour?

To see if the cabie has been taking detours, we need to first figure out how to compute the direct distance between the first and last points in the trajectory. We define the following function, tripdist_srcdst.

CREATE OR REPLACE FUNCTION tripdist_srcdst (which_trip_id bigint)
  RETURNS TABLE(
     tripdist float
  ) IMMUTABLE AS
$$
BEGIN
    RETURN QUERY
          SELECT dist(t1.code, t2.code) FROM
              (SELECT * FROM trip(which_trip_id) as t order by (t).id LIMIT 1) as t1,
              (SELECT * FROM trip(which_trip_id) as t order by (t).id DESC LIMIT 1) as t2;
END;
$$ LANGUAGE plpgsql;

Now we can compute the difference between direct from source to dest vs. driver’s routing. For instance, we can compute the detour for tripid = 500. In this case, the direct distance 2104.6 vs. the cabbie’s route is 2325.1 meters. In other words, driver drove 200 meters more.

    SELECT tripdist_srcdst(500), tripdist(500);

This begs the question, are there some cabbies take a lot of circuitous routes. In particular, we are not interested in 200 meter detours but rather routes that are more than 5 kms longer than the direct path from source to destination. We use the following query to obtain the top 100 routes with the maximum detour. We limit the trips to those that are 50 kms.

 SELECT taxiid, tripid, tpsd, tp, (tp - tpsd) as diff FROM
        (SELECT tripdist(tripid) as tp, tripdist_srcdst(tripid) as tpsd, taxiid, tripid
           FROM taxi
           WHERE occupancy = 1
           GROUP BY taxiid, tripid
         ) as foo
     WHERE tp is not null AND tpsd is not null AND tp &lt; 50 * 1000 AND tpsd &lt; 50 * 1000
  ORDER BY diff DESC
  LIMIT 100;

We immediately realize by looking at the output of this query is that cabbies make a of
loops (the taxi makes a huge loop and comes back
to the same point. Just like airport shuttles). So which are the taxis that detour too much, what is the average detour?

  SELECT taxiid, count(*) as num_trips, avg(tp-tpsd) as average_detour FROM
      (SELECT tripdist(tripid) as tp, tripdist_srcdst(tripid) as tpsd, taxiid, tripid
         FROM taxi
         WHERE occupancy = 1
         GROUP BY taxiid, tripid
       ) as foo
   WHERE tp is not null and tpsd is not null and tp &lt; 50 *1000 and tpsd &lt; 50 * 1000 AND tp &gt;= tpsd
   GROUP BY taxiid
   ORDER BY average_detour desc;

The result of this query is interesting. Taxi 518 had 1082 trips and had a average detour of 14 Kms. But we know that this is the taxi with a buggy GPS. Now Taxi 19 made 594 trips 2.978 kms detour, which is roughly 1800 kms more than direct routes. May be this cabbie does a lot of carpool rides. Something to examine more.

Note that such a free form exploration of spatial datasets on road network is only possible since it is so simple to express queries using SQL inside a database. Imagine if for each of the query you had to program quite a bit to get answers.

Where are the nearest taxi?

Finding a taxi involves finding a taxi that is finding the closest in terms of location and also in time. In a live system the time is the current epoch timestamp. The following query finds the nearest taxi with hard coded location (37.75, -122.4) for the customer and time when the customer requested (1211840888). This query takes 145 ms.

SELECT * from (
        SELECT taxiid, last(occupancy) as taxi_occupancy,
               last(lat) as taxi_lat, last(lon) as taxi_lon,
               last(code) as taxi_code, dist(last(code), z2(37.75, -122.4)) as taxi_distance
          FROM (
               SELECT * FROM taxi WHERE
                   ts BETWEEN 1211840888 - 10 * 60 and 1211840888 -- Extract taxi's position with 10 minute tim window
               ORDER BY TS) as foo
            GROUP BY taxiid) as foo1
   WHERE foo1.taxi_occupancy = 0 -- Choose that are currently available
   ORDER BY taxi_distance -- Order by distance to where location is made
   LIMIT 10;

Want to try this demo?

Try our article here for the oracle and the datasets.