At a recent (as of time of writing, Winter 2005) Sydney Linux User Group meeting I couldn't help noticing how many people were completely sure that PostgreSQL was unsuitable for a geographic informations system database despite having never actually made any attempt to give it a try. Presumably I must have got drunk the night before the "PostgreSQL Sux" class in database theory and slept through that morning. Well, having no strong theoretical basis to fall back on, and being too ignorant to be sure that it couldn't possibly work, I figured, "Why not see what reality has to say on this issue?".
I probably should have a kick-arse database server round the house for these tests but every time I think to buy one, something better is just around the corner. I used my laptop for the test, here's the specs:
| CPU | Celeron Mobile 2.5GHz |
|---|---|
| Level 1 cache | 256 kB |
| Main RAM | 256 MB |
| Hard Drive | Toshiba MK3021GAS (4200 RPM, 12 ms seek) |
| IDE interface | Intel Corp. 82801DBM (ICH4-M) |
| Kernel | Linux version 2.6.10-1.766_FC3.root |
| Database | PostgreSQL 7.4.6 |
| Filesystem | ext3 |
| Time to md5sum 1073741824 byte file | 63.405s (16934655 bytes/s) |
| Time to copy 1073741824 byte file | 190.879s (5625248 bytes/s) |
Sure, there are better system out there, but this is about the index algorithm which should make best use of the system it gets given.
This is a tricky one. The nature of the test data has a substantial effect on the R-Tree index performance (unlike a B-Tree which is almost unaffected by the data). In some ways, this is a limitation of the R-Tree algorithm, in other ways, if it works for "real world" data sets then that's all that matters. The index is working on "box" data (where each box consists of two points on diagonally opposite corners) and if a large number of boxes are pretty much stacked on top of each other then the index can't distinguish between them in any meaningful manner. The result is pretty much the same as no index at all.
I wanted to achieve a mix of box sizes from small through to large and also to scatter those boxes around. Here's my generation algorithm (in perl):
$scale = exp( rand( 14 ));
$x = rand( 1000000 );
$y = rand( 1000000 );
$dx = rand( $scale ) + 1;
$dy = rand( $scale ) + 1;
$x -= $dx / 2;
$y -= $dy / 2;
$x1 = $dx + $x;
$y1 = $dy + $y;
$x = POSIX::floor( $x + 0.5 );
$y = POSIX::floor( $y + 0.5 );
$x1 = POSIX::floor( $x1 + 0.5 );
$y1 = POSIX::floor( $y1 + 0.5 );
# Final box is: (($x $y) ($x1 $y1))
Note that there are some arbitrary factors in there, for one, everything is converted to an integer. That just makes it a bit easier to read, no real importance there. The important ones are the (x,y) uniform distribution over 1 million by 1 million and the comparable upper limit on the scale factor which is exp(14) or about 1.2 million. This algorithm results in a "power law" type of distribution with mostly small boxes ranging down to a handful of large boxes.
This is pretty simple, just about the most simple thing you could make that might be useful. Here is the schema:
CREATE TABLE landmark (
name character varying(255) NOT NULL,
loc box NOT NULL,
description character varying(255) NOT NULL
);
CREATE INDEX landmark_ix1 ON landmark USING btree (name);
CREATE INDEX landmark_ix2 ON landmark USING rtree (loc);
So it maps space to names and back again with the description thrown in for the ride. Actually, in my tests the name was a randomly chosen "English-like" word of approximately 10 characters while the description was always blank. Since the B-Tree doesn't much care about the data that it is indexing, the exact contents of the name shouldn't matter.
Using the ? notation for the data to be filled in, the queries were:
SELECT COUNT(*) FROM landmark WHERE ?::box && loc; SELECT COUNT(*) FROM landmark WHERE ?::box ~ loc;
Checking the "explain" on these showed that they always used the R-Tree index (which seems obvious really, but it never hurts to check). The box for the query was generated by the same method as the box for the inserted test data.
Two tests were done, one with the "overlap" operator (called && in PostgreSQL) and one with the "contains" operator (called ~ in PostgreSQL). At the large end, both operators are pretty similar in performance, at the small end, the containment operator frequently gets no results at all (which can't be graphed on my log-log scales) and in general the containment operator gets less results than the "overlap" operator. All told, the timings for the two operators are fairly close.
I pushed the dataset size up to 1/2 million records which is probably enough for a rough street map of Sydney. Something like 2 million would give a reasonably good street map of Sydney, 5 million would be quite a detailed map. I had a peek in the back of the street directory and found around about 50000 streets in Sydney so my test data would be enough for 10 key regions per street (on average, some streets are longer than others). If this were increased to 50 key regions per street plus another 50 non-street landmarks (schools, parks, internet cafes, etc), we would have quite a nice map indeed.
My laptop was able to achieve better than 200ms query response for most of the small to medium queries (and the large queries we would expect to be slower), and this was on a 1/2 million record database. Something considerably faster is going to be required to handle the 5 million record database. By the looks of the trend, a 5 million record database would be giving query results in under 10s on my laptop. Not good enough for drawing a map in real time but not ridiculously out of the ballpark either.
I noticed from a bit of trial and error that the famous "VACUUM ANALYZE" command can make a difference to the R-Tree index building process. The best effect is to load small random selection of records into the table, then run "VACUUM ANALYZE" then load a bunch more and run "VACUUM ANALYZE" again. It seems that part of the vacuum process will also be annotating the index and thus improve future inserts.
This is just based on gut feeling here, I didn't rebuild the test data over and over again to try different approaches (e.g. I didn't try loading the data into an unindexed table and adding the index afterwards for comparison).