
Strategies for Massive Data Sets

 

Sending hundreds of thousands (or millions) of vector geometries through a web call carries a lot of overhead, and the result is, not surprisingly, often slow or impossible to render.
It is usually a pretty easy argument that cramming millions of vectors into a monitor isn't that useful to the poor set of eyeballs waiting to make sense of it.  But there are times when tons of data really do need to be distilled in some visual way and the 'that's just crazy' defense won't do.
We have a pile of strategies for massive data sets that are possible with Visual Fusion, and we've found that these strategies, or combinations thereof, have worked well for customers over the years…

  • ZOOM LEVEL SHOW/HIDE
  • GENERALIZATION
  • CLUSTERING
  • THEMATIC DIVISION
  • RASTER/VECTOR SWAP-OUT
  • BOUNDARY DISSOLVE
  • CENTROID PROXY
  • HEATMAPS

 

ZOOM LEVEL SHOW/HIDE

Common in desktop GIS software and pretty much any navigable digital map, this approach configures layers to simply not render at zoom levels where their level of detail is not required or is burdensome.

This is the most basic method available to avoid data-volume blowout and is the simplest to configure, as it requires little strategy and no pre-processing.

But it isn't very cool, and clients often really do need to get a visual sense of even massive data sets, so hiding them won't always do.
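
As a rough sketch of how that configuration tends to look (the layer names, zoom ranges, and config shape here are hypothetical, not Visual Fusion's actual API):

```python
# Hypothetical layer config: each layer declares the zoom range in which
# it should render, and the map simply skips it everywhere else.
LAYERS = [
    {"name": "parcels",  "min_zoom": 14, "max_zoom": 20},  # fine detail, close-in only
    {"name": "counties", "min_zoom": 6,  "max_zoom": 13},
    {"name": "states",   "min_zoom": 0,  "max_zoom": 5},   # broad view only
]

def visible_layers(zoom: int) -> list:
    """Return the names of layers that should render at this zoom level."""
    return [l["name"] for l in LAYERS if l["min_zoom"] <= zoom <= l["max_zoom"]]

print(visible_layers(4))   # ['states']; the million parcels never leave the server
print(visible_layers(15))  # ['parcels']
```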

 

GENERALIZATION

This is usually the first pass at minimizing the size of the vector data set while maintaining the visual and interaction benefits.  Fewer bends in the shape = faster.

The fully detailed perimeter of Kentucky, for example, is way more resolution than is necessary.  A good rule of thumb: if more than one bend is represented in the same monitor pixel, then it is a waste; that level of resolution exceeds the monitor's ability to render it.

The generalization algorithm used by IDV is a powerful one and can be tied to zoom level so that zooming in fetches progressively less generalized shapes: detail on demand.

However, truly massive data sets can swamp generalization and are better handled in conjunction with some of the other methods here.
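
For a concrete sense of the mechanics, here is a minimal sketch using shapely's Douglas-Peucker `simplify()` as a stand-in for the generalization step, with a made-up tolerance schedule tied to zoom level:

```python
import math
from shapely.geometry import Polygon

def generalize(polygon: Polygon, zoom: int) -> Polygon:
    # Halve the tolerance for each zoom level gained: detail on demand.
    # The 0.1 base tolerance (in map units) is illustrative only.
    tolerance = 0.1 / (2 ** zoom)
    return polygon.simplify(tolerance, preserve_topology=True)

# A dense circular "boundary": 360 vertices where a handful would do on screen.
circle = Polygon([(math.cos(math.radians(d)), math.sin(math.radians(d)))
                  for d in range(360)])
print(len(circle.exterior.coords),                 # 361 (closed ring)
      len(generalize(circle, 0).exterior.coords),  # far fewer bends zoomed out
      len(generalize(circle, 6).exterior.coords))  # more detail zoomed in
```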

 

CLUSTERING

It’s clustering.  Everybody knows what clustering is.  Plus, there is a lot to talk about here that I’ll save for another really interesting post.
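
Even so, the core mechanic is simple enough to sketch: bucket points into grid cells at the current zoom and draw one symbol per occupied cell, sized by its count.  This is a generic grid-clustering sketch, not Visual Fusion's implementation:

```python
from collections import Counter

def grid_cluster(points, cell_size):
    """Map each (x, y) point to its grid cell; return a count per cell."""
    return Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)

points = [(0.1, 0.2), (0.15, 0.22), (0.9, 0.9), (0.88, 0.91), (0.5, 0.5)]
for cell, count in grid_cluster(points, cell_size=0.25).items():
    print(f"cell {cell}: {count} feature(s)")  # one symbol per cell, not per point
```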

THEMATIC DIVISION

If a data set is massive enough to have millions of records, then it is a great candidate for thematic division.  Divide and conquer.

Thematic mapping is such a powerful tool that the performance gains from filtering the number of map elements by the data are really just a happy side effect, but a quick win.  Reducing a massive data set to a few premium-value slices that are active by default goes a long way toward easing network calls.  The user can always turn more of the slices on, but the data is sent in more manageable packets.
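
A minimal sketch of the idea, with made-up field names and "default on" choices:

```python
from collections import defaultdict

records = [
    {"id": 1, "status": "critical"}, {"id": 2, "status": "open"},
    {"id": 3, "status": "closed"},   {"id": 4, "status": "critical"},
]

# Divide the massive table into thematic slices by attribute.
slices = defaultdict(list)
for rec in records:
    slices[rec["status"]].append(rec)

DEFAULT_ON = {"critical", "open"}  # the premium-value slices
initial_payload = {k: v for k, v in slices.items() if k in DEFAULT_ON}
print({k: len(v) for k, v in initial_payload.items()})  # sent up front
# The remaining slices ("closed") are fetched only when the user toggles them on.
```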

Rendering performance, however, is still going to be a drag when all of the layers are on.  Which leads me to…

 

RASTER/VECTOR SWAP-OUT

This has been the most popular and broadly beneficial method used by IDV clients with excruciatingly ginormous data sets.

By calling the data source as a raster bitmap when the user is at broader zoom levels (particularly in a tiled method), the network overhead and client-side rendering load are dramatically reduced.

Then, at closer zoom levels where fewer polygons are in view, that layer can be rendered as interactive vectors.  The zoom level at which the layer switches from raster to vector can be fine-tuned to optimize performance.

The downside of raster bitmap layers is that you lose the immediate rollover and click interactions that vectors facilitate.  Though, when there are upwards of a million polygons on a single map, the human benefit of individual interaction is negligible; the viable use case for that volume of data is the broad, at-a-glance visualization.  And, still, zooming in on the area of interest will eventually provide an interactive vector version.

Can I see additional performance gains by pre-processing the image layer ahead of time?  Yes, but you will lose the "freshness" of thematic elements in the data, which would be baked into a stale pre-processed version.
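
Mechanically, the swap-out is just a threshold test at request time.  A sketch, with placeholder URL patterns and an arbitrary threshold zoom:

```python
VECTOR_THRESHOLD_ZOOM = 12  # tune this to balance payload vs. interactivity

def tile_request(zoom: int, x: int, y: int) -> str:
    if zoom < VECTOR_THRESHOLD_ZOOM:
        # Broad view: cheap bitmap tiles, no per-feature interaction.
        return f"/tiles/raster/{zoom}/{x}/{y}.png"
    # Close view: few enough features in frame that vectors are affordable.
    return f"/tiles/vector/{zoom}/{x}/{y}.json"

print(tile_request(5, 9, 12))   # /tiles/raster/5/9/12.png
print(tile_request(14, 9, 12))  # /tiles/vector/14/9/12.json
```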

 

BOUNDARY DISSOLVE

You can pre-generate N number of vector data layers that are geographically dissolved based upon some meaningful attribute.  The more aggregated versions are shown at broader zoom levels.  For example, when fully zoomed out, the United States is shown as fifty state polygons.  Zooming in enumerates counties, zooming in more breaks out smaller administrative districts, and so on.

Many map visualizations do this for nested boundaries as par for the course.
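
A sketch of pre-generating one dissolved level with shapely, where toy squares stand in for counties rolling up to states; this runs once, ahead of time, per aggregation level:

```python
from shapely.geometry import box
from shapely.ops import unary_union

counties = [
    {"state": "A", "geom": box(0, 0, 1, 1)},
    {"state": "A", "geom": box(1, 0, 2, 1)},  # shares an edge with the first
    {"state": "B", "geom": box(3, 0, 4, 1)},
]

# Group detailed polygons by the parent attribute, then union each group.
groups = {}
for c in counties:
    groups.setdefault(c["state"], []).append(c["geom"])
dissolved = {name: unary_union(geoms) for name, geoms in groups.items()}

for name, shape in dissolved.items():
    print(name, shape.geom_type, shape.area)  # state A becomes one 2x1 polygon
```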

 

CENTROID PROXY

The network and client-side muscle required to deliver points is much less than is the case for polygons: one coordinate per feature rather than an unholy list of coordinates.  At broader zoom levels, centroid points can represent polygons by proxy (which can also take advantage of clustering).

Centroid proxy is also the baseline of a particularly nice cartographic result if the points are given visual thematic variables like size or color or both (color shading by polygons themselves, like rates of violence per country, is visually unfair because larger countries bias the eye).  So Centroid Proxy can be a win-win-win.
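
A minimal sketch with shapely: reduce each polygon to its centroid and carry a value along for thematic sizing.  The square-root scaling, so that symbol area tracks the value, is a common cartographic choice I am assuming here, not a prescription:

```python
import math
from shapely.geometry import box

features = [
    {"name": "big country",   "geom": box(0, 0, 10, 10), "rate": 4},
    {"name": "small country", "geom": box(20, 0, 21, 1),  "rate": 16},
]

for f in features:
    c = f["geom"].centroid         # one coordinate instead of many
    radius = math.sqrt(f["rate"])  # size encodes the value without area bias
    print(f"{f['name']}: point ({c.x}, {c.y}), symbol radius {radius}")
```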

 

HEATMAPS


(Heatmap image: www.istartedsomething.com, via Danyel Fisher)

The human eye is not well suited to process millions of simultaneous anythings at an individually meaningful level.

If the data set has an attribute of a continuous, non-discrete nature, like elevation or temperature (it's a geospatial no-no to interpolate non-continuous data like population), then you can pre-generate a raster heatmap representation of the layer where each point represents a value and the pixels between them are assigned an interpolated value.

This data set can then be fetched as a tiled raster source with comparatively tremendous network and rendering performance.  Plus, heatmap is a cool buzzword and you'll be popular around the office as a cutting-edge go-getter who'll settle for nothing less than heatmaps.  Heatmaps.
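
For the pre-generation step, here is a sketch of inverse-distance-weighted interpolation over a made-up grid of temperature samples; a real pipeline would tile and cache the output:

```python
import numpy as np

samples = np.array([[10, 10], [40, 25], [25, 45]], dtype=float)  # x, y positions
values = np.array([5.0, 18.0, 11.0])                             # e.g. degrees C

size = 50
ys, xs = np.mgrid[0:size, 0:size]
grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)

# Distance from every pixel to every sample point; weight each sample by 1/d^2
# so nearby readings dominate the interpolated pixel value.
d = np.linalg.norm(grid[:, None, :] - samples[None, :, :], axis=2)
w = 1.0 / np.maximum(d, 1e-6) ** 2
heatmap = ((w @ values) / w.sum(axis=1)).reshape(size, size)

print(heatmap.shape, heatmap.min().round(2), heatmap.max().round(2))
```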

 

So, to those folks out there who want to see millions of features in an interactive map visualization, my advice is don’t.  To those who absolutely need to see millions of features in an interactive map visualization, consider one or more of the methods above.  Happy cramming!

John Nelson / IDV Solutions / john.nelson@idvsolutions.com
