
An Architecture for Real-Time Geo-tracking with Python, Celery, RabbitMQ, and More

If you are looking at a map online and someone asks what you see, you will most likely answer, “a map.”

That’s not the answer you would get from Ragi Burham. He sees the underlying geometry, vectors, styling, raster data, GUI controls, application logic, and data store. That’s because Burham is an expert in the area of mapping solutions, and his background includes over a decade of geo-related work with organizations like ESRI, NIMA, National Geographic, Navteq, TeleAtlas, and his own company, AmigoCloud.

At Pivotal, we get excited about people like Ragi Burham: an open-source-minded software engineer with a new way of doing things. At PyCon in March of 2013, Burham gave a talk titled “Realtime Tracking and Mapping of Geographic Objects using Python.” It offers another way to look at geography-based big, fast data architectures.

Below, we recap the “Normal” and “Real-time” architectures that Burham shared in his talk. These two approaches use OpenLayers, Varnish, Nginx, Gunicorn, Django, TileStache, Memcached, Mapnik, PostgreSQL, PostGIS, Socket.io, Node.js, Celery, and RabbitMQ. As in our recent post on scaling social media with Celery and RabbitMQ, Burham adds these two components, along with Socket.io and Node.js, so that his real-time architecture works in a fundamentally different way and scales.

If you have 26 minutes, you can also view the video below.


A “Normal” Geo-Based Architecture

Beyond an explanation of underlying geometry, vectors, styling, raster data, and architecture, Burham covered some of the use cases and requirements that drive what geo-based systems do: determining the amount of information on the map, omitting certain types of mapping objects, making calculations like speed or direction, alerting that something has crossed a boundary, and more.
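
To make the “speed or direction” use case concrete, here is a minimal sketch in Python. It is not taken from Burham’s talk. It assumes position fixes arrive as hypothetical `(lat, lon, unix_seconds)` tuples and uses the standard haversine and initial-bearing formulas on a spherical Earth.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing in degrees (0 = north, 90 = east)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    x = math.sin(dlmb) * math.cos(phi2)
    y = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb))
    return (math.degrees(math.atan2(x, y)) + 360) % 360


def speed_mps(fix_a, fix_b):
    """Average speed in m/s between two (lat, lon, unix_seconds) fixes."""
    dt = fix_b[2] - fix_a[2]
    if dt <= 0:
        raise ValueError("fixes must be in increasing time order")
    return haversine_m(fix_a[0], fix_a[1], fix_b[0], fix_b[1]) / dt
```

In a production stack, PostGIS can do this kind of calculation inside the database; the pure-Python version is only meant to show what is being computed.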

He began outlining his “Normal” GeoStack architecture by listing the most popular open source data sets and covering some of the pros and cons of OpenStreetMap, NAIP, SRTM, and Natural Earth Data. With the data sources covered, he filled in the details of the architecture. This included a client with OpenLayers, a JavaScript library that helps with rendering maps (another option is Leaflet). Varnish was used in front of servers with the following components: Nginx, Gunicorn, TileStache, Memcached, and Mapnik. Then, a PostgreSQL database stores the map information with the PostGIS spatial database extension.

Because rendering styled maps can take a lot of time, Varnish is used to speed things up by caching responses, and it can also be used as a reverse proxy. For those unfamiliar, TileStache serves map tiles based on rendered geographic data, and the Mapnik toolkit is used for map development and rendering. Here is a screenshot of his normal architecture:

[Figure: the “Normal” GeoStack architecture]
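
The tile-serving path above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not TileStache’s or Mapnik’s actual code: `lonlat_to_tile` uses the widely used Web Mercator “slippy map” tile scheme, the `TileCache` class and its `layer/zoom/x/y` key format are hypothetical, a plain dictionary stands in for Memcached, and the `render` callable stands in for a Mapnik render.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Map a WGS84 lon/lat to slippy-map tile (x, y) at the given zoom."""
    n = 2 ** zoom  # tiles per axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that edge coordinates (e.g. lon == 180) stay in range.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


class TileCache:
    """Cache-aside lookup: render a tile only when it is not cached yet."""

    def __init__(self, render):
        self._render = render   # stand-in for an expensive Mapnik render
        self._store = {}        # stand-in for Memcached
        self.render_calls = 0

    def get(self, layer, zoom, x, y):
        key = f"{layer}/{zoom}/{x}/{y}"  # hypothetical cache key format
        if key not in self._store:
            self.render_calls += 1
            self._store[key] = self._render(layer, zoom, x, y)
        return self._store[key]
```

The point of the sketch is the ordering: the expensive render happens once per tile, and every later request for the same tile is answered from the cache. Varnish applies the same idea one layer further out, at the HTTP level.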

The Challenges with Real-Time Geo-Based Architectures

One of the key challenges with this type of architecture arises when clients need to get state from the database. In many real-time tracking scenarios, clients end up polling for it. This becomes a problem when, for example, you have thousands of vehicles confirming state every second. Polling the database at this rate creates an overwhelming and largely unnecessary amount of traffic, and it doesn’t scale.
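
A back-of-the-envelope calculation shows why. The sketch below is illustrative only: the function name and the numbers in the usage note are hypothetical, and it makes the simplifying assumption that each state change can satisfy at most one poll, which gives an optimistic lower bound on waste.

```python
def polling_load(clients, polls_per_second, changes_per_second):
    """Return (queries per second, fraction of queries that find nothing new)."""
    queries = clients * polls_per_second
    if queries == 0:
        return 0, 0.0
    # Optimistic bound: every change is picked up by exactly one poll.
    useful = min(changes_per_second, queries)
    return queries, 1 - useful / queries
```

For example, `polling_load(5000, 1, 50)` describes 5,000 clients polling once per second while only 50 relevant changes occur per second: the database answers 5,000 queries per second, and about 99% of them return nothing new.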

But what if the server could notify a client? To evolve the architecture, Burham proposes four additional components: Node.js with Socket.io, and Celery with RabbitMQ.

[Figure: the “Real-time” GeoStack architecture]

By adding Socket.io to the client, we get a way to use websockets across browsers and transport data bi-directionally. We can send information from the server to a specific client, and only when it is needed. Varnish changes purpose slightly in this architecture: it still does the HTTP caching, but now it also routes traffic as a reverse proxy. With a few lines of Varnish configuration, you can handle different kinds of traffic differently. For example, if websocket traffic comes in, you can send it to server A, and, if other types of traffic come in, send it to server B. In this case, Node.js sits behind Varnish and offers a way to push or publish notifications back out to clients.
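
The routing decision itself is simple. The sketch below expresses it in Python purely for illustration; in the real stack it would be written in Varnish’s own configuration language, and this is not Burham’s configuration. The backend names are hypothetical. The `Upgrade: websocket` header is part of the standard websocket handshake, and `/socket.io/` is Socket.io’s default request path; routing on that path as well (so that Socket.io’s non-websocket fallback transports reach Node.js) is an assumption of this sketch.

```python
NODE_BACKEND = "node_backend"  # hypothetical name: Node.js + Socket.io (server A)
WEB_BACKEND = "web_backend"    # hypothetical name: Nginx/Gunicorn/Django (server B)


def choose_backend(path, headers):
    """Pick a backend for a request, mimicking a reverse-proxy routing rule."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    is_websocket = normalized.get("upgrade", "").lower() == "websocket"
    if is_websocket or path.startswith("/socket.io/"):
        return NODE_BACKEND
    return WEB_BACKEND
```

Everything that is not real-time traffic keeps flowing through the cached tile and web path, so the “Normal” part of the stack is unchanged.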

Now, we need a way to take an incoming client’s geo-position update, asynchronously decide if some alert should happen, and distribute a notification back out to only certain clients. This is where we add Celery and RabbitMQ to the architecture. With these two components, we can support asynchronous tasks and queues. If you aren’t familiar with Celery, it is a task queue based on distributed message passing. Your code can call Celery tasks to execute concurrently and operate in either asynchronous or synchronous modes. This is great for geographic use cases like crossing a boundary or fence, when something is entering or leaving a polygon. You don’t need to make the calculation at the moment an update arrives; you can receive a task, make the calculation, and decide to notify others later. With the addition of RabbitMQ (Celery’s default broker), we can then publish the results of these tasks and calculations to a Pub/Sub channel where Node.js is listening. Whenever an appropriate event is triggered, we push a message down the channel, and Node.js forwards the information back to the appropriate clients using Socket.io.
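
The fence-crossing logic described above can be sketched in plain Python. This is a simplified illustration, not Burham’s implementation: the `FenceWatcher` class, its event format, and the in-memory state are all hypothetical. In the architecture described here, the body of `handle_update` would run inside a Celery task, the last-known state would live in a shared store such as PostgreSQL/PostGIS, and `publish` would send the message to RabbitMQ for Node.js to pick up.

```python
import json


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Does the edge (j -> i) straddle the horizontal line through the point?
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


class FenceWatcher:
    """Publishes an event only when an object enters or leaves the fence."""

    def __init__(self, fence_id, polygon, publish):
        self.fence_id = fence_id
        self.polygon = polygon
        self._publish = publish  # stand-in for publishing to a message broker
        self._inside = {}        # object id -> last known inside/outside state

    def handle_update(self, object_id, lon, lat):
        now_inside = point_in_polygon(lon, lat, self.polygon)
        was_inside = self._inside.get(object_id)
        self._inside[object_id] = now_inside
        # The first sighting or an unchanged state produces no notification.
        if was_inside is None or was_inside == now_inside:
            return None
        event = {
            "fence": self.fence_id,
            "object": object_id,
            "event": "entered" if now_inside else "left",
        }
        self._publish(json.dumps(event))
        return event
```

Most position updates produce no message at all, which is the contrast with polling: clients hear from the server only when a boundary is actually crossed.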
