Realtime Data API w/ Kafka - Trevor Polischuk

For years, counting page views and PDF and XML downloads at PLOS was difficult. Our methodologies were antiquated and prone to high-cost failures. After auditing our old system (a tangled web of cron jobs, scripts that batch-processed log files, and a very old Drupal application), we decided it was best to take it out behind the barn.

With a clean slate, we launched a new system in February that employs a modernized pipeline for counting page views, a new API for passing the data between our applications, and a real-time UI implementation that cuts the time to the first ALM (article-level metrics) from 48 hours to minutes.

Before we started writing a line of code, we first needed to answer the fundamental question: What is a page view?

While the answer seems obvious, our soon-to-be robot overlords make it very difficult. It is estimated that over half of all web traffic comes from automated bots, scripts, and terminators sent from the future, flying from link to link. For our purposes, we wanted to do our best to eliminate these from our counts. To do so, we’ve switched from pure logfile parsing to a methodology similar to the one employed by web analytics behemoth Google Analytics.

This switch eliminated almost all of the bot traffic from the counts. We no longer rely on the honesty of an automated script reporting itself in the User Agent string. Instead, a view is counted only when the JavaScript on the page is executed, which most bots, out of pure efficiency, won’t do.
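To make the difference concrete, here is a minimal Python sketch of the two counting strategies. It is an illustration only, not our production code: the sample records, the field names, and the helper names `count_from_logs` and `count_from_beacons` are all hypothetical, and the "beacon hits" simply stand in for whatever the page's JavaScript reports once it has actually run.

```python
from collections import Counter

# Hypothetical raw server log entries: every HTTP request lands here,
# whether or not the client ever executed the page's JavaScript.
SERVER_LOG = [
    {"path": "/article/1", "user_agent": "Mozilla/5.0 (Macintosh)"},
    {"path": "/article/1", "user_agent": "ExampleBot/2.1"},            # honest bot
    {"path": "/article/1", "user_agent": "Mozilla/5.0 (Windows NT)"},  # bot posing as a browser
    {"path": "/article/2", "user_agent": "Mozilla/5.0 (X11; Linux)"},
]

# Hypothetical beacon hits: only clients that ran the page's JavaScript
# produce one, so the disguised bot above never shows up here.
BEACON_HITS = [
    {"path": "/article/1"},
    {"path": "/article/2"},
]


def count_from_logs(entries):
    """Old approach: trust the User-Agent string to filter out bots."""
    return Counter(
        entry["path"]
        for entry in entries
        if "bot" not in entry["user_agent"].lower()
    )


def count_from_beacons(hits):
    """New approach: count only the hits reported by executed JavaScript."""
    return Counter(hit["path"] for hit in hits)


log_counts = count_from_logs(SERVER_LOG)
beacon_counts = count_from_beacons(BEACON_HITS)
print("log-based:", dict(log_counts))
print("beacon-based:", dict(beacon_counts))
```

In this toy data the User-Agent filter catches the bot that announces itself but still counts the one posing as a browser, while the beacon-based count never sees either of them.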

Second, we switched from a batch processing system to a real-time streaming architecture. We adopted the open source Apache Kafka distributed streaming platform so that we process one message at a time, rather than 24 hours of logs every night.
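The contrast between the two models can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not our actual pipeline: a plain list stands in for the Kafka topic so the example runs without a broker, and the message shape, the article identifiers, and the names `batch_count` and `StreamingCounter` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical messages; in a real deployment these would arrive from a
# Kafka topic rather than from an in-memory list.
MESSAGES = [
    {"article": "article-a", "type": "view"},
    {"article": "article-a", "type": "pdf"},
    {"article": "article-b", "type": "view"},
    {"article": "article-a", "type": "view"},
    {"article": "article-b", "type": "xml"},
]


def batch_count(day_of_events):
    """Nightly batch: no counts exist until the whole day has been read."""
    counts = defaultdict(Counter)
    for event in day_of_events:
        counts[event["article"]][event["type"]] += 1
    return counts


class StreamingCounter:
    """Streaming: counts are updated, and readable, after every message."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def handle(self, message):
        self.counts[message["article"]][message["type"]] += 1
        # Return the fresh totals for the article that was just touched.
        return dict(self.counts[message["article"]])


counter = StreamingCounter()
snapshots = [counter.handle(message) for message in MESSAGES]

# Totals are available after the very first message ...
print("after 1 message:", snapshots[0])
# ... and the final state matches what the nightly batch would produce.
print("streaming == batch:", counter.counts == batch_count(MESSAGES))
```

Both approaches arrive at the same totals. The difference is when those totals become visible: after each message in the streaming case, and only after the nightly run in the batch case.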

Finally, with substantial help from Python wizard Sebastian Bassi, we replaced an ailing Drupal application with a lightweight API built using Flask and Swagger, enabling our internal services to consume the real-time data.

Now when you view the metrics on an article, you’re viewing the latest and greatest statistics. As a side effect of this new architecture, we now collect and display the view metrics within seconds of the article being published, instead of waiting for the nightly job to run, which often kept any metrics from showing for almost two days. Pretty cool!