Using Kafka to Distribute and Dual-load Timeseries Data

This article was originally posted at http://engineering.conversantmedia.com/2015/08/24/using-kafka-to-distribute-and-dual-load-timeseries-data/

At Conversant, we love OpenTSDB.

Operations teams, software engineers, data scientists, QA, and even some on the business side scrutinize these colorful graphs daily. OpenTSDB allows us to effectively visualize the various metrics collected from our hundreds of servers across several data centers.

Numerous home-grown scripts have been developed to mine the data through the RESTful API for engineering reports, alerting, and quality checks. We screen-scrape the PNGs for internal dashboards. More recently, OpenTSDB has taken on the task of looping some of this data back into the production chain (much to the chagrin of engineering management working to replace this bit!).

Once a pet project, a closely guarded, engineering-only tool, it has since grown into a respectable dozen-node cluster with strong HA SLAs, persisting over 550 million data points per day.

Room for Improvement

[Figure: the current metrics pipeline]

Last year we set out to perform a major upgrade of the cluster. Load balancing was introduced (via HAProxy) and an external CNAME was created to provide some measure of indirection and flexibility. The original hope was to build a brand-new cluster from scratch and port the historical data over.

Unfortunately this wasn’t practical – it would take several days to copy the data and there was no easy way of back-filling the missed events once the initial copy had completed. Instead we opted to take the cluster down to upgrade the Hadoop stack (an older version that didn’t support rolling upgrades), and left the OS upgrade for later.

The nature of the metrics collection process – a loose assemblage of Python and Bash scripts – meant that ALL metrics collection would cease during this planned outage. The scripts would of course continue to run, but the events would simply be dropped on the floor when the cluster wasn’t available to persist them.

This was clearly less than ideal and an obvious candidate for enhancement.

Conversant needed a solution that would allow taking the cluster down for maintenance while continuing to buffer incoming events, so they could be applied once the work was complete. Additionally, I wanted to build a backup system for DR purposes and for A/B testing upgrades and potential performance or stability enhancements at scale. A secondary instance would also be useful for protecting the primary cluster from “rogue” users making expensive requests while engineering is troubleshooting. The primary cluster would be kept “private” while the backup was exposed to a wider audience.

A Use Case for Kafka

[Figure: the proposed pipeline with Kafka]

This seemed like a perfect use case for Kafka – a distributed, durable message queue/pub-sub system. In fact, this specific issue brought to mind an article I’d read a few months back on O’Reilly’s Radar blog by Gwen Shapira. In it, Shapira discusses the benefits of inserting Kafka into data pipelines to enable things like dual-loading data to test and validate different databases and models.

Kafka could be inserted into the flow – sending all metrics into Kafka where they would be buffered for consumption by the OpenTSDB cluster. Most of the time this would function in near-realtime with the cluster consuming the events nearly as fast as they are produced.

However, should the cluster become unavailable for any reason, Kafka will happily continue to buffer incoming events until service is restored. In addition, a separate backup cluster could be built and loaded concurrently by consuming the same event stream. In fact, nothing would prevent us from setting up separate Kafka topics for each component of the stack, enabling selective consumption of subsets of the metrics by alternative systems.
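
To make the dual-load idea concrete, here’s a rough sketch of what a consuming loader might look like – one instance per OpenTSDB cluster, each in its own consumer group over the same topics. This is my own illustration using the newer Kafka consumer API (which postdates this setup); the broker address, topic names, and group id are all made up:

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object TsdbLoader {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka01:9092")   // illustrative broker
    props.put("group.id", "tsdb-primary")            // backup cluster would use "tsdb-backup"
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(List("metrics.hbase", "metrics.kafka").asJava)

    while (true) {
      // key = metric name; value = the rest: timestamp, value, tags
      for (rec <- consumer.poll(1000).asScala) {
        val putCommand = s"put ${rec.key} ${rec.value}"
        // ...write putCommand to the tsd telnet port here...
      }
    }
  }
}

Because each cluster tracks its own offsets, the primary and backup instances load the identical stream independently – and a cluster that goes down for maintenance simply resumes from its last committed offset.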

The Conversant data team has been working with Apache Kafka since late last year, delivering and administering two production clusters. The largest of these handles billions of log records every day and has proven to be rock solid. There was no doubt the cluster could broker metric data reliably.

Simple topic/key design

The initial plan was to publish the TSDB put commands from the current scripts directly into Kafka via the kafka-console-producer. Though this would have been the easiest and fastest way to prove the concept, it would negate some of the benefits of the abstraction.

A simple design was devised instead for a set of topics and keys to represent the metrics. Each component in the stack pushes to a separate component-specific topic. The metric name is used as the key for each message. The message payload is essentially everything else: the timestamp, tags, and metric value. For now these are left as bare strings; a future enhancement may include packing them into a custom protocol buffers object or JSON structure.
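
As an illustration of that design, here’s a minimal producer sketch – again my own, not our production code; the broker, topic, and metric names are made up:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object MetricPublisher {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka01:9092")   // illustrative broker
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // topic = component, key = metric name,
    // payload = everything else: timestamp, value, and tags
    val record = new ProducerRecord[String, String](
      "metrics.hbase",                      // component-specific topic
      "hbase.regionserver.requests",        // metric name as the key
      "1440426000 1287 host=dn04 dc=chi")   // timestamp, value, tags
    producer.send(record)
    producer.close()
  }
}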

[Figure: anatomy of a metric message]

Future

By introducing HBase-level Snappy compression on the tsdb table, implementing a more practical TTL of 2 years, and performing a major compaction, there’s more than enough room to split the single cluster into separate primary and secondary clusters. Other groups are already interested in tapping into this stream – either directly from Kafka or through the new “public” OpenTSDB service. Work on standardizing the metric collection and publishing code will start soon, providing a more maintainable codebase to support future enhancements and growth. There’s even the possibility of tapping into the Kafka event streams directly, using a custom consumer for things like monitoring for discrete critical events.

This new architecture provides greater flexibility, availability, and redundancy. At Conversant, we love OpenTSDB + Kafka.

On Becoming Functional

This article was originally posted at http://engineering.conversantmedia.com/2015/07/06/on-becoming-functional/

The final verdict may yet be out, but the trend is quite clear – imperative style is old and busted, functional is the new black; at least as far as “big data” is concerned.

This isn’t a book review, though it was prompted by reading the concise and thought-provoking book “Becoming Functional” by Joshua Backfield. Backfield’s book does a good job of introducing functional concepts to imperative programmers and I definitely recommend this quick read to other Java developers considering making the transition. Conversant engineers will find a copy to lend sitting on my desk.

Concise code

[Image: xkcd “Tall Infographics” – “By the year 2019 all information will be communicated in this clear and concise format”]

I’ve been an imperative style coder for as long as I can remember (and far before I’d ever heard the term ‘imperative’), working almost exclusively in Java since well before Y2K. While I wouldn’t be shocked if over time I grew to find functional code easier to read and clearer in intent, at this point I have a difficult time appreciating what others apparently take as a given. Article after article pronounces functional code as obviously clearer – displaying code as if it’s self-evident that functional trumps imperative on readability. Although this certainly may be true in some – or even many – cases, I don’t find it universally so.

Another excellent book, Cay Horstmann’s “Scala for the Impatient,” follows suit. Take, for example, this passage from Horstmann’s book:

The Scala designers think that every keystroke is precious, so they let you combine a class with its primary constructor. When reading a Scala class, you need to disentangle the two.

I find this logic in some ways backward. Modern IDEs are exceptionally good at generating boilerplate automatically. This dramatically limits the amount of finger typing required already. By further compacting the syntax, what you save in upfront keystrokes may be paid out again by other developers (even you) in the time required to “disentangle the two.”
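
For a concrete illustration (my own sketch, not an example from either book), this is the combined class/primary-constructor style in question – constructor parameters, fields, and construction-time statements all share the class body, and the reader has to pull them apart:

class Account(val owner: String, initialBalance: Double) {
  private var balance = initialBalance    // field seeded from a constructor parameter
  println(s"Opened account for $owner")   // statement that runs at construction time
  def deposit(amount: Double): Unit = { balance += amount }
}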

Reductio ad absurdum

Taken to the extreme, one could argue that a great way to cut down keystrokes – perhaps more than the concise constructor syntax – would be to limit name lengths. Why not keep our method names down to a single character?

class A(b: String) {
  def c: Int = { ... }
}

This is clearly absurd and an unfair comparison, as there’s no amount of language knowledge that would provide hints as to the purpose of the method simply by viewing the API. However, my point is merely that conciseness != readability in every case – especially for the less experienced developer, or those new to a given language.

Recursion

The other area that concerns me is the preference for recursion over iteration. In order to maintain immutability, recursion becomes a necessity. However, it certainly isn’t natural for me to think and write recursively. I’ve had to spend significant time and expend some effort to start to see recursion where iteration appears the obvious quick and easy way. Although I’m confident I can ultimately work effectively in this paradigm, I’m concerned this will significantly limit our pool of potential new recruits. I think Joel Spolsky puts it well in his Guerilla Guide to Interviewing:

Whereas ten years ago it was rare for a computer science student to get through college without learning recursion and functional programming in one class and C or Pascal with data structures in another class, today it’s possible in many otherwise reputable schools to coast by on Java alone.

Take this example from Backfield (p59), the iterative solution:

public static int countEnabledCustomersWithNoEnabledContacts(
    List<Customer> customers) {
  int total = 0
  for (Customer customer : customers) {
    if (customer.enabled) {
      if (customer.contacts.find({ contact -> contact.enabled }) == null) {
        total = total + 1
      }
    }
  }
  return total
}

And the tail-recursive functional version:

def countEnabledCustomersWithNoEnabledContacts(customers: List[Customer],
    sum: Int): Int = {
  if (customers.isEmpty) {
    return sum
  } else {
    val addition: Int = if (customers.head.enabled &&
                            customers.head.contacts.count(contact =>
                              contact.enabled) == 0) { 1 } else { 0 }
    return countEnabledCustomersWithNoEnabledContacts(customers.tail,
                                                      sum + addition)
  }
}

The ‘issue’ with the iterative solution per Backfield (and other functional evangelists) is the mutable variable ‘total’. Although I’ve been burned innumerable times by side effects, never can I recall something as generally benign as a mutable local counter causing hard-to-diagnose issues. Side effects due to mutability are a real problem – but my experience with these problems is limited to mutable collections whose elements are changed, or member variables updated by a method that doesn’t advertise that it might do so. Fear of mutable, method-scoped primitives seems irrational to me.

And in the example code above I argue the cure is significantly worse than the disease. We’re replacing relatively straightforward iterative code with something that to me (and many I suspect) appears far less clear. This comes with the added benefit of running slower and being less efficient! Ah, but of course in languages like Scala the compiler takes care of replacing this recursion under the covers with what? Iteration! However, you only get the benefit of iterative processing if you know the proper way to structure your tail recursion. Fail at that and you wind up with a bloated stack.

Why on earth shouldn’t I just write it iteratively in the first place and save the compiler – and the next maintainer – the trouble?
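
To be fair, Scala at least offers a guard rail here. A small sketch (the method names are my own): the @tailrec annotation makes the compiler verify that a call really is in tail position, failing the build instead of silently leaving you with a growing stack.

import scala.annotation.tailrec

object TailRecDemo {
  @tailrec
  def countDown(n: Int): Int =
    if (n <= 0) 0 else countDown(n - 1)   // compiles: the call is in tail position

  // @tailrec
  // def sumTo(n: Int): Int =
  //   if (n <= 0) 0 else n + sumTo(n - 1)  // would NOT compile: the addition
  //                                        // runs after the recursive call returns
}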

Anonymous functions and long call chains

When reading books and articles on functional programming, one can’t help but run across examples including long chains of function calls or multiple levels of nested anonymous functions.

This particular line from Backfield (p88) is a good example of the kind of code that makes me reach for my reading glasses:

Customer.allCustomers.filter({ customer =>
    customer.enabled && customer.contract.enabled
}).foreach({ customer =>
    customer.contacts.foreach(cls)
})

Though not nested in this example, the extra spaces and chains of calls with anonymous functions make it harder to immediately recognize that we’re merely applying a foreach to the elements matching the filter. Obviously there’s no requirement to do this with anonymous functions – using named functions would, I think, shorten the line and make it easier to grasp at a glance. Functional programming, however, appears to promote and even prefer this style of coding. (Though perhaps that’s just functional programming literary examples?)
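
Something along these lines, for instance – my own refactoring of the example above, with a made-up predicate name:

def isFullyEnabled(customer: Customer): Boolean =
  customer.enabled && customer.contract.enabled

Customer.allCustomers
  .filter(isFullyEnabled)
  .foreach(customer => customer.contacts.foreach(cls))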

You may argue that this really isn’t all that difficult to read – and I’m sure you’re not alone. My complaint however is not about the ability to read, but my ability to read fast. I find when I need to scan multiple lines of code with anonymous functions, I tend to ‘push’ the anonymous functions into my own internal ‘stack’. The longer the chain of calls and the more anonymous functions included, the larger the stack in my head – and the longer it takes to see the intent of the code as I pop each item off. And on this I’m certain that I’m not alone (see Miller’s Law).

Add a handful of these constructs into a single section of code and this can significantly slow down troubleshooting, debugging, and code reviews.

…and please don’t get me started on currying.

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. – Brian Kernighan

What next?

[Image: xkcd “Functional” – “Tail recursion is its own reward.”]

And after all that…I do find functional style compelling and interesting. There is a sort of beauty to well-crafted code in general and functional code in particular. Much of the work I did a few years back in the web space using client-side (and some server-side) JavaScript involved functional coding. And I admit it can be fun and somewhat addictive.

So where to go from here? As I said at the outset, functional is definitely the new coolness in “big data” and we ignore it at our own peril. Many of the tools we’re using today were developed in functional languages like Scala or Clojure. I’m just not ready to commit wholesale to doing things exclusively (or even mostly) the “functional way”. To this end, my team has been dabbling in Scala, writing unit and functional tests using the excellent ScalaTest suite. The more I work with Scala, the more I like it…and the easier it gets to fully grok the Kafka code I frequently find myself digging into.
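
As a flavor of what that dabbling looks like, here’s a minimal sketch in ScalaTest’s FlatSpec style. The case classes and the counting function are illustrative stand-ins echoing the Backfield example above, not our production code:

import org.scalatest.FlatSpec

case class Contact(enabled: Boolean)
case class Customer(enabled: Boolean, contacts: List[Contact])

class CustomerCountSpec extends FlatSpec {

  // Tail-recursive counter from the discussion above, reproduced for a
  // self-contained example.
  private def countEnabledCustomersWithNoEnabledContacts(
      customers: List[Customer], sum: Int): Int =
    if (customers.isEmpty) sum
    else {
      val addition =
        if (customers.head.enabled &&
            !customers.head.contacts.exists(_.enabled)) 1 else 0
      countEnabledCustomersWithNoEnabledContacts(customers.tail, sum + addition)
    }

  "An enabled customer with no enabled contacts" should "be counted" in {
    val c = Customer(enabled = true, contacts = List(Contact(enabled = false)))
    assert(countEnabledCustomersWithNoEnabledContacts(List(c), 0) == 1)
  }

  "A disabled customer" should "not be counted" in {
    val c = Customer(enabled = false, contacts = Nil)
    assert(countEnabledCustomersWithNoEnabledContacts(List(c), 0) == 0)
  }
}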

With time perhaps many if not all of my concerns will be proven unfounded – or at least mitigated effectively. Internally we’ll continue to promote efforts to absorb functional concepts and incorporate the best bits into our everyday engineering efforts.

Scala has gained significant traction on my team and within the Conversant engineering organization. The plan is to continue to drive this process to discover how far it takes us in becoming more functional.