Distributed, configuration-based ETL in Apache STORM

The Security and Market data team at BlackRock is charged with onboarding all of the indicative and market data required to fuel the Aladdin platform. Having recently transitioned from Perl-based ETL to a new platform built using Apache STORM, we were faced with a familiar problem:

How to effectively describe, deploy and execute complex data transformations?

Naively, I figured that jumping 10 years forward from a proprietary Perl platform into a world rich with open-source big data tools, we would be spoilt for choice. I was surprised, and a little disappointed, by what I found. To help outline the problem, the table below broadly summarizes the typical data feeds we onboard through STORM:

[Table: data feed characteristics]

I categorize any feed that does not contain nested collections or complex structures as simple. These feeds are typically high velocity, like prices or market rates, and require only simple transformations. In these cases performance is key and a rules-based transform is not needed.

For unstructured feeds such as email or prospectuses, we are using a number of ML approaches which fit well into the STORM platform.

The functional gap is for the highly complex, relatively low-velocity structured data feeds. Our legacy Perl platform, though aging, had a very capable configuration-based rules engine which has been the basis for strong collaboration between developers, testers and our end users. Based on our experience with that platform, we outlined our needs for a new translation engine:

[Figure: BlackRock rules-based translation requirements]

It is also helpful to understand the common challenges we face with this data. In most cases this is security indicative data, essentially the term sheets. Typically delivered as XML or sourced from in-house RDBMS tables, complex instruments can contain 10 or more levels of nesting, with relationships between sub-collections.

The contenders:
When looking at both open-source and commercial options, we saw three types of solution emerge:

1: Data Wranglers: ML-based platforms which allow end users to train the system to make translations. Examples are Tamr, Trifacta, Paxata and Data Wrangler.

2: UI-Based Transform Editors: Traditional ETL tools with modern connectors, offering palette-based drag-and-drop building blocks to define transformations. Examples are Talend, Kettle, Informatica, etc.

3: DSLs: There are many examples of domain-specific languages well suited to describing translations. Examples are Pig, Hive, Drools, Morphlines …

There is of course a fourth option which is to build something ourselves from the ground up.

It quickly became clear that the data wrangling tools, while exceptionally promising, could not handle these complex transformations. We also ruled out the traditional UI-builder approach, even though those tools were able to solve the transform challenges. That left us focused on Morphlines and Drools; in truth I think either would have worked, but we selected Morphlines from the Cloudera Kite SDK based on its simplicity and flexibility.

Morphline … what’s a Morphline?
Excited about the tool and tired of answering the question “Morphline … what’s a Morphline?”, we put together a short video describing Morphlines and its strengths:

Also take a look at the Morphlines Introduction and Reference Guide:

http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/

http://kitesdk.org/docs/1.1.0/morphlines/morphlines-reference-guide.html
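
To give a flavor of what these configurations look like, here is a minimal, hypothetical morphline that parses a pipe-delimited price record, tags it with its source feed and normalizes a date. The morphline id, field names and values are invented for illustration; the commands used (readCSV, addValues, convertTimestamp, logDebug) are standard Kite commands described in the reference guide above.

    morphlines : [
      {
        # Unique id so the host application can pick this morphline out of the file
        id : priceFeedMorphline
        importCommands : ["org.kitesdk.**"]

        commands : [
          # Split each incoming line into named fields (hypothetical layout)
          {
            readCSV {
              separator : "|"
              columns : [securityId, priceDate, closePrice]
              trim : true
              charset : UTF-8
            }
          }

          # Tag every record with its source feed
          {
            addValues {
              sourceFeed : "vendor-prices"
            }
          }

          # Normalize the price date into ISO-8601 UTC
          {
            convertTimestamp {
              field : priceDate
              inputFormats : ["yyyyMMdd"]
              inputTimezone : UTC
              outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
              outputTimezone : UTC
            }
          }

          # Log the transformed record; a downstream command or the host
          # application would normally consume it instead
          {
            logDebug {
              format : "output record: {}"
              args : ["@{}"]
            }
          }
        ]
      }
    ]

The host application selects a morphline from the file by its id, compiles it into a chain of commands and pushes records through that chain; in our case the host application is a STORM bolt, as described in the next section.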

Integrating with Storm
The biggest advantage is the ease with which Morphlines can be integrated with STORM. This wasn’t an original idea; we first saw the pairing in this blog entry:

https://www.javacodegeeks.com/2016/01/configurable-etl-processing-using-apache-storm-kite-sdk-morphlines.html

We have integrated Morphlines with our own STORM libraries and the results have been very compelling. Transformation chains are proving easy for our business experts to understand and performance has been very strong.

The diagram below shows one STORM topology we use for onboarding complex structured data:

[Diagram: Apache Storm topology for onboarding complex structured data]

Walking through the flow:
• The File Spout monitors for newly received files
• The File Reader bolt splits large files into individual records
• Raw data is written to a Cassandra store for further analysis or to support replay
• The Enrichment bolt works on a batch of records to source additional external data
• The Transform bolt executes the Morphline commands (a sketch follows this list)
• Final transformed output can be written to target data stores using multiple persistence bolts
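
To make the Transform step concrete, here is a minimal sketch of what a Morphline-driven transform bolt could look like using the Kite Compiler API. This is not our production bolt: the config path, morphline id and tuple field names are hypothetical, and batching, error handling and metrics are omitted.

    // Minimal sketch of a Morphline transform bolt (Storm 1.x APIs assumed).
    // The config path, morphline id and tuple field names are hypothetical.
    import java.io.ByteArrayInputStream;
    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.kitesdk.morphline.api.Command;
    import org.kitesdk.morphline.api.MorphlineContext;
    import org.kitesdk.morphline.api.Record;
    import org.kitesdk.morphline.base.Compiler;
    import org.kitesdk.morphline.base.Notifications;

    public class MorphlineTransformBolt extends BaseRichBolt {

        private transient OutputCollector collector;
        private transient Command morphline;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;

            // Final child of the command chain: receives each transformed record
            // and forwards it downstream as a Storm tuple.
            Command emitToStorm = new Command() {
                @Override public Command getParent() { return null; }
                @Override public void notify(Record notification) { }
                @Override public boolean process(Record record) {
                    MorphlineTransformBolt.this.collector.emit(new Values(record));
                    return true;
                }
            };

            // Compile the morphline once per bolt instance.
            MorphlineContext morphlineContext = new MorphlineContext.Builder().build();
            this.morphline = new Compiler().compile(
                    new File("conf/transform.conf"),  // hypothetical config shipped with the topology
                    "priceFeedMorphline",             // id of the morphline to execute
                    morphlineContext,
                    emitToStorm);
            Notifications.notifyStartSession(morphline);
        }

        @Override
        public void execute(Tuple input) {
            // Wrap the raw record produced by the upstream File Reader bolt.
            Record record = new Record();
            byte[] body = input.getStringByField("rawRecord").getBytes(StandardCharsets.UTF_8);
            record.put("_attachment_body", new ByteArrayInputStream(body));

            // Run the configured command chain; output surfaces via emitToStorm above.
            if (morphline.process(record)) {
                collector.ack(input);
            } else {
                collector.fail(input);
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("transformedRecord"));
        }
    }

In practice the emitted tuple would carry the parsed fields rather than the Kite Record object itself, but the shape of the integration is the same: compile once in prepare, then push each tuple through the compiled command chain in execute.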

The topology view from the Storm UI:

[Screenshot: topology view in the Storm UI]
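
For readers less familiar with Storm, a topology along these lines is wired together in just a few lines of Java. The sketch below mirrors the diagram; the spout and bolt class names, parallelism hints and topology name are hypothetical stand-ins for our internal components.

    // Rough sketch of wiring the topology shown above (Storm 1.x APIs assumed).
    // FileSpout, FileReaderBolt, RawPersistenceBolt, EnrichmentBolt and
    // TargetPersistenceBolt are hypothetical stand-ins for internal components.
    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;

    public class OnboardingTopology {

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // File Spout monitors for newly received files
            builder.setSpout("file-spout", new FileSpout(), 1);

            // File Reader bolt splits large files into individual records
            builder.setBolt("file-reader", new FileReaderBolt(), 4)
                   .shuffleGrouping("file-spout");

            // Raw data is written to a Cassandra store for analysis and replay
            builder.setBolt("raw-persistence", new RawPersistenceBolt(), 2)
                   .shuffleGrouping("file-reader");

            // Enrichment bolt sources additional external data for a batch of records
            builder.setBolt("enrichment", new EnrichmentBolt(), 4)
                   .shuffleGrouping("file-reader");

            // Transform bolt executes the Morphline commands (see the sketch above)
            builder.setBolt("transform", new MorphlineTransformBolt(), 4)
                   .shuffleGrouping("enrichment");

            // Transformed output is written to target data stores
            builder.setBolt("target-persistence", new TargetPersistenceBolt(), 2)
                   .shuffleGrouping("transform");

            Config conf = new Config();
            conf.setNumWorkers(2);
            StormSubmitter.submitTopology("onboarding-topology", conf, builder.createTopology());
        }
    }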

Some of the benefits we’ve seen so far:
• Vastly improved performance over our legacy platform (from over an hour for a 7 GB XML file down to ~90 seconds)
• All the benefits of STORM, including scale, monitoring and reuse
• Rapid time to market, as Morphlines has a very shallow learning curve and most transformation patterns are already supported
• Clear, readable transformations

Some thoughts on the future:
While legacy, turnkey ETL tools may not have a role in the big-data landscape, the need to bring data into a common domain, preserve lineage and provide adequate monitoring is more critical than ever. In many ways the preference for de-normalized data and polyglot persistence makes data enrichment more necessary and more complex than in a small-data, relational world.

I don’t doubt that ML-based tools like Trifacta will, in the coming years, remove the need for configuration-based ETL even in complex scenarios. In the meantime, Morphlines has become another piece of our growing toolset, one which allows us to quickly and efficiently get data into the hands of our users.
