BlackRock consumes a huge amount of security and market data. The Market Data team at BlackRock is responsible for retrieving market data, including but not limited to terms and conditions, factors and coupons, and credit ratings, and loading it into Aladdin. In addition to maintaining overall data integrity, the team is responsible for access to and control of this sensitive data. In essence, the team retrieves market data, aggregates it, and transforms it into a form that our downstream processes and tools can understand. Currently, most of this work is done by a highly interconnected network of servers and processes.
This market data loading platform was initially built around the idea of batch processing: we would break the data up into chunks, distribute the work across processes, and then collect the results before persisting them in a data store. This paradigm initially met our requirements, as the market data we were processing was not real-time or stream-oriented in nature. However, there has been a shift in the way market data is both generated by vendors and consumed by clients. Market data delivery mechanisms are moving towards stream-based systems, and there has been an exponential increase in the velocity, volume and variety of streaming market data available. Traders, portfolio managers and analysts increasingly need to analyze this data in real time to understand the market better and to find liquidity options when executing orders. We wanted to build a platform that could both load this streaming market data and analyze it in real time. After researching different open-source platform options, we decided to build our market data loading platform on top of Apache Storm.
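The batch paradigm described above (split, distribute, collect, persist) can be illustrated with a minimal sketch. This is not our production loader: the function names, the record shape of `(security_id, price)` pairs, and the in-memory dictionary standing in for a data store are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(records, size):
    """Split a list of records into consecutive chunks of at most `size` items."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def transform(batch):
    """Hypothetical per-chunk work: normalise each (security_id, price) pair."""
    return [(sec_id.upper(), round(price, 2)) for sec_id, price in batch]


def run_batch(records, chunk_size=2, workers=2):
    """Scatter chunks across workers, gather every result, then persist."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in chunk order; list() waits for all of them
        results = list(pool.map(transform, chunk(records, chunk_size)))

    store = {}  # stand-in for a real data store (assumption)
    for result in results:
        for sec_id, price in result:
            store[sec_id] = price
    return store


if __name__ == "__main__":
    print(run_batch([("abc", 1.234), ("def", 2.0), ("ghi", 3.5)]))
```

Note that nothing is persisted until every chunk has finished, which is exactly why this model is a poor match for data that arrives continuously.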
There has been steady growth in the availability of security and market data across vendors, and our business has continually required us to add new data feeds such as IDC Prices, Composite Dealer Prices, Equity Prices, and Dealer Runs, all of which are streaming or near real-time in nature. With these considerations in mind, we began to assess the requirements for our next generation market data platform.
We needed to handle high-velocity streaming data from vendors, such as ‘prices’, which has very low latency requirements. For example, we are expecting about 4 million records of equity pricing tick data that need to be streamed into our servers and dashboards in near real-time. In addition, for certain other kinds of data, such as mortgage data or credit ratings, we want to be able to support ‘at least once’ semantics and, in certain cases, ‘at most once’ semantics.
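The difference between the two delivery semantics can be shown with a small sketch. It does not use Storm's own acknowledgement mechanism; `deliver_at_most_once`, `deliver_at_least_once` and `FlakySink` are hypothetical names, and the sink fails deterministically on its first sight of each message so that the behaviour is easy to follow.

```python
def deliver_at_most_once(messages, handler):
    """Hand each message to the handler a single time; a failure loses it."""
    for msg in messages:
        try:
            handler(msg)
        except Exception:
            pass  # no retry, so a failed message is never redelivered


def deliver_at_least_once(messages, handler, max_attempts=5):
    """Replay each message until the handler returns normally (the 'ack')."""
    for msg in messages:
        for _ in range(max_attempts):
            try:
                handler(msg)
                break  # acknowledged, move on to the next message
            except Exception:
                continue  # not acknowledged, replay the same message
        else:
            raise RuntimeError(f"gave up on {msg!r}")


class FlakySink:
    """Hypothetical sink that fails the first time it sees each message."""

    def __init__(self, fail_before_write):
        self.fail_before_write = fail_before_write
        self.seen = set()
        self.received = []

    def __call__(self, msg):
        first_time = msg not in self.seen
        self.seen.add(msg)
        if first_time and self.fail_before_write:
            raise IOError("failed before the write")
        self.received.append(msg)
        if first_time:
            raise IOError("write succeeded but the ack was lost")


if __name__ == "__main__":
    lossy = FlakySink(fail_before_write=True)
    deliver_at_most_once(["a", "b"], lossy)
    print(lossy.received)  # messages that failed were dropped

    dupes = FlakySink(fail_before_write=False)
    deliver_at_least_once(["a", "b"], dupes)
    print(dupes.received)  # replayed messages were written twice
```

The trade-off is visible in the two runs: ‘at most once’ can lose data but never duplicates it, while ‘at least once’ never loses data but requires the downstream write to tolerate duplicates.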
Since most of our downstream processes also use the market data we load in different capacities, we want to be able to support the following features:
- Ability to do streaming analysis or continuous computation of the market data via our own APIs and tools
- Ability to easily and proactively monitor and manage third-party data loaders
- Ability to compute metrics on the data feeds that we receive
- Ability to report errors and exceptions in real time so that they can be acted upon
As the volume of the data we load via the platform increases, the platform needs to provide the following:
- A horizontally scalable architecture
- Automated fault tolerance
- Replay mechanisms at different levels of granularity
- Ability to load balance between different streams of data
Given the different types of data we receive, there were other factors that we had to take into consideration while designing our next generation market data platform, such as:
- An easy way to define different pipelines across different feeds of data
- Ability to re-scale and reuse existing components
- Easily configurable system for on-boarding new data feeds
Building a system from the ground up that implements all of the above features and requirements is a complex and incredibly time-intensive task. Having researched and surveyed several projects and their functionality, we concluded that Apache Storm was a very good fit: it abstracts away many of the difficulties of building a real-time system and provides a lower-level API on which we can build an elegant system.
Apache Storm in Production – Broker-Dealer Runs Loader
For our first prototype we decided to construct a topology to aggregate and load private equity data that we receive from a vendor. Although the data being generated isn’t exactly “streaming” or “real-time” in nature, we decided to build this topology as an experiment and deploy it into a development cluster. The idea was to learn as much as we possibly could about this relatively new framework and to see whether it met most of our requirements.
The components of the topology itself are pretty basic. There is a file spout component which listens for incoming files from the vendor (each arriving file is considered an event in this case). It is connected to a transformation bolt, which transforms the raw data into the Aladdin model based on a set of rules written by data analysts. The transformation bolt is in turn connected to a persistor bolt, which persists the data into a data store. While building out the topology we dug deep into concepts such as topologies, streams, and the Storm data model. More information on this can be found here.
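The shape of this three-stage topology can be sketched in plain Python. The sketch deliberately does not use the Storm API: the stages are ordinary generator functions, and the comma-separated file layout, the `RULES` mapping and the field names `security_id` and `price` are assumptions made for illustration rather than the actual Aladdin model or analyst rules.

```python
def file_spout(files):
    """Emit one tuple per line of each arriving file (file name -> lines)."""
    for name, lines in files.items():
        for line in lines:
            yield {"source": name, "raw": line}


def transformation_bolt(tuples, rules):
    """Map raw comma-separated columns to target fields using the rules."""
    for tup in tuples:
        values = tup["raw"].split(",")
        yield {field: convert(values[idx]) for field, (idx, convert) in rules.items()}


def persistor_bolt(tuples, store):
    """Write every transformed record to the store (a list stands in for it)."""
    for record in tuples:
        store.append(record)


# Hypothetical rules: target field -> (source column index, converter)
RULES = {"security_id": (0, str.strip), "price": (1, float)}


def run_topology(files, rules=RULES):
    """Wire spout -> transformation bolt -> persistor bolt and run to completion."""
    store = []
    persistor_bolt(transformation_bolt(file_spout(files), rules), store)
    return store


if __name__ == "__main__":
    print(run_topology({"vendor.csv": ["ABC, 10.5", "XYZ,7"]}))
```

Because each stage only consumes and emits tuples, a stage can be swapped or reused in another pipeline without touching its neighbours, which is the property that made the spout and bolt model attractive to us.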
Our first real topology in the production environment was Broker-Dealer Runs. Some context on “broker-dealer runs”: on a daily basis, broker-dealers advertise to the buy side (for example, asset managers) the securities and products they wish to trade and in which they are making a market. Typically, “Dealer Runs” are used to advertise liquidity and pricing in less transparent and more illiquid markets, such as those for corporate bonds and mortgage-backed securities.
This dealer runs data comes to us essentially as emails from broker-dealer entities. We designed a topology to capture these emails and process them. Shown below is the final version of the topology that we intend to deploy into production. Currently, we are running a simpler version of it.
We have come a long way since analyzing and building our first prototype, convincing our various stakeholders along the way to put a cluster into production. We are currently working on other projects to onboard more data onto our Storm platform, such as equity premium prices, derivatives prices, composite prices and trace data, all of which potentially fall into the category of streaming data.
The data provided is for informational purposes only. The information and opinions contained on this website are derived from proprietary and non-proprietary sources deemed by BlackRock to be reliable, are not necessarily all inclusive and are not guaranteed as to accuracy. BlackRock shall not have any liability for the accuracy of the information contained herein, for delays or omissions therein, or for any results based on the use of such information.
©2015 BlackRock, Inc. All rights reserved. BLACKROCK and ALADDIN are registered and unregistered trademarks of BlackRock, Inc., or its subsidiaries in the United States and elsewhere. All other marks are the property of their respective owners.