Abigail Ye joined BlackRock in 2013 and is now leading the orchestration system in BlackRock’s Aladdin Product Group. Ms. Ye has more than 6 years’ experience working on distributed systems, with a passion for improving scalability, reliability and flexibility. At present, she is working on re-architecting Aladdin for a transition to a cloud centric architecture and is focused on re-architecting the orchestration system.
Ms. Ye has delivered many successful presentations in BlackRock based on her professional knowledge of computer systems and was part of the team that won BlackRock’s 2015 global Hackathon. She has a strong educational background in computer science, with a Master’s Degree in Computer Science and a Bachelor Degree in Applied Mathematics. In addition, she is an honor member of the national Alpha Phi Omega organization.
This work outlines points to consider when transitioning to a cloud network architecture based on BlackRock’s experience. It analyzes why many enterprises choose to virtualize their infrastructure and how BlackRock has solved many of the same issues as virtualization with its standardized Core Platform. It then explains the reasons and approach for BlackRock’s transition to a cloud centric architecture.
Many large enterprises have a diverse set of applications, where each app is often built with its own technology stack and unique sets of dependencies. This leads to a proliferation of many different technologies that need to be supported. Each application needs a separate host which is built with the specific technology stack that is required.
The hosts are typically deployed using virtual machines (VMs). VMs emulate a host and are an efficient way to deploy many operating systems and technology stacks to one physical machine. Diagram (a) demonstrates how traditional virtualization works.
While virtual machines provide many benefits in terms of efficiency, there are also drawbacks. There is a memory tax for each VM, they can be slow to startup, and low utilization can result in wasted resources.
ALADDIN CORE PLATFORM
For BlackRock’s Aladdin platform, we have taken a different approach to how infrastructure is built. The Aladdin platform is an operating system for investment managers that seeks to connect the information, people and technology needed to manage money in real time. It consists of thousands of discrete applications running simultaneously.
The Aladdin Core Platform takes the approach of providing a single unified technology stack for all applications. We call this the “Aladdin Operation System”. Since each application shares the same set of dependencies we can run many applications on one physical host without the need for virtualization. You can see this in diagram (b).
The Core Platform is identical across machines, so applications and business logic can be easily moved from one machine to another. This helps us to improve scalability, resiliency, and failover. The Aladdin Core Platform has successfully solved many of the same problems as virtualization.
- We can run hundreds of apps on a single box
- We are only limited by the resources of the machine and don’t have the overhead of virtualization
- We can easily move apps between machines for redundancy and resiliency purposes
However, even with the advantages of the Core Platform, there are still benefits we can gain by transitioning to a cloud architecture.
Apart from the first party applications that make up Aladdin, we are beginning to deploy more third party tools to do big data analysis, such as Hadoop and Spark. We want the ability to run these tools on the same hosts we use to run our first party apps for the sake of efficiency.
We are also beginning to deploy applications that have more bursty request profiles. Historically we could manage with a fixed capacity to handle requests but we are now looking for ways to more easily scale our applications by running more instances.
Additionally, we are looking for the ability to easily deploy disposable test environments, so different development teams can easily test in isolated environments.
We are transitioning to a cloud architecture to help solve these challenges. Cloud is a very popular term and has several meanings. For Aladdin, it means that we want to have a generic pool of resources that can be used to run any workload, whether it is a first party application or third party analytics tool. We also want the ability to run processes on any host so we can easily scale applications horizontally.
Diagram (c) shows the cloud architecture. Any application or business logic can be run on any host in the cloud environment. The additional benefits we can get from here are:
- Efficiently support big data analysis by sharing resources with existing workloads
- More scalability to support bursty requests
- Easily deploy disposable test environments
Our approach is to use a container based system to deploy applications. Containers are portable application units that are able to run on any machine. There are several things to consider when making the transition to a container based cloud architecture.
Within a cloud architecture, all machines should be built identically. We are going to package applications into container images to ensure they run reliably when moved from one machine to another.
Containers bundle an entire runtime environment for the application: application binaries, all its dependencies, libraries and configuration files into a package. Besides packaging dependencies, containers also provide the additional benefits of first class versioning and isolating applications at runtime. Unlike VMs, containers still share the operation system of the host and have much faster startup time and better resource usage efficiency.
With the dependencies of the application built into the image, we faced a challenge with patching shared libraries when there is a security vulnerability (like Heartbleed). It is quite difficult to rebuild each application image with the patch when you have hundreds or thousands of applications. The approach we took was to inherit shared libraries from the host rather than pack them into the image. We are able to do this because the Core Platform ensures that all applications share the same set of libraries.
It is important for BlackRock that we integrate the container lifecycle with the existing processes that we already have in place. We spent time integrating the containers with the existing ways that we build, deploy, and run applications. We chose Docker as our container engine, but spent time building the appropriate abstractions so it is easy to work with.
Containers help to build a portable application that can easily run anywhere. The next question becomes how should we control these applications in containers and how can they share resources with other workloads in the cloud?
An orchestration tool is the answer. It takes over work from the developer by automating installation, scaling and maintenance of applications. This allows developers to focus on building the application logic while the orchestration tool handles the deploying and execution of applications.
The traditional way apps are deployed doesn’t work when we move to a cloud architecture. There are a couple areas to consider:
- Static host assignments to dynamic scheduling
In the VM model, operations teams build specific hosts for apps to run on that contain the required technology stack as well as a sufficient amount of resources. This process is very manual and the apps are statically assigned to run just on these hosts. This makes it a challenge to vertically scale applications according to request volume, because new hosts need to be built. It also provides limited resource sharing between applications and other workloads, resulting in inefficient resource usage.
Orchestration tools solve this by automatically scheduling applications to run wherever there are available resources on a generic pool of machines. In this environment, “scheduling” refers to the ability to find the hosts to run the application, running the application on the host, and easily scaling the application by running more instances.
- Service discovery
Applications that communicate with each other in a distributed system need to know the address of the services they depend on. This is relatively easy in the traditional deployment model where applications are statically assigned to run on specific hosts. Addresses are typically stored in a configuration file that changes infrequently. However, with the dynamic assignment of application onto any host in the cloud, the network locations are dynamically changed. The service discovery system makes it easy for applications to find the address of the service they depend on, no matter where it is running.
One final point to consider is isolation. Isolating is done by the VM in the virtualization model. In the cloud model applications run on the same physical host and it is important that we isolate them to provide security and resource controls.
In the VM model, it could be said that a host has a specific ‘personality’ based on the packages installed, file systems that are exposed, and the network rules applied by the firewall. In the cloud architecture model, the host becomes generic and the personality is applied to the application, by the orchestration system. This means that a set of file systems and network rules are configured at the application level rather than the host level. The required packages are installed in the container image, as we previously discussed.
It is crucial that you catalog the file system and network dependencies for each application so they can be correctly applied by the orchestration system when running the app.
In conclusion, many companies use virtualization to be able to run many different technology stacks on the same physical machine. There are costs in terms of memory usage and startup time. BlackRock’s Aladdin takes a different approach, running all applications on a single technology stack with the standardized Core Platform. However, there are additional benefits Aladdin can leverage by transitioning to a cloud architecture, such as resource sharing with disparate workloads and improved scalability. There are several consideration points on the transition: packaging, orchestration and isolation.