The Financial Modeling Group – Advanced Data Analytics (FMGADA) team is the large-scale data science arm within BlackRock Solutions. We utilize “big data”, data sets that start with many points and grow rapidly, to solve investment management problems such as building behavioral models for mortgages. Apache Spark is our primary tool for managing the computational power and storage capacity necessary to handle these data sets. Read on to learn about the challenges of making big data tools accessible to users across the technical proficiency spectrum, and a sample solution.
Big Data’s Accessibility Problem
Typical big data tools trade accessibility for scalability. They can handle data sets that are too large and computations that are too complex for a single computer by distributing the work across many machines. However, their rudimentary frontends obscure this power. Typical interfaces, like Spark’s spark-shell and spark-submit, are more complicated and offer fewer features than those of standard business software such as Excel. Excel lets users load data into a table, process it, and visualize it through a simple graphical user interface. Users of spark-shell must study the Scala language and the Spark API before they can even load a data set. Worst of all, visualizations are not possible within the tool. Such a rudimentary interface is a real barrier for business users who do not regularly work with command-line tools.
The spark-shell interface
The dearth of intuitive user interfaces limits the wider adoption of big data. Business users avoid larger data sets despite their advantages because the necessary analysis tools are too cumbersome to integrate into standard business processes. More accessible interfaces will drive adoption by providing users with an easy way to experience the benefits of big data.
Making Big Data User-Friendly
I present a version of the Scatterplot Matrix (SPLOM) visualization as a proof-of-concept for how to build user-friendly interfaces to big data systems. Unlike terminals, interactive visualization tools enable users to intuitively solve business problems with big data. Our SPLOM is an interactive combination of scatterplots that describes the distribution of a data set in each dimension and the relationships between dimensions. The SPLOM’s colors show how a k-means clustering analysis breaks down a data set into subsets of related points. Points share a color if the analysis considers them similar. The buttons shown in the screenshot provide the functionality necessary to alter the visualization to suit the user’s needs, such as adding or removing categories from the matrix.
Andrew Ng’s lecture notes at http://cs229.stanford.edu/notes/cs229-notes7a.pdf are an excellent resource for learning more about k-means clustering.
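For readers who want a feel for what the clustering does, here is a minimal single-machine sketch of the standard k-means iteration in plain Scala. The object and method names are illustrative, and the fixed iteration count is a simplification; the real analysis would run on a distributed implementation such as Spark’s MLlib.

```scala
// Minimal single-machine k-means sketch (illustrative only; a real
// analysis of a large data set would use a distributed implementation
// such as Spark MLlib).
object KMeansSketch {
  type Point = (Double, Double)

  // Squared Euclidean distance between two 2-D points.
  def dist2(a: Point, b: Point): Double = {
    val (dx, dy) = (a._1 - b._1, a._2 - b._2)
    dx * dx + dy * dy
  }

  // Mean (centroid) of a group of points.
  def mean(ps: Seq[Point]): Point =
    (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

  // Repeatedly assign each point to its nearest centroid, then
  // recompute each centroid as the mean of its assigned points.
  def kmeans(points: Seq[Point], init: Seq[Point], iters: Int): Seq[Point] =
    (1 to iters).foldLeft(init) { (centroids, _) =>
      points
        .groupBy(p => centroids.minBy(c => dist2(p, c)))
        .values
        .map(mean)
        .toSeq
    }
}
```

Points that end up assigned to the same centroid form one cluster; in the SPLOM, each such cluster gets its own color.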
The resample button demonstrates how the right interface can enable users to utilize distributed computing systems without extensive technical knowledge. The data in the SPLOM is a random sample of a larger data set. The resample button resamples the underlying data set and draws a new set of points in the SPLOM. It does this by starting a job on a cluster of computers to select a new random sample. When the job finishes, the webpage pulls the data from the cluster and automatically draws the new sample on the screen. The user has used a big data tool without ever knowing it.
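The core of such a resampling job is just drawing a uniform random sample. As a sketch, here is that step in plain Scala; the `resample` helper is hypothetical, and the actual cluster job would perform the equivalent operation over a distributed data set (e.g. with Spark’s RDD sampling).

```scala
import scala.util.Random

// Draw a uniform random sample of size k from a data set.
// Single-machine sketch: the real job runs on a Spark cluster and
// samples a distributed data set instead of an in-memory Seq.
def resample[A](data: Seq[A], k: Int, rng: Random): Seq[A] =
  rng.shuffle(data).take(k)
```

Each press of the resample button triggers one such draw on the cluster, giving the SPLOM a fresh set of points to render.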
Comparing multiple random samples at the push of a button lets users intuitively filter out visual artifacts that result from skewed sampling. Users just need to resample a few times: patterns that vanish after a resample were artifacts of one skewed draw, while trends that persist across samples reflect the actual data. This non-technical analysis demonstrates how anyone can effectively utilize distributed computation to understand a large data set without knowing any programming languages or APIs.
The Tech Behind The User-Friendly Interface
The resample button relies on a stack of technologies to enable an intuitive, visual distributed computing experience. When the user presses the button, the webpage sends a POST request to a server program known as the Spark Job Server. The Job Server processes the request and runs the job on a Spark cluster. After the job finishes, the Job Server stores the result and exposes it through a REST endpoint. The webpage long-polls the Job Server for this result and renders it.
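The long-polling step can be sketched as follows. This is an illustrative Scala version of the pattern (the real page does this in JavaScript), and `checkStatus` is a hypothetical stand-in for a GET against the Job Server’s job-status endpoint, not the Job Server’s actual client API.

```scala
// Long-polling pattern used by the webpage, shown in Scala for
// illustration. `checkStatus` stands in for a REST call that returns
// either the job's current status or, once finished, its result.
sealed trait JobState
case object Running extends JobState
case class Finished(result: String) extends JobState

// Poll until the job finishes, then return its result.
// (A real client would pause between polls, e.g. with Thread.sleep.)
@scala.annotation.tailrec
def pollForResult(checkStatus: () => JobState): String =
  checkStatus() match {
    case Finished(result) => result
    case Running          => pollForResult(checkStatus)
  }
```

Once `pollForResult` returns, the page has the new sample in hand and can redraw the SPLOM, completing the button-press-to-redraw cycle described above.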
Follow the steps in the EC2 section of the Spark Job Server’s GitHub repository to spin up your own Spark cluster and interactive SPLOM visualization on EC2.
To learn more about the Spark Job Server, please visit its GitHub repository. See my pull request for my contributions to the Spark Job Server: scripts for automating the Spark Job Server life cycle on EC2, CORS headers to enable cross-domain communication with the Job Server, and an implementation of the SPLOM example shown in this blog post.
I Big Data, You Big Data, We All Big Data!
Big data tools need more intuitive interfaces before they reach mass adoption. Interactive visualizations like the SPLOM are only one example of such an interface. My coworkers and I will explore many other approaches as we build practical big data tools for business users. In the end, we will have a system for anyone who needs to utilize massive data sets to solve investment management problems.