Real-Time Big Data Analytics with Apache Solr and Spark

Speaker Bio:

Timothy Potter is a senior member of the engineering team at Lucidworks and a committer on the Apache Solr project. Tim focuses on scalability and on hardening Solr's distributed features. Previously, he was an architect on the Big Data team at Dachis Group, where he worked on large-scale machine learning, text mining, and social network analysis problems using Hadoop, Cassandra, and Storm. Tim is the co-author of Solr in Action, a comprehensive guide to using Solr 4, and holds several US patents related to J2EE-based enterprise application integration. He lives with his two Shiba Inus in the mountains outside Denver, CO.

Meetup Summary:

Solr has been adopted by all major Hadoop platform vendors as the de facto standard for big data search because of its ability to scale to meet even the most demanding workloads. As more organizations seek to leverage Spark for big data analytics and machine learning, the need for seamless integration between Spark and Solr continues to grow.

At this meetup, hosted by BlackRock, Timothy Potter presented several common use cases for integrating Solr and Spark. Specifically, he discussed how to populate Solr from a Spark Streaming job and how to expose the results of any Solr query as an RDD (both sketched below). The Solr RDD makes efficient use of deep paging cursors and SolrCloud sharding to maximize parallel computation across large result sets in Spark. He also covered exposing Solr query results as SparkSQL DataFrames and interacting with Solr from the spark-shell.
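To make the indexing path concrete, here is a minimal Scala sketch of writing a stream of text into SolrCloud from a Spark Streaming job. It uses plain SolrJ inside foreachPartition rather than any connector helper; the socket source, ZooKeeper address, collection name, and field names are all hypothetical, and the CloudSolrClient builder signature varies across SolrJ versions.

import java.util.Optional
import scala.collection.JavaConverters._
import org.apache.solr.client.solrj.impl.CloudSolrClient
import org.apache.solr.common.SolrInputDocument
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stream-to-solr")
val ssc = new StreamingContext(conf, Seconds(5))

// Hypothetical text source; in practice this might be Kafka or Twitter.
val lines = ssc.socketTextStream("localhost", 9999)

lines.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One client per partition keeps connection setup off the per-document path.
    val client = new CloudSolrClient.Builder(
      List("zk1:2181").asJava, Optional.of("/solr")).build()
    client.setDefaultCollection("tweets") // hypothetical collection
    val docs = partition.map { line =>
      val doc = new SolrInputDocument()
      doc.addField("id", java.util.UUID.randomUUID().toString)
      doc.addField("body_t", line)
      doc
    }.toSeq
    if (docs.nonEmpty) client.add(docs.asJava)
    client.close() // relies on Solr's autoCommit settings for visibility
  }
}

ssc.start()
ssc.awaitTermination()

Opening one client per partition, rather than per document or per driver, is the usual compromise between connection overhead and serialization constraints in Spark jobs that write to external systems.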
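And here is a sketch of the read path: exposing a Solr query as a SparkSQL DataFrame, with a plain RDD view available from it. The option names are modeled on the Lucidworks spark-solr connector; the ZooKeeper address, collection, and query are hypothetical.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("solr-to-spark").getOrCreate()

// The connector fans the query out across SolrCloud shards and uses
// deep paging cursors to stream large result sets in parallel.
val results = spark.read.format("solr")
  .option("zkhost", "zk1:2181/solr") // hypothetical ZooKeeper ensemble
  .option("collection", "tweets")    // hypothetical collection
  .option("query", "body_t:spark")
  .load()

results.printSchema()
println(results.count())

// A plain RDD view for lower-level transformations:
val rdd = results.rdd

The same calls work interactively from the spark-shell once the connector jar is on the classpath.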

Tim also highlighted the use of MLlib to enrich documents before indexing them in Solr, for tasks like sentiment analysis (logistic regression), language detection, and topic modeling (LDA); a sketch of this enrichment step follows below. All of the concepts presented in this talk are implemented in an open source project donated and supported by Lucidworks.

When discussing big data, and especially search on big data, it is important to establish performance metrics: how many documents per second can be indexed from Spark into Solr using this framework, and how many rows per second can be read from Solr into Spark? Tim concluded his presentation by showing read/write performance metrics achieved using a 10-node Spark / SolrCloud cluster running on YARN in EC2.
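As an illustration of the enrichment step mentioned above, here is a minimal sketch of a spark.ml logistic regression pipeline for sentiment labeling; the training rows, column names, and the idea of writing the prediction into a sentiment field at indexing time are hypothetical.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sentiment-enrichment").getOrCreate()

// Hypothetical labeled training data: (text, label) with 1.0 = positive.
val training = spark.createDataFrame(Seq(
  ("loving the new release", 1.0),
  ("this build is hopelessly broken", 0.0)
)).toDF("text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)

// Score incoming documents; the "prediction" column could then be written
// into a Solr field (e.g. sentiment_s) before indexing.
val docs = spark.createDataFrame(Seq(
  Tuple1("really enjoying solr with spark")
)).toDF("text")
model.transform(docs).select("text", "prediction").show()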
For more photos from the session, check out the event's meetup page.
