BlackRock’s websites are a window into its products and capabilities. These include our retail sites, e.g. BlackRock.com, and our iShares sites, e.g. iShares.com. Our websites host a large amount of content, products and documents. To serve the BlackRock websites worldwide we have to understand all of these different types of information and make them searchable and accessible for our users. Each information type poses its own challenges, from indexing the information, to storing it, to searching it. The many languages of our global presence add another dimension to the challenge.
Search has become something we take for granted in our day-to-day internet experience and we expect it to work.
As with many things, when search works well we don’t notice it. Our expectations have been set so high that we simply demand this greatness in the form of relevant search results. As soon as results fall short of our expectations, however, we immediately notice, and frustration quickly follows.
What makes search interesting is that, as we’ve learned, people want what’s relevant to them at the top of their results, instantaneously. We have also learned that the only way to guarantee this would be to read minds! Since we are still figuring out how to capture input from the human mind, this article looks at the basic concepts Solr offers for improving the relevancy of results without mind-reading capabilities.
To date, BlackRock has relied on a proprietary solution to power our online search. Because it isn’t open source, it doesn’t allow us to understand exactly how it works, and that lack of transparency limits our ability to innovate on our search capabilities. Going forward, we wanted a transparent open source solution with a flexible framework, affording us the opportunity to improve our on-site search further.
Over the last few months we have explored Solr, with a specific focus on relevancy.
Let’s first look at relevancy and how Solr handles it by default. Understanding these details gives you a good base for exploring more complicated concepts in Solr and on-site search.
The most fundamental thing to know about relevancy before we explore this any further is that it is not an objective measure. It is highly subjective. Let’s use a hypothetical example of “The Awesome Bookstore” to illustrate these concepts.
Assume you have a daughter and you are looking for a book to help you with raising her.
Maybe you only want to see books specifically about raising girls. Someone else might be happy to see results about raising children in general as well. For the sake of this example we assume that books about raising JIRA tickets aren’t relevant to bringing up your child. Unless you and your partner use JIRA for managing tasks related to raising your child. While unlikely, stranger things have been managed in JIRA.
A searchable record in Solr is called a “document”, so in this case every book in the bookstore is a document in Solr. Task one is to bring back all the documents relevant to the search query.
Step two is to score these documents so that they are ordered by relevancy against the search query. This is important because most people don’t go past the first page of search results. Hence you want the most important results to be on the first page.
Step 1: Bring back the relevant documents.
Step 2: Score them appropriately based on relevancy.
To understand how Solr brings back documents for scoring, we need to look at the inverted index and at text analysis.
A document consists of several fields. For a book, a field could be the description of the book, its title, or any other fields you believe would be useful for the user to search for.
When Solr indexes a field of a document, it builds an inverted index for that field. Let’s take the example of the book’s title. Assuming no text analysis (this is covered below), Solr builds an index that looks a bit like the following. The illustration shows a few selected entries from the inverted index.
Solr has taken the text and tokenised it. Each word is now a term and acts as a key in the inverted index. The term points to the documents it appears in and also holds the position of the term within the field. This means Solr can, for example, run phrase searches quickly.
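To make the idea concrete, here is a minimal sketch of building an inverted index with term positions. The book titles and document ids are invented for illustration, and real Solr/Lucene index structures are far more sophisticated than this:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the documents (and positions) it appears in.

    `docs` is a dict of document id -> field text. No text analysis
    is applied here; the text is simply split on whitespace.
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split()):
            index[term].append((doc_id, position))
    return dict(index)

# Hypothetical book titles standing in for documents in the index.
books = {
    1: "raising girls",
    2: "raising children with confidence",
}
index = build_inverted_index(books)

# The term "raising" points at both documents, with its position in each.
print(index["raising"])   # [(1, 0), (2, 0)]
print(index["children"])  # [(2, 1)]
```

Because positions are stored alongside each posting, a phrase query such as “raising girls” can be answered by checking that the terms occur in adjacent positions within the same document.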
Now that we understand how the inverted index works, we realise that if we stored every “a”, “the” and “but”, it would grow quickly. Also, if we stored “download”, “downloaded” and “downloading” as separate terms, a search for “download” would not return documents containing “downloaded”. Text analysis in Solr helps with both of these issues.
When we define a field, we specify which tokenizer and which filters our terms pass through before they reach the inverted index.
Let’s look at a comment someone might have left on a book.
There are different tokenizers you can use; here are two examples, the Standard tokenizer and the Whitespace tokenizer. You can see that the Standard tokenizer also removes word delimiters like full stops and parentheses. In this case we decided to use the Whitespace tokenizer to preserve the hashtag. However, as you can see in the picture, we’re also left with tokens that include dashes and exclamation points.
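The difference can be sketched in a few lines. The sample comment below is invented (the article’s original example is in the picture), and the regex is only a rough approximation of what Solr’s Standard tokenizer does:

```python
import re

comment = "Great book! Loved the #unicorns. (A UK reader)"

# Whitespace tokenization: split on spaces only, so punctuation survives
# attached to the tokens.
whitespace_tokens = comment.split()

# A rough approximation of standard tokenization: keep only runs of word
# characters, so delimiters like "#", "!", "." and "()" are dropped.
standard_tokens = re.findall(r"\w+", comment)

print(whitespace_tokens)  # ['Great', 'book!', 'Loved', 'the', '#unicorns.', '(A', 'UK', 'reader)']
print(standard_tokens)    # ['Great', 'book', 'Loved', 'the', 'unicorns', 'A', 'UK', 'reader']
```

Note that whitespace tokenization keeps the hashtag (`#unicorns.`) but also the trailing full stop, which is exactly the kind of leftover that the filters below clean up.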
We then add several filters; you can read more about the available filters in the Solr documentation (https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions).
We add a filter to remove word delimiters from the tokens. The highlighted green words show where changes have happened. We also lose the smiley. We have promoted the # to an ALPHA in our configuration file, which means it is preserved.
Next we add the LowerCase filter, which avoids creating different terms in the index that differ only in case.
The Stop filter removes words that you consider unnecessary to index in the language. In this example we’ve picked words such as “the” and “is” to be removed. The list is fully configurable.
After the stop words have been removed, we apply the ASCIIFolding and KStem filters. ASCIIFolding converts non-ASCII characters, such as accented letters and umlauts, to their closest ASCII equivalents. KStem stems words; while not perfect, it’s still helpful in promoting relevance. You can see in the highlighted tokens that “unicorns” has become “unicorn”, for example.
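The whole chain can be approximated in a short sketch. The stop-word list is a toy subset, the ASCII folding uses Unicode decomposition as a stand-in for Solr’s ASCIIFolding filter, and `naive_stem` is a deliberately crude placeholder for KStem, which handles far more than plural “s”:

```python
import unicodedata

STOP_WORDS = {"the", "is", "a", "an"}  # toy list; Solr reads these from a file

def fold_ascii(token):
    # Approximate ASCIIFolding: decompose characters and drop the
    # combining marks, so "Münch" becomes "Munch".
    return "".join(c for c in unicodedata.normalize("NFKD", token)
                   if not unicodedata.combining(c))

def naive_stem(token):
    # Toy stand-in for KStem: strip a trailing plural "s".
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def analyze(tokens):
    """Lower-case, drop stop words, fold to ASCII, then stem, in that order."""
    out = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        out.append(naive_stem(fold_ascii(token)))
    return out

print(analyze(["The", "Unicorns", "Münch"]))  # ['unicorn', 'munch']
```

As in the Solr configuration, lower-casing happens before the stop-word check, so “The” and “the” are both caught by the same entry in the list.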
There are many different stemmers out there, so play around to find one that suits your needs. The order of the filters is also important: you want to make everything lower case before you remove stop words, because it would be rather difficult to maintain a list of every possible casing of the stop words.
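Putting the chain together, a field type in schema.xml might look like the sketch below. The field type name and the file names are illustrative assumptions, not taken from a real deployment, but each factory class is a standard Solr analysis component:

```xml
<fieldType name="text_reviews" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split on whitespace only, so tokens like "#unicorns" survive. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Strip word delimiters; wdfftypes.txt can map "# => ALPHA" to keep hashtags. -->
    <filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>
```

The filters run in document order, so the lower-case filter sits above the stop filter, matching the ordering advice above.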
Additionally, when you define the field you have to make sure the query passes through the same filters in the same order; otherwise, you won’t find anything. At query time, however, you might want to add some additional filters. One example is the synonym filter. The review of the book says “UK”, but if someone searches for “United Kingdom” we would still like to bring the review back as a result. A synonym filter can add “UK” to the query so that we bring back the desired results.
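A query-side analyzer with a synonym filter might look like the following sketch (file names are illustrative); the synonyms.txt file would contain a line such as `uk, united kingdom`:

```xml
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Applied only at query time: expands "united kingdom" to also match "uk". -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
```

Applying synonyms only at query time keeps the index smaller and lets you update the synonym list without re-indexing all of your documents.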
Once we have the documents whose text matches the query, we need to score them fairly. This topic is explored at great length on the internet, so follow this link (http://ipl.cs.aueb.gr/stougiannis/default.html) to read about the default scoring algorithm in Solr/Lucene. Just to spark your curiosity, here’s what the formula looks like.
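For reference, the classic Lucene practical scoring function (the TF-IDF-based default in the Solr versions discussed here) can be written as:

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)^{2} \cdot
  \mathrm{boost}(t) \cdot \mathrm{norm}(t,d)
```

where, in the default similarity, the term frequency is dampened as $\mathrm{tf}(t,d) = \sqrt{\mathrm{freq}(t,d)}$ and the inverse document frequency is $\mathrm{idf}(t) = 1 + \ln\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1}\right)$, so rare terms contribute more to the score than common ones.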
If you took the time to click through to the link, you should now be familiar with the concepts of field and document boosting. This is a simple concept that you can easily control. In many cases, the title of a book is more important than its description, so if there is a match in the title field we would like to boost it higher. You can also boost an entire document. This might be useful for products that are promoted during a particular season (e.g., a retailer promoting shovels in the winter and shorts in the summer) or for core products that you want to promote to customers. Another option is to boost at query time, for example to weight the most important term.
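As an illustration of query-time field boosting (the host, core name and field names here are invented), a request using Solr’s edismax query parser can weight title matches over description matches:

```
http://localhost:8983/solr/books/select?defType=edismax&q=raising+girls&qf=title^3+description
```

Here `title^3` makes a match in the title field count three times as much as a match in the description. Individual query terms can be boosted the same way, e.g. `q=girls^2 raising` to emphasise the most important term.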
Solr is flexible, and Solr gives us transparency.
For us, Solr provides not only the transparency but also the flexibility to build a solid search platform for our websites. It is customizable and lets us promote the relevancy we want. For these use cases, Solr is easy to get started with and provides a basis for making search available.