Today, I want to talk a bit about Solr’s JSON faceting system, typical facet uses, and then get into more advanced uses.
Most Solr users already have a pretty good idea of what facets are used for. A facet result, at its most basic level, tells you how many documents in your result set have a given value in a given field. Originally, facet results were requested using request parameters, and were somewhat limited.
As of Solr 5, JSON facets are really the way to go. Yonik Seeley and others already give a pretty good description of how JSON facets can be used to replace the antiquated general facet request. The JSON facet API is vastly more powerful than the old parameter-based facet request because it allows nesting, statistics, querying and more. Let’s get into some examples.
Basic JSON Facet Usages
Let’s say your index is made up of documents that contain a field color. A facet on the color field might yield red:6, blue:2, white:13. What does this tell you? It tells you that in your current result set, six documents have red in that field, two have blue and thirteen have white. Pretty simple.
Is that the only thing it tells you? No. It also tells you that the domain (possible values) for the current result set is [red,blue,white]. At the moment, this seems obvious, but it’s important to think about it from that perspective. Why? If the documents in your index represent products, and they have a location as well as a color, then the question you may be trying to answer is: “what colors are available at the Boston warehouse?”. This question is easily answered by issuing a query location:Boston with a facet request on the color field (with mincount=1). So, faceting becomes useful from a reporting standpoint because questions like this are often asked by our clients.
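As a rough sketch of the Boston question, here is how the request body could be assembled in Python for Solr's JSON Request API. The facet label "colors" and the helper name are my own choices for illustration, and the body would normally be POSTed to a collection's /query handler (no collection name is given here, so none is assumed).

```python
import json


def color_facet_request(location: str) -> dict:
    """Build a JSON Request API body: filter to one location, facet on color."""
    return {
        "query": f'location:"{location}"',
        "limit": 0,  # we only want facet counts, not the documents themselves
        "facet": {
            # "colors" is an arbitrary label; it names the facet in the response
            "colors": {
                "type": "terms",
                "field": "color",
                "mincount": 1,  # skip colors with no matching documents
            }
        },
    }


body = color_facet_request("Boston")
print(json.dumps(body, indent=2))
```

The buckets that come back under "colors" are both the counts and the list of colors actually present in Boston.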
More Advanced Usages
The JSON facet API has a lot more power built in. First, the facet requests can be nested. Using our example from above, this means that instead of filtering to location:Boston, we can answer the question: “what colors are available at all locations”. You can do this by nesting the color facet inside of a location facet.
    location:{
      buckets:[
        { val:"Boston", count:123,
          color:{ buckets:[
            { val:"red", count:50 },
            { val:"blue", count:40 },
            { val:"white", count:33 } ] } },
        { val:"New York", count:102,
          color:{ buckets:[
            { val:"red", count:35 },
            { val:"blue", count:34 },
            { val:"white", count:33 } ] } }
      ]
    }

In the above example, there are 123 products in Boston (50 red, 40 blue, 33 white) and 102 in New York (35 red, 34 blue, 33 white).
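For reporting, a nested result like the one above usually has to be flattened into rows. The following is a minimal sketch in Python, assuming the response has already been parsed into a dict shaped like the example; the function name is hypothetical and the sample data is simply the numbers from above.

```python
def flatten_buckets(facets: dict, outer: str, inner: str) -> list:
    """Turn a two-level terms facet into (outer value, inner value, count) rows."""
    rows = []
    for outer_bucket in facets.get(outer, {}).get("buckets", []):
        for inner_bucket in outer_bucket.get(inner, {}).get("buckets", []):
            rows.append((outer_bucket["val"], inner_bucket["val"], inner_bucket["count"]))
    return rows


# The example response from above, written as a Python dict.
sample = {
    "location": {"buckets": [
        {"val": "Boston", "count": 123, "color": {"buckets": [
            {"val": "red", "count": 50},
            {"val": "blue", "count": 40},
            {"val": "white", "count": 33}]}},
        {"val": "New York", "count": 102, "color": {"buckets": [
            {"val": "red", "count": 35},
            {"val": "blue", "count": 34},
            {"val": "white", "count": 33}]}},
    ]}
}

rows = flatten_buckets(sample, "location", "color")
for location, color, count in rows:
    print(location, color, count)
```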
Now you can begin to see how one might build an analytics and reporting system based on facets. Most of the things I’ve talked about so far have been written about before. There’s a good amount of information on Yonik’s blog. What I wanted to add is an example of how we’ve used JSON facets with one client to build a comprehensive report, and what we had to do to get there.
Building a Comprehensive Report with Advanced Querying
One of the really nice newer features of the JSON facet API is the ability to “change domains” in a facet request.
Domain change allows you to exclude parts of your base query and/or add a filter in order to change the result set that the facet is working against. Using our example above, if the base query includes a filter location:Boston, you may still want to be able to answer the question about all locations. Without a domain change, the facet result will only contain Boston because that’s what your result set is filtered to. A domain change allows your facet request to ignore that part of the filter/query. This is a powerful thing, because in a single query request, you can obtain your normal query results as well as any other (perhaps totally unrelated) data, in as many facet requests as is necessary. This is essentially like doing multiple searches with a single request.
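One way to express this kind of domain change is to tag the location filter and exclude that tag in the facet's domain. Below is a minimal sketch in Python that builds such a request body; the tag name LOC, the facet labels, and the helper name are assumptions made for illustration.

```python
import json


def request_with_all_locations(location: str) -> dict:
    """Filter results to one location, but still facet across every location."""
    return {
        "query": "*:*",
        # The filter is tagged so that individual facets can opt out of it.
        "filter": [f'{{!tag=LOC}}location:"{location}"'],
        "facet": {
            # Respects the filter: colors available at the chosen location only.
            "colors_here": {"type": "terms", "field": "color", "mincount": 1},
            # Ignores the tagged filter: counts for all locations.
            "all_locations": {
                "type": "terms",
                "field": "location",
                "mincount": 1,
                "domain": {"excludeTags": "LOC"},
            },
        },
    }


body = request_with_all_locations("Boston")
print(json.dumps(body, indent=2))
```

The search results and the colors_here facet stay restricted to Boston, while all_locations is computed as if the location filter had never been applied.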
Our client needed a tool to allow their users to obtain extremely detailed information and calculations on their data. The search entailed several steps. First, it required a basic search to obtain the correct set of documents. Then we had to join (graph really) in other documents that contained useful information. Finally, many calculations and aggregations were run on the data using facets. The JSON facet API allowed us to do this nicely with a single query (with one exception, which I’ll get into later). We used the existing search and facet API to accomplish the task.

However, at some point, it became useful for our client to have a full report of the findings rather than a search interface. Essentially, what they needed was the resulting outputs from all possible queries, of which there were thousands. Issuing thousands of queries to Solr and then manually aggregating the results was not feasible. The astute among you may see where I’m going with this. Since our result data was already facet-based calculations and aggregations, obtaining all possible search results is just a parent facet to the existing query.

Here’s a simplified example. Revisiting our product example, let’s say that the search we built allows users to search for a product and get a breakdown of average sale price per location and color. The facet request might look something like this:
    &q=product_name:"dell xps 15"
    &json.facet={
      avg_by_location_and_color:{
        type:"terms",
        field:"location",
        mincount:1,
        facet:{
          color:{
            type:"terms",
            field:"color",
            facet:{
              avg_price:"avg(price)"
            }
          }
        }
      }
    }
In this example, you’ll see the average price of the “dell xps 15” broken down by location and then color. Now, it might be a little easier to see that if we want to get those numbers for every product, we can omit the original query, and wrap the entire facet in a parent facet that looks at the product_name field (assuming product_name is not tokenized). This is similar to what we were doing for our client to produce a very large report for them.

The one hang-up we had was that their original “product” query wasn’t this simple. In fact, it was a graph query whose job it was to join in other documents containing data we needed for calculations. At the time, the domain functionality of the JSON facet API supported join, but not graph. As a result, we ended up adding the functionality to Solr in version 7.4. This allowed us to recreate our base query as a facet request and obtain all possible values at once. This use case is probably not what you normally want to do, because it’s effectively thousands of queries (in our case) at once and consequently very slow. But this is what our client needed for this specialized case, and it illustrates the power of the JSON facet API nicely.
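The parent-facet wrapping described above can be sketched in a few lines of Python. This is only an illustration for the simplified product example, not our client's actual request: the helper name and the by_product label are made up, and limit -1 (return all buckets) is my assumption about what a full report would want.

```python
import json


def wrap_in_parent(child_facets: dict, field: str, label: str, limit: int = -1) -> dict:
    """Nest an existing facet block under a terms facet on another field."""
    return {
        label: {
            "type": "terms",
            "field": field,
            "mincount": 1,
            "limit": limit,  # -1 asks for every bucket, which can be expensive
            "facet": child_facets,
        }
    }


# The per-product facet from the example above.
per_product = {
    "avg_by_location_and_color": {
        "type": "terms",
        "field": "location",
        "mincount": 1,
        "facet": {
            "color": {
                "type": "terms",
                "field": "color",
                "facet": {"avg_price": "avg(price)"},
            }
        },
    }
}

# No product query any more: match everything and let the parent facet
# enumerate the products instead.
report_request = {
    "query": "*:*",
    "limit": 0,
    "facet": wrap_in_parent(per_product, "product_name", "by_product"),
}
print(json.dumps(report_request, indent=2))
```

Each bucket under by_product then carries the same location and color breakdown that a single-product search would have returned.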
Extending the JSON Facet API
Earlier, I alluded to the fact that we could basically get away with doing everything in a single Solr query. This is really nice for a number of reasons. For one, having all of your data consolidated and returned in a single response eliminates some timing housekeeping you’d otherwise have to do in your UI. We almost get away with doing everything as a single query, but not quite. And this gives me a bit of an opportunity to talk about how I’d like to extend the facet API next. There is one piece of functionality that isn’t present yet in the API.
During our overly complicated faceting hierarchy, at some point we’re obtaining data from one of the facet results and we’d then like to use that data in some further sub-faceting. As a quick and simple example, imagine you want to use our above facet request to get the average price, but then you want to sub-facet to show how many products are below average and how many are above. It would be really nice if you could use your calculated values in descendant facet requests.
    &q=product_name:"dell xps 15"
    &json.facet={
      avg_by_location_and_color:{
        type:"terms",
        field:"location",
        mincount:1,
        facet:{
          color:{
            type:"terms",
            field:"color",
            facet:{
              avg_price:"avg(price)",
              below_avg:{
                type:"query",
                q:"price:[* TO ${avg_price}]"
              },
              above_avg:{
                type:"query",
                q:"price:{${avg_price} TO *]"
              }
            }
          }
        }
      }
    }
Pseudo code showing how we'd like to be able to use calculations in sub-facets
Unfortunately, this currently isn’t possible. For this reason, we have to issue a first request to obtain the values and then a second request with those values encoded in. If this is something you’d like to see implemented at your organization, please reach out!
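To make the two-request workaround concrete, here is a minimal sketch in Python of how the second request could be generated from the first response. The function name, the numbered facet labels, and the sample average prices are all hypothetical; only the field names and the range syntax come from the examples above.

```python
def second_pass_facets(first_facets: dict):
    """Build query facets that count documents below/above each computed average."""
    facets = {}
    labels = {}  # maps the numeric suffix back to (location, color)
    n = 0
    for loc_bucket in first_facets["avg_by_location_and_color"]["buckets"]:
        for color_bucket in loc_bucket["color"]["buckets"]:
            location = loc_bucket["val"]
            color = color_bucket["val"]
            avg = color_bucket["avg_price"]
            scope = f'location:"{location}" AND color:"{color}"'
            facets[f"below_avg_{n}"] = {
                "type": "query",
                "q": f"{scope} AND price:[* TO {avg}]",
            }
            facets[f"above_avg_{n}"] = {
                "type": "query",
                "q": f"{scope} AND price:{{{avg} TO *]",  # exclusive lower bound
            }
            labels[n] = (location, color)
            n += 1
    return facets, labels


# Made-up facet section of a first response, for illustration only.
first = {
    "avg_by_location_and_color": {"buckets": [
        {"val": "Boston", "count": 2, "color": {"buckets": [
            {"val": "red", "count": 1, "avg_price": 1499.0},
            {"val": "blue", "count": 1, "avg_price": 1550.5},
        ]}},
    ]}
}

facets, labels = second_pass_facets(first)
second_request = {"query": 'product_name:"dell xps 15"', "limit": 0, "facet": facets}
for name, facet in facets.items():
    print(name, facet["q"])
```

The obvious downside is that the second request grows with the number of location and color combinations, which is part of why built-in support would be nicer.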