Search Engineer at KMW Technology
What We’ve Accomplished
In Lucene-based search engines like OpenSearch and Solr, keyword aggregations ignore duplicate values that occur within a multi-valued field. If a document has a multi-valued field containing the values [“foo”, “foo”, “bar”], an aggregation increments the count for the “foo” and “bar” buckets once each, even though “foo” occurs twice. We built an OpenSearch plugin to overcome this limitation.
In this blog post, we detail our investigation into duplicate terms aggregation solutions and the subsequent development process by which a plug-in was built for OpenSearch. Previously, only the nested field type and scripted aggregations seemed to be viable solutions. Our plug-in offers a third option: it aims to be the fastest solution for a duplicate terms aggregation on a multi-valued keyword field in OpenSearch without compromising index size.
The link to the plug-in repository can be found here. Please follow the steps in the README.md to install this plug-in into an OpenSearch distribution and interact with the custom aggregation.
The Problem
Imagine you’re rummaging through your attic at your family home and you’ve found a large recipe book. It seems to be something your great-grandmother cherished. As both a devoted great-grandchild and a fan of search solutions, you’d like to catalogue the data found in this recipe book in a search engine, specifically OpenSearch.
You wish to represent each recipe as a document, and each document would store fields for different aspects of the recipe: cook_time, steps, technique, ingredients, etc. Ingredients could have duplicate values designating how many of them to include, e.g. [carrot, carrot, apple, cucumber] would represent 2 carrots, 1 apple, and 1 cucumber for our recipe.
Upon completion of your index, you wish to aggregate upon all the ingredients and find out how many of each ingredient to purchase in order to make all of your great grandmother’s recipes. You notice a problem quite quickly: OpenSearch aggregations do not take into account duplicate values on multi-valued fields. While the recipes may have called for a total of three carrots (two for one recipe and one for a different recipe), the aggregation result only tells you to buy two (one for each recipe).
Documents & Aggregation:
curl -XPUT "http://localhost:9200/recipe-book/_doc/1" -H 'Content-Type: application/json' -d'
{
"favorite_foods": ["carrot", "carrot", "apple", "cucumber"]
}'
curl -XPUT "http://localhost:9200/recipe-book/_doc/2" -H 'Content-Type: application/json' -d'
{
"favorite_foods": ["carrot", "apple", "cucumber"]
}'
curl -XPUT "http://localhost:9200/recipe-book/_doc/3" -H 'Content-Type: application/json' -d'
{
"favorite_foods": ["cucumber", "cucumber", "cucumber"]
}'
curl -XGET "http://localhost:9200/recipe-book/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggregations": {
    "ingredient_frequency": {
      "terms": {
        "field": "favorite_foods"
      }
    }
  }
}'
Result:
"aggregations": {
  "ingredient_frequency": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "cucumber",
        "doc_count": 3
      },
      {
        "key": "apple",
        "doc_count": 2
      },
      {
        "key": "carrot",
        "doc_count": 2
      }
    ]
  }
}
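To make the behavior concrete: a terms aggregation effectively counts each distinct value once per document, while our shopping list needs every occurrence counted. A small Python sketch (independent of OpenSearch) contrasting the two:

```python
from collections import Counter

docs = [
    ["carrot", "carrot", "apple", "cucumber"],
    ["carrot", "apple", "cucumber"],
    ["cucumber", "cucumber", "cucumber"],
]

# What the terms aggregation reports: each distinct value counted once per document.
doc_counts = Counter()
for doc in docs:
    doc_counts.update(set(doc))

# What the shopping list needs: every occurrence counted.
value_counts = Counter()
for doc in docs:
    value_counts.update(doc)

print(doc_counts["carrot"], value_counts["carrot"])      # 2 3
print(doc_counts["cucumber"], value_counts["cucumber"])  # 3 5
```

The doc-count view says two carrots; the value-count view correctly says three.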
Delving Deeper Into the Problem
Why is OpenSearch unable to perform aggregations on duplicate values in keyword fields? Behind the scenes, OpenSearch uses a doc_values data structure for operations like aggregations, as it performs far better for them than the traditional inverted-index data structure. The type of doc_values Lucene uses for keyword fields does not store duplicate values within a document. Conversely, a match_all query does return duplicate values, because it reads the stored document data from the segments, unlike aggregations, which use the aforementioned doc_values data structure.
The following link provides more information about specific field types and whether they support duplicates or not. Note that while this links to Solr documentation and presents Solr specific information, the section on docValues types is a Lucene specific implementation detail that applies to OpenSearch as well.
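As a rough mental model (not Lucene’s actual code), the keyword doc_values can be pictured as a sorted term dictionary plus a deduplicated, sorted set of term ordinals per document:

```python
def build_doc_values(docs):
    """Rough model of Lucene's SortedSetDocValues: a sorted term dictionary
    plus, per document, a sorted set of term ordinals (duplicates collapse)."""
    terms = sorted({t for doc in docs for t in doc})
    ord_of = {t: i for i, t in enumerate(terms)}
    return terms, [sorted({ord_of[t] for t in doc}) for doc in docs]

terms, per_doc_ords = build_doc_values([["foo", "foo", "bar"]])
print(terms)         # ['bar', 'foo']
print(per_doc_ords)  # [[0, 1]] -- "foo" is stored once despite appearing twice
```

Once the duplicate is collapsed at this layer, no aggregation built on top of it can recover the original count.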
Scripting
OpenSearch offers users the ability to include scripts with aggregations via the “scripted_metric” aggregation type. This leverages the Painless scripting language, built specifically for Elasticsearch (and subsequently OpenSearch). Painless has Java-like syntax and compiles directly into JVM bytecode, benefiting from any optimizations the JVM provides.
With Painless, we have access to the “_source” variable, which lets us read the original document data for specific fields. With this approach, we should be able to access the duplicates in our keyword field and aggregate them together via a HashMap defined by our script.
Here’s what the script looked like inside of an aggregation query:
{
  "size": 0,
  "aggs": {
    "ingredient_frequency": {
      "scripted_metric": {
        "init_script": "state.foodFreq = new HashMap();",
        "map_script": "for (f in params._source.favorite_foods) { if (state.foodFreq.get(f) == null) { state.foodFreq.put(f, 1); } else { state.foodFreq.put(f, state.foodFreq.get(f) + 1); }}",
        "combine_script": "return state.foodFreq;",
        "reduce_script": """
          Map finalMap = new HashMap();
          for (map in states) {
            for (key in map.keySet()) {
              if (finalMap.get(key) == null) {
                finalMap.put(key, map[key]);
              } else {
                def prevVal = finalMap[key];
                finalMap.put(key, prevVal + map[key]);
              }
            }
          }
          return finalMap;
        """
      }
    }
  }
}
Notice that this is quite brute-force. We have to account for each and every value, place the counts in a HashMap, and then combine these HashMaps across shards via the “reduce_script”. We also don’t leverage the highly performant doc_values data structure built for aggregations: accessing “_source” pulls information directly from the stored documents in Lucene segments rather than from doc_values. In the benchmarking section, these observations will be substantiated by our gathered performance statistics.
For more information on the scripted_metric aggregation, here’s a link to official documentation.
Nested Fields
Scripting worked within the confines of our original organisation of the recipe book: we wanted our ingredients to be represented in a multi-valued field. However, if we can’t see duplicate values in multi-valued fields due to the doc_values limitation, why not change how the documents themselves are indexed?
OpenSearch offers users the ability to create inner or “nested” documents within other documents. Each nested document is treated, under the hood, as a “hidden” Lucene document, allowing a nested aggregation to see every duplicate value.
We can implement this simply with the following:
PUT /recipe-book
{
  "mappings": {
    "properties": {
      "favorite_foods": {
        "type": "nested",
        "properties": {
          "food": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
PUT /recipe-book/_doc/1
{
  "favorite_foods": [
    { "food": "carrot" },
    { "food": "carrot" },
    { "food": "apple" },
    { "food": "cucumber" }
  ]
}
PUT /recipe-book/_doc/2
{
  "favorite_foods": [
    { "food": "carrot" },
    { "food": "apple" },
    { "food": "cucumber" }
  ]
}
PUT /recipe-book/_doc/3
{
  "favorite_foods": [
    { "food": "cucumber" },
    { "food": "cucumber" },
    { "food": "cucumber" }
  ]
}
{
  "size": 0,
  "aggs": {
    "favorite_foods": {
      "nested": {
        "path": "favorite_foods"
      },
      "aggs": {
        "food_freq": {
          "terms": {
            "size": 10000,
            "field": "favorite_foods.food"
          }
        }
      }
    }
  }
}
"aggregations": {
"favorite_foods": {
"doc_count": 10,
"food_freq": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "cucumber",
"doc_count": 5
},
{
"key": "carrot",
"doc_count": 3
},
{
"key": "apple",
"doc_count": 2
}
]
}
}
}
This implementation is simple at first glance, but it can quickly cause problems at scale. Nested documents are themselves documents, so they take up their own space in the index. 200 recipes with an average of 20 ingredients each would lead to over 4,000 documents. Again, we’ll see how this implementation fares in our benchmarks.
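The arithmetic is simple but worth spelling out: every nested value becomes its own hidden Lucene document, on top of the parent document for each recipe:

```python
recipes = 200
avg_ingredients = 20

# Each ingredient value becomes a hidden Lucene document,
# in addition to the parent document for the recipe itself.
hidden_docs = recipes * avg_ingredients
total_lucene_docs = hidden_docs + recipes
print(total_lucene_docs)  # 4200
```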
For more information on the nested field aggregations, here’s a link to official documentation.
Underscore Representation
Let’s try something different. Let’s go back to using multi-valued fields, but this time, let’s look at how the ingredients themselves are represented. What if we changed the representation from [carrot, carrot, apple, cucumber] to [carrot_2, apple_1, cucumber_1]? By appending our counts, we retain information about duplicates without exposing duplicates to the underlying doc_values data structure, and we can perform this conversion at index time, well in advance. The question that remains is how we can aggregate upon those “suffixes”, the numbers after our underscore delimiter.
We can try writing code directly within a cloned repository of the OpenSearch codebase and test whether we can create a performant solution for this representation of our ingredients. If this solution proves to be a winner in our benchmarks, we can move the code into a plug-in architecture that is easily packageable and installable via a single command by other users.
Here’s how the example from before might look with our new representation:
curl -XPUT "http://localhost:9200/recipe-book" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "favorite_foods": {
        "type": "keyword"
      }
    }
  }
}'
curl -XPUT "http://localhost:9200/recipe-book/_doc/1" -H 'Content-Type: application/json' -d'
{
"favorite_foods": ["apple_1", "carrot_2", "cucumber_1"]
}'
curl -XPUT "http://localhost:9200/recipe-book/_doc/2" -H 'Content-Type: application/json' -d'
{
"favorite_foods": ["apple_1", "carrot_1", "cucumber_1"]
}'
curl -XPUT "http://localhost:9200/recipe-book/_doc/3" -H 'Content-Type: application/json' -d'
{
"favorite_foods": ["cucumber_3"]
}'
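The index-time conversion used above is straightforward to perform in any ETL step. A minimal Python sketch (the function name is ours, not part of the plug-in):

```python
from collections import Counter

def to_underscore_terms(values):
    """Collapse duplicate values into term_count tokens,
    e.g. two carrots become 'carrot_2'."""
    return [f"{term}_{count}" for term, count in Counter(values).items()]

print(to_underscore_terms(["carrot", "carrot", "apple", "cucumber"]))
# ['carrot_2', 'apple_1', 'cucumber_1']
```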
We took a look at the OpenSearch codebase, finding the code path by which traditional terms aggregations are performed. OpenSearch uses one of two “aggregator” classes in a terms aggregation: the GlobalOrdinalsStringTermsAggregator or the MapStringTermsAggregator. The former takes advantage of global ordinals, which map per-segment ordinals into a single “global” ordinal space, since aggregations occur across shards and segments. While this is the default aggregator for terms aggregations, we opted to base our custom implementation on the MapStringTermsAggregator.
The GlobalOrdinalsStringTermsAggregator deals in global ordinals represented as longs, which are not easily converted into Strings, and we need Strings in order to separate the count suffix at the delimiter. The MapStringTermsAggregator uses a representation of doc values that allows ordinals to be converted to BytesRefs and subsequently to Strings.
Following further down the code path, we modified the doc values type used by the MapStringTermsAggregator, specifically SortedBinaryDocValues, and created a custom implementation that separates the suffix from the term (apple_3 → apple, 3), creating buckets with the base terms as keys and the suffixes as counts. We then sent “fake” values to the LeafCollector to mimic the bucket counts we required. For example, the value apple_3 would be seen by the collector as apple, apple (fake), apple (fake).
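A simplified Python model of this bucketing step (the helper names are illustrative, not the plug-in’s actual classes):

```python
def split_term(term, delimiter="_"):
    """Separate the count suffix from a term: 'apple_3' -> ('apple', 3)."""
    base, _, suffix = term.rpartition(delimiter)
    return base, int(suffix)

def collect(term, buckets):
    """Stand-in for the leaf collection step: one real hit plus (n - 1) 'fake'
    hits, so 'apple_3' is counted as apple, apple (fake), apple (fake)."""
    base, n = split_term(term)
    buckets[base] = buckets.get(base, 0) + n

buckets = {}
collect("apple_3", buckets)
collect("apple_1", buckets)
print(buckets)  # {'apple': 4}
```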
On further iteration, we added caching via a HashMap to prevent repeated ordinal lookups and repeated conversions from BytesRef to String, improving performance in the process.
Benchmarks & Takeaways
We now have three implementations that each net us the expected result. It now remains to see which implementation will offer us the best solution in accordance with our guidelines.
We used JMeter to test the query performance. Here are important points regarding how the implementations were benchmarked:
- Documents were indexed via Lucille, an open-source Java framework for ETL pipelines created by KMW Technology; Lucille supports a benchmarking workflow with randomised document creation
- 1 index was created per implementation, each within its own OpenSearch instance
- 100k documents were indexed into each index
- Each document averaged 1000 values in its field, with the range spanning from 750 to 1250
- All queries had random boolean match filters alongside the aggregations so as to prevent the impact of caches
- The underscore representation queries used a “prefix” match, since their values had an additional delimiter and count attached
- This “prefix” matching may have had ramifications for query performance for the underscore implementation
- 50 users sent queries simultaneously with a 1 second ramp-up time
| | Scripted Imp. | Underscore Imp. | Nested Imp. |
|---|---|---|---|
| # of samples | 50 | 50 | 50 |
| Average Latency (ms) | 103907 | 34505 | 11877 |
| Min Latency (ms) | 0 | 0 | 0 |
| Max Latency (ms) | 147451 | 41623 | 16203 |
| Std. Dev | 39522.05 | 7823.51 | 2761.71 |
| Error % | 0% | 0% | 0% |
The scripted implementation saw the longest average query time at around 104 seconds. The underscore implementation sat in the middle with an average query time of around 35 seconds. The nested implementation was the fastest with an average query time of around 12 seconds.
With this information, it may seem clear that the nested approach is the best for handling duplicate terms aggregations in OpenSearch. However, taking a closer look at our proposed guidelines, we notice that we’ve yet to evaluate index size, and there a different conclusion emerges. The nested document index weighs in at a hefty 1.4 GB, while the underscore index boasts a significantly smaller 187 MB. That translates to a 7.5x smaller index for the underscore implementation, at the cost of a 2.9x slower average query time.
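For reference, the ratios quoted follow directly from the benchmark numbers:

```python
nested_index_mb = 1400     # ~1.4 GB reported for the nested index
underscore_index_mb = 187  # reported underscore index size
nested_avg_ms = 11877      # average latencies from the table above
underscore_avg_ms = 34505

print(round(nested_index_mb / underscore_index_mb, 1))  # 7.5
print(round(underscore_avg_ms / nested_avg_ms, 1))      # 2.9
```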
While we didn’t explicitly weigh each guideline’s importance, it was clear to us that a significantly larger index would present more problems than the query time performance hit, especially with the magnitudes mentioned before. This led us to believe that the underscore implementation was the best fit for our needs and the guidelines that we set forth. If index size is not a consideration for users, then the nested approach would be the best, seeing that it offers extremely fast lookup / query times.
The Plug-in
It now makes sense to package this solution into a plug-in. While OpenSearch does offer an easy-to-use template for developers to get started with plug-in development, the convenience stops shortly thereafter. We encountered some problems while developing the plug-in.
Difficulties in Plug-in Creation
The most challenging issue in plug-in development with OpenSearch has to do with Java class loaders. When installing a plug-in into an OpenSearch distribution, OpenSearch uses a parent class loader to load in OpenSearch related files and a child class loader to load in our plug-in files. OpenSearch files are largely package-private, preventing the child class loader from having access to their methods, constructors, etc. This causes large amounts of code duplication and requires a deep understanding of the inner workings of OpenSearch aggregations. Here is a link to a code example that illustrates the issues we encountered with class loaders: link.
Another issue faced was a lack of clear testing protocol for plug-ins. Tests situated within the OpenSearch core codebase, specifically aggregation tests, relied heavily on helper classes and methods found within the core codebase. Our plug-in code, due to the aforementioned class loader issue, did not have access to these assisting classes. This made it extremely difficult to write unit tests, leaving us only with integration tests.
Future Points of Interest
To say that our investigation exhausted all avenues for how to perform a duplicate terms aggregation would be naive. A few ideas were brainstormed (and even tested), but didn’t seem promising or did not fit our original guidelines. Here are a few:
Our plug-in implementation leveraged SortedBinaryDocValues, the primary doc values type for keyword fields. If we were open to changing the field type of our data, we could have experimented with SortedNumericDocValues, which stores duplicates for numeric field types and is the only doc values type in Lucene that retains them. However, we would then need to convert from numeric type to String at query time, leading to a definite performance loss.
Another potential solution would be to utilise Lucene payloads. Payloads are metadata that can be stored with terms and accessed via Lucene. The concern regarding this solution was that payloads would likely not be accessible via doc_values, leaving them as a likely slower alternative.
One solution that was explored thoroughly was term vectors. Term vectors provide statistics for all the terms in the fields of a specific, user-supplied document. This already raised a few concerns, especially since we would require an artificial “master” document containing all possible terms and fields so that its ‘id’ could be provided. With this artificial document approach, term vectors would gather statistics from a randomly selected shard, so the user must either use a single-shard instance or total the counts across all shards by creating “master” documents within each shard. Despite this, we wrote code against Lucene itself comparing doc values to term vectors. For 10 million documents, term vectors averaged around 50 seconds slower than doc values.
Conclusion
Overall, we believe our investigation to be a thorough analysis of the existing possible tools to perform a duplicate terms aggregation. The plug-in we’ve created seems likely to be the best solution for duplicate terms aggregation depending on the individual context and use-case.
With this solution under our belt, we can now make our shopping list and make our great-grandmother proud!