Introduction
New in Solr/Lucene 9.x is the ability to apply inequality operations when searching against payloads on terms. Inequalities make it possible to search for payload values that are greater than, less than, or equal to some threshold.
As search and AI continue to converge, one common use case is searching for documents that have been classified by a machine-learned model. These models typically output a label and a confidence score for each classification. Documents can have many classifications associated with them, generated by many different machine-learned models.
The result is a data model where you have a document and a list of associated classifications. Let’s explore three different approaches to indexing and querying these documents and classifications along with a fourth approach using our recent contribution back to the Lucene and Solr Apache projects: the inequality payload query operator.
We’ll start with a data model that captures the relationship between a document and the document’s associated classifications.
As each document is associated with zero or more classifications, each with its own label and confidence score, the challenge is to answer queries against the index like the following:
“Show me all of the images that have a high confidence that they are of a person.”
For this example query, we’ll use an example index created by running the COCO2017 image dataset through the Yolo and VGG16 image classification neural networks. The classifications output by each model represent the objects that were detected in the image.
The VGG16 model provides classifications across 1,000 object types with their confidence levels. The Yolo model provides 80 classes of detected objects with their confidence scores, plus bounding box coordinates of where each object was found in the image.
Approach 1: Do it at index time
One common approach to this problem is to tag a document as a “person” only if the classification score was above a particular threshold. Having a single field on the document for the “high confidence” classifications is a perfectly valid approach. The queries remain very simple, as they just look for a particular label in the classification field of the document.
Example:
high_confidence:person
However, this prevents the user from searching for documents with medium-confidence labels. To address this, the indexing process could populate two fields, one with “high confidence” labels and another with “medium confidence” labels.
Then you can search the “medium confidence” field or the “high confidence” field for the label in question. The trade-off is that a search across multiple fields is required at query time. As requirements for the confidence filter become more granular, additional bucketed fields need to be created. If the definition of those confidence levels changes, the documents would need to be re-indexed to recompute the proper labels for each bucket. This approach works only if the definition of a high-confidence label is known up front and will never change.
Example:
high_confidence:person OR medium_confidence:person
In reality, users want the ultimate flexibility to decide what they consider a high confidence classification. User interfaces might even want to present this as a slider control in a search UI.
Approach 2: Do it at query time with a field for the labels
To avoid having to re-index all the documents when the definition of high confidence changes, there are a few viable options at query time to solve this problem.
One approach is to store the confidence score for each label generated by the models in separate fields on the document, one field per label.
Once the documents are indexed with a field for each label’s score, a range search can be executed on the field that corresponds to the label that is being searched. In the above example query, searching for documents that have a “person” with high confidence could be implemented as a range search on the “person” field.
For example:
person:[0.75 TO *]
This approach has its drawbacks as well: multiple persons might be detected in an image, each with their own confidence level. At best, you could choose the highest confidence level among the person classifications and use that for search purposes, or potentially use a multi-valued field to represent each score.
An additional drawback is that each possible label requires another field in the index. This approach would require dynamic fields unless all of the possible labels are known up front. Some models can classify thousands of different object types, resulting in an explosion in the number of fields in the index. Aside from the field count becoming unruly, the UI and query language would also need to know how to map each label to the appropriate field name to properly construct the query.
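As a sketch, assuming the dynamic field conventions from Solr’s default configset (where a *_fs suffix maps to a multiValued float type), a single dynamic field rule could cover label fields that aren’t known up front:

<dynamicField name="*_fs" type="pfloats" indexed="true" stored="true"/>

Note that while this avoids declaring every label in the schema, each distinct label still creates a real field in the underlying index.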
Approach 3: Use joins to perform the filtering
The next possible approach is to index each classification as a child document and run parent/child join queries against Solr. Solr’s join query parser allows for all sorts of database-like joins between two queries, enabling relational queries between two different datasets. As wonderful as that sounds (and it is wonderful!), there are some limitations and drawbacks to this approach. The first and most notable is that with a multi-shard index, special attention needs to be paid to how the documents are routed to the shards: the key constraint for parent/child document queries is that documents must be routed to shards based on the value of their join key.
In this approach, a document is indexed for each classification that is generated for the document being classified, so every document in the index has zero or more classification documents that relate to it. For some applications it’s fine to change how routing works, but depending on the query use cases, this might not be possible.
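For example, with Solr’s compositeId router, giving each classification document the same route-key prefix as its parent would co-locate the two on the same shard (the IDs below are hypothetical):

# Parent document ID
doc_729413!doc_729413
# Classification document ID, sharing the parent’s route key
doc_729413!doc_729413_5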
The second notable drawback to the join query approach is that it is slow and expensive to execute compared to simpler term and range queries.
Lastly, the join approach introduces a large number of additional documents to the index, which impacts the overall search latency of any given query. Query latency is roughly proportional to the number of documents per shard, so the introduction of all the classification documents affects cluster sizing: the index must now be sized to support the additional documents. That typically means more shards, and potentially servers with additional CPUs.
The basic approach is to search on classifications but return the parent document.
Search all documents that have a child document that contains a label of “foo” with a confidence greater than 0.75.
For example:
{!join to="id" from="parent_id" v="+label:foo +confidence:[0.75 TO *]"}
Approach 4: New & improved inequality payload support!
Upon review, it seemed that an alternative approach using payloads might solve this problem. The payload_check query parser in Solr provides access to the SpanPayloadCheckQuery. This Lucene query exposes the payload data during matching: once a term has been matched, the payload is checked to determine whether it’s actually a match. This functionality has been used to support things like part-of-speech tagging of terms. It lets developers use a delimited payload field in the Solr schema, where each value is encoded as a term and a payload value separated by a pipe.
An example of part-of-speech tagged text using the delimited payload:
Lucene|NOUN is|VERB awesome|ADJ
In addition to strings, a payload can hold an integer or a floating-point number, encoded like:
Person|0.82 Cat|0.75 Monitor|0.55
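To index values like these, the field needs an analyzer that splits off the payload and encodes it as a float. A minimal sketch, modeled on the delimited_payloads_float field type that ships with Solr’s default configset (the _dpfs field suffix used elsewhere in this article is assumed to map to a multiValued field of this type):

<fieldType name="delimited_payloads_float" class="solr.TextField" indexed="true" stored="false">
  <analyzer>
    <!-- split on whitespace, then peel off the |-delimited payload and encode it as a float -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
  </analyzer>
</fieldType>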
The only missing piece to being able to use this for the AI model classification search was the ability to do a range search on the payload instead of just an equals operation.
We recently had the pleasure of contributing this capability back to Lucene and Solr. The patch adds the ability to specify a comparison operation during matching, supporting operations such as greater than and less than in addition to the existing equals operator. We’re happy to see that it will be included in the upcoming Solr 9.0 release.
So, now we can express the desired query as:
{!payload_check f="classifications_dpfs" op="gt" payloads="0.75" v="Person"}
The new payload check query parser allows the user to specify the operation and the payload value to use as the reference for that operation. The above example searches the “classifications_dpfs” field for any classification of “Person” where the payload has a value greater than 0.75.
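The other comparison operators work the same way; for example (assuming the op parameter also accepts gte for greater-than-or-equal and lt for less-than):

{!payload_check f="classifications_dpfs" op="gte" payloads="0.5" v="Person"}
{!payload_check f="classifications_dpfs" op="lt" payloads="0.25" v="Person"}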
So, with this last option, we can use the payload check query parser in Solr to search for labels from the classification model that are above a threshold confidence score.
The ultimate result is the full ability to search on a term with one additional dimension of metadata filtering: term and payload. This unlocks a lot of other possibilities when it comes to image search.
One possible example is to store the number of people found as the payload. Yolo can detect multiple objects in a scene, so the user could search for something like:
“At least 2 people and a tennis racket.”
The classifier output from Yolo would be encoded in the following format:
Person|2 Tennis_Racket|1
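A sketch of how that search might be expressed as two payload_check filter queries (reusing the class_dpfs field from the benchmark examples below, and assuming a gte operator for greater-than-or-equal):

fq={!payload_check f=class_dpfs op=gte payloads=2 v=Person}
fq={!payload_check f=class_dpfs op=gte payloads=1 v=Tennis_Racket}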
A similar approach can be taken if you want to encode the center point X and Y coordinates. This would enable searches like:
“Find me a picture with a Person on the left and a Pizza on the right.”
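For instance, the normalized center X coordinate could be encoded as the payload in its own field (the center_x_dpfs field name here is hypothetical), with “left” and “right” expressed as being below or above 0.5:

Person|0.25 Pizza|0.80

fq={!payload_check f=center_x_dpfs op=lt payloads=0.5 v=Person}
fq={!payload_check f=center_x_dpfs op=gt payloads=0.5 v=Pizza}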
The capabilities increase further when you start combining scene labels with object classification, for example Yolo + VGG16, to get more context into the search results.
By combining these two computer vision models we can start asking questions beyond what Yolo alone can answer.
“Show me a picture of at least 2 people skiing”
Yolo detects the number of people, and the VGG16 model detects the act of skiing.
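A sketch of the combined query, with each model’s output indexed into its own payload field (the yolo_dpfs and vgg16_dpfs field names are hypothetical):

fq={!payload_check f=yolo_dpfs op=gte payloads=2 v=Person}
fq={!payload_check f=vgg16_dpfs op=gt payloads=0.75 v=ski}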
Benchmarks
First, some notes about how the benchmarks were generated:
- Each of the four approaches used the same configset.
- All cache sizes (except for the perSegment cache) were set to zero (see the sketch below).
- All indices in the test were a single shard with one segment and no deleted documents.
- Each index contains 1,000,000 documents.
- Each document has an average of 50 classifications associated with it, each with a random confidence score from 0 to 1.
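As an illustration of that cache setup, a hypothetical solrconfig.xml sketch (the exact benchmark configuration may have differed):

<filterCache class="solr.CaffeineCache" size="0" initialSize="0" autowarmCount="0"/>
<queryResultCache class="solr.CaffeineCache" size="0" initialSize="0" autowarmCount="0"/>
<documentCache class="solr.CaffeineCache" size="0" initialSize="0" autowarmCount="0"/>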
Approach 1 Document Example: A multi-valued field that contains only the labels that had a confidence score greater than 0.75.
{
"id":"doc_78375",
"classification_ss": ["label_7792","label_9689","label_2049","label_6581",
"label_4795","label_1780","label_6323","label_4939",
"label_8818","label_877","label_8524","label_3159",
"label_4776","label_1276","label_7104","label_4335",
"label_108","label_5675","label_424","label_7745",
"label_4144","label_2712","label_4782","label_9894",
"label_2777","label_1267","label_8919","label_5332",
"label_7159","label_4139","label_6714"]}
Approach 2 Document Example: A document with a field for each label containing the value of the confidence score for that label.
{
"id":"doc_doc_20295",
"label_264_fs":[0.5059256],
"label_4777_fs":[0.8075591],
"label_5773_fs":[0.6512147],
"label_6584_fs":[0.44860387],
"label_5053_fs":[0.77677464],
"label_2307_fs":[0.99134594],
"label_9447_fs":[0.6198554],
"label_1534_fs":[0.993752],
"label_2489_fs":[0.3812973],
"label_1578_fs":[0.67712253],
"label_7583_fs":[0.9276807],
"label_796_fs":[0.9256864],
"label_345_fs":[0.21647614],
"label_8326_fs":[0.42797613],
"label_6023_fs":[0.16159433],
"label_3220_fs":[0.82546365],
"label_4067_fs":[0.940214],
"label_932_fs":[0.71239257],
"label_7323_fs":[0.33341646],
"label_5203_fs":[0.072936356],
"label_8474_fs":[0.6330075],
"label_2032_fs":[0.45183575],
"label_6371_fs":[0.8656315],
"label_3488_fs":[0.052257597],
"label_4137_fs":[0.15416396],
"label_6509_fs":[0.52511454],
"label_8989_fs":[0.41732424],
"label_1067_fs":[0.45847535],
"label_5892_fs":[0.9040163],
"label_8169_fs":[0.26057434],
"label_3524_fs":[0.26224774],
"label_4912_fs":[0.037523687],
"label_3772_fs":[0.587384],
"label_3116_fs":[0.2450012],
"label_5069_fs":[0.70538366],
"label_5939_fs":[0.90539235],
"label_3369_fs":[0.4885615],
"label_3798_fs":[0.35722762],
"label_5111_fs":[0.3300156],
"label_1684_fs":[0.18698442],
"label_6735_fs":[0.7144106],
"label_7830_fs":[0.79478115],
"label_3240_fs":[0.7042537],
"label_2410_fs":[0.09114766],
"label_5419_fs":[0.83966637],
"label_4207_fs":[0.0035191178],
"label_4484_fs":[0.5441104],
"label_6761_fs":[0.69559073],
"label_2117_fs":[0.39736092],
"label_1298_fs":[0.8430424],
"label_5847_fs":[0.22964293],
"label_3950_fs":[0.6538746],
"label_6558_fs":[0.69730353],
"label_1273_fs":[0.024785161],
"label_3927_fs":[0.0013412237],
"label_1759_fs":[0.75639284],
"label_3553_fs":[0.5268485],
"label_5143_fs":[0.8450275],
"label_9301_fs":[0.12585384]}
Approach 3 Example Document: One document represents the record being classified, plus an additional document for every classification associated with it. This index contained 1,000,000 parent documents and 50,485,334 classification documents.
# Example Parent Document
{
"id":"doc_729413"
}
# Example Classification Document (on average, 50 of these per parent document)
{
"id":"doc_729413_5",
"parent_id_s":"doc_729413",
"label_s":"label_4067",
"confidence_f":0.21850449
}
Approach 4 Example Document: A document with a single field containing all classification labels with the confidence level encoded as a payload.
{
"id":"doc_doc_333114",
"class_dpfs":["label_6872|0.7949082","label_3550|0.22995031",
"label_7561|0.66999483","label_3575|0.76349306",
"label_6567|0.29170412","label_9300|0.22878802",
"label_7146|0.15249991","label_7853|0.26775044",
"label_9330|0.69193286","label_315|0.3812859",
"label_9941|0.60243666","label_5448|0.4909597",
"label_3898|0.2772404","label_3798|0.58473474",
"label_1040|0.39535648","label_3017|0.5128221",
"label_5611|0.1511501","label_7777|0.93621653",
"label_6589|0.48112237","label_6023|0.69422346",
"label_6525|0.49891192","label_6580|0.41795957",
"label_6807|0.5424252","label_6233|0.3097676",
"label_624|0.9763578","label_1726|0.3847112",
"label_9139|0.5317108","label_6330|0.64640564",
"label_2413|0.24439555","label_1372|0.17762291",
"label_562|0.3614515","label_7828|0.7642365",
"label_3786|0.19018763","label_9409|0.75261647",
"label_619|0.18773353","label_3496|0.4446562",
"label_3556|0.43172425","label_7278|0.5559282",
"label_6851|0.42342138"]}
Benchmark Results
Index Comparison
| Metric | Approach 1 | Approach 2 | Approach 3 | Approach 4 | Comments |
|---|---|---|---|---|---|
| Docs/sec (DPS) | 11,761 | 209 | 409 | 3,566 | Documents indexed on a single thread generating random data; updates batched, no commits while feeding; only a single replica active |
| Index size (MB) | 219.32 | 906.99 | 2,610 | 1,220 | |
| Memory | 1,772 | 1,361,108 | 5,556 | 1,460 | As reported by the Solr admin UI |
| Test query | classification_ss:label_7792 | label_7792_fs:[0.75 TO *] | {!join from="parent_id_s" to="id" v="+label_s:label_7792 +confidence_f:[0.75 TO *]"} | {!payload_check op="gt" f="class_dpfs" payloads="0.75"}label_7792 | Example syntax |
Indexing Benchmark Commentary
There are some interesting takeaways from this benchmark.
Indexing Performance
- As expected, the Approach 1 documents were the fastest to index, since some data is discarded. Pre-filtering the data yields smaller documents, and as a result they are faster to index.
- Approach 2 documents were much larger than those of the other approaches, and as a result this approach was the slowest to index. It was slightly surprising to see how much of a negative impact having many fields on a document has.
- Approach 3, with the child documents, was about 2x faster than Approach 2 for indexing.
- Approach 4, with the payloads, was about 16 times faster than Approach 2 to index. The payload approach clearly has a major indexing performance advantage over the other approaches.
Index Size
- Approach 1 yielded the smallest index, as data was thrown away and the data model is the simplest.
- Approach 2 was the most efficient in terms of index size among Approaches 2, 3, and 4. This suggests that some of that approach’s indexing performance problems can simply be attributed to its verbose JSON formatting.
- Approach 3, using child documents, had the largest index size overall, showing the impact of the additional child documents in the index.
- Approach 4 was only about 33% larger than the field-based Approach 2 and less than half the size of Approach 3. The key item to note here is the additional .pay files in the index that store the payload data.
Memory Usage
- Approach 4 had the lowest memory overhead for having the index open. This was a very surprising result.
- Approach 2 yielded the largest surprise: its index heap usage was nearly 1,000x that of the other approaches. This really highlights the overhead of having 10,000 fields in the index, which is dramatic and shouldn’t be ignored.
Query Performance Comparison
| Metric | Approach 1 | Approach 2 | Approach 3 | Approach 4 | Comments |
|---|---|---|---|---|---|
| Number of queries | 10,000 | 10,000 | 100* | 10,000 | *Join queries were so slow that the test was stopped after 100 queries |
| Average (ms) | 1 | 2 | 1,803 | 3 | |
| Min (ms) | 1 | 1 | 1,656 | 2 | |
| Max (ms) | 6 | 22 | 2,092 | 47 | |
| Std dev (ms) | 0.55 | 0.79 | 88.28 | 1.55 | |
| Throughput (queries/sec) | 636 | 347.9 | 0.56 | 254 | Single-threaded execution |
| Sent (KB/sec) | 109 | 60.09 | 0.15 | 61.5 | |
| Received (KB/sec) | 6,124 | 8,988.84 | 0.82 | 948.88 | |
| Avg response (bytes) | 9,861 | 26,461.3 | 1,499 | 3,826.1 | Average response size in bytes |
Query Benchmark Commentary
From a pure performance perspective, it is again no surprise that Approach 1 is the fastest. The join query approach (Approach 3) is orders of magnitude slower than simple term and range queries in Solr. The second-best query performance came from the field-per-label approach (Approach 2), which really highlights how efficient Solr/Lucene is at performing range query operations. It also shows that using payloads costs little compared to the expensive join operations.
One other thing to note about the payload approach (Approach 4): for this dataset, the JSON encoding of the documents with the payloads is even tighter than that of Approach 1.
Next Steps
There are always things that can be improved and built upon. As a result of implementing this feature, it was noticed that the payload encoding and decoding code is a bit fragmented. It would be nice to centralize and consolidate some of that logic with the hope of making payloads more extensible.
There are some other novel queries that can be performed with the new payload support. For example, if using the classifications as a feature vector, a query can be created to find other documents that were classified in a similar manner to implement a “find similar” for image data.
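As a rough sketch, such a “find similar” query could be approximated by OR-ing payload_check clauses built from the source document’s strongest classifications, using Solr’s nested query syntax (the labels and thresholds here are purely illustrative):

q=_query_:"{!payload_check f=class_dpfs op=gt payloads=0.7 v=Person}" OR _query_:"{!payload_check f=class_dpfs op=gt payloads=0.7 v=Tennis_Racket}"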
Going beyond this, one could envision extending the payload-supported data types to include a vector of floating-point values, enabling vector-based matching calculations such as cosine, Manhattan, or Euclidean distance metrics for similarity.
The ultimate realization of the power of this functionality will come through NLU techniques that translate free-text queries into the appropriate payload check queries. More to come on that.
Conclusion
The biggest takeaway here is that as you require more granular query capabilities, the expense of the query goes up. If an application truly needs a full relational-style join query, then special attention must be paid to how that system is scaled. We can also see that payloads are a very attractive design pattern: their indexing performance, memory usage characteristics, query performance, and query granularity make them a viable way for applications to avoid scaling up to support join queries.