The built-in scoring mechanism in Elasticsearch and Solr can seem mysterious to beginners and experienced practitioners alike. Instead of delving into the mathematical definitions of TF-IDF and BM25, this article will help you develop an intuitive understanding of these metrics by walking you through a series of simple examples. Each example consists of a query and a list of several indexed documents. As you read along, try to guess which document comes up on top for each query. In each case, we will examine why that particular document gets the highest score, and we’ll extract the general principle behind this behavior. A set of six examples will be followed by an extra credit section focusing on more advanced topics. Along with illustrating all of the key behaviors of BM25, our examples will touch on some of the gotchas around scoring in cluster scenarios, where shards and replicas come into play. This article aims to teach you, in a short time and without any math, everything you’ll ever need to know about scoring. A solid understanding of scoring will prepare you to better diagnose relevance problems and improve relevance in real-world applications.
Query 1: dog
Let’s say I search for “dog” and there are only three documents in my index, as shown below. Which one of these documents is going to come up on top?
Doc 1: "dog"
Doc 2: "dog dog"
Doc 3: "dog dog dog
If you’re not quite sure, that’s good, because I haven’t given you enough context to know the answer. All the queries in this article were tested in Elasticsearch 7.4, where BM25 is the default scoring algorithm and its parameters take their default values of k1=1.2 and b=0.75. (Please ignore that if it’s meaningless to you.) In most of the examples, we’ll assume the documents have a single text field called “title” that uses the standard analyzer:
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "standard"
}
}
}
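By the way, if you ever want to experiment with values other than the defaults, you can define a custom BM25 similarity in the index settings and point the field at it. Here’s a minimal sketch; “tuned_bm25” is just a name I’ve made up, and the parameter values shown are the defaults:
PUT test
{
  "settings": {
    "index": {
      "similarity": {
        // "tuned_bm25" is an arbitrary name I chose, not a built-in
        "tuned_bm25": {
          "type": "BM25",
          "k1": 1.2,
          "b": 0.75
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "similarity": "tuned_bm25"
      }
    }
  }
}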
For the most part, we’ll be doing simple match queries against the title field. Recall that the default boolean operator in Elasticsearch is OR. Here’s what our “dog” query looks like:
GET test/_search
{
  "query": {
    "match": {
      "title": "dog"
    }
  }
}
With those details out of the way, are you ready to tell me which one of those three documents (“dog,” “dog dog,” and “dog dog dog”) is going to get the highest score?
Here are the results:
ID | Title | Score |
1 | dog | 0.167 |
2 | dog dog | 0.183 |
3 | dog dog dog | 0.189 |
Doc 3 gets the highest score because it contains the highest number of tokens that match the query term. Another way to say this is that Doc 3 has the highest term frequency for the term in question. Of course, if we were using the keyword analyzer, only Doc 1 would have matched the query, but we’re using the standard analyzer, which breaks the title into multiple tokens. The moral of this story is that, from a scoring perspective at least, the more often a document mentions a term, the better: higher term frequency means a higher score.
Before we move on, take a moment to compare Doc 1 and Doc 3. Notice that although Doc 3 has three times the term frequency for “dog” as Doc 1, its score isn’t three times as high. So while higher term frequency gives a higher score, its impact is not multiplicative.
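If you want to see exactly where these numbers come from on your own index, you can ask Elasticsearch to break a score down by adding "explain": true to the search request. This is the same “dog” query as above; the response will show the term frequency, document frequency, and length normalization behind each document’s score:
GET test/_search
{
  "query": {
    "match": {
      "title": "dog"
    }
  },
  "explain": true
}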
Query 2: dog dog cat
Now I’m searching for “dog dog cat” and there are only two documents in my index:
Doc 1: "cat"
Doc 2: "dog"
Which one of these is going to come up on top? Or are they going to be tied?
In fact, “dog” is the winner here:
ID | Title | Score |
1 | cat | 0.6 |
2 | dog | 1.3 |
Why does “dog” get twice the score of “cat”? The lesson here is that repeated terms in the query matter: each occurrence of a query term contributes to the score independently.
Our query has two instances of “dog.” The score for the whole query is the sum of the scores for each term. Each instance of “dog” in our query is going to match the “dog” in Doc 2 and contribute roughly 0.6 to the score, for a total of 1.3. Using the standard analyzer, query terms aren’t deduplicated, so each instance of “dog” is treated separately. Doc 1 doesn’t get a similar advantage because our query only contains one instance of “cat.”
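You can confirm this tokenization with the _analyze API; the standard analyzer keeps both copies of “dog” as separate tokens:
GET _analyze
{
  "analyzer": "standard",
  "text": "dog dog cat"
}
The response lists three tokens: “dog,” “dog,” and “cat.”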
Query 3: dog dog cat
Now I’m executing the same query as before, “dog dog cat,” but my index is different. This time I have lots of “dog” documents:
Doc 1: "dog"
Doc 2: "dog"
Doc 3: "dog"
Doc 4: "dog"
Doc 5: "dog"
Doc 6: "dog"
Doc 7: "cat"
What’s going to happen now? Do the “dog” documents still win over “cat” because my query mentions “dog” twice? Here are the results:
ID | Title | Score |
1 | dog | 0.4 |
2 | dog | 0.4 |
3 | dog | 0.4 |
4 | dog | 0.4 |
5 | dog | 0.4 |
6 | dog | 0.4 |
7 | cat | 1.5 |
The results are different this time because the terms have different document frequencies than before. A term’s document frequency is the number of documents in the index that contain the term. From a scoring perspective, low document frequency is good and high document frequency is bad. In this example, “cat” is a rare term in the index (it has low document frequency), so matches on that term help the score more than matches on “dog,” which is a common term. The lesson here is that matches on rare terms count for more than matches on common terms.
If I want to tell the search engine that a common term is particularly important to me in a certain scenario, I can boost the term. If I had executed my query with a boost of 7 on “dog,” the dog documents would come up above the cat document. Here’s how I’d set that up:
GET test/_search
{
  "query": {
    "query_string": {
      "query": "dog^7 cat",
      "fields": ["title"]
    }
  }
}
Query 4: dog cat
In this example I’m searching for “dog cat,” and I’ve got three documents in my index: one with a lot of dogs, one with a lot of cats, and one with a single instance of dog and cat each, plus a lot of mats.
Doc 1: "dog dog dog dog dog dog dog"
Doc 2: "cat cat cat cat cat cat cat"
Doc 3: "dog cat mat mat mat mat mat"
Which document comes up on top this time? Notice that in Doc 1 and Doc 2, every single term matches one of the query terms, whereas in Doc 3 there are five terms that don’t match anything. So the results might be a little surprising:
ID | Title | Score |
1 | dog dog dog dog dog dog dog | 0.88 |
2 | cat cat cat cat cat cat cat | 0.88 |
3 | dog cat mat mat mat mat mat | 0.94 |
Document 3 gets the highest score because it has matches for both of the query terms, “dog” and “cat.” While Documents 1 and 2 have higher term frequency for “dog” and “cat” respectively, they each contain only one of the terms. The lesson is that matching more of the distinct query terms is better than racking up a high term frequency on just one of them.
Query 5: dog
Now I’ll search for “dog” and there are only two documents in my index:
Doc 1: "dog cat zebra"
Doc 2: "dog cat
Both of these documents match my query, and both have some terms that don’t match. Which document does better? Here are the results:
ID | Title | Score |
1 | dog cat zebra | 0.16 |
2 | dog cat | 0.19 |
In this case, Document 2 does better because it is shorter. The thinking is that when a term occurs in a shorter document, we can be more confident that the term is significant to the document (or that the document is about the term). When a term occurs in a longer document, we have less confidence that this occurrence is meaningful. The lesson here is that a match in a shorter field counts for more than a match in a longer one.
Query 6: orange dog
Now let’s consider a scenario that’s a little more complicated than the previous ones. For the first time, our documents will have two fields, “color” and “type.”
Doc 1: {"color": "brown", "type": "dog"}
Doc 2: {"color": "brown", "type": "dog"}
Doc 3: {"color": "brown", "type": "cat"}
Doc 4: {"color": "orange", "type": "cat"}
We’re searching for “orange dog” but as you can see, there are no orange dogs in the index. There are two brown dogs, a brown cat, and an orange cat. Which one is going to come up on top?
I should mention that we’re searching across both fields using a multi_match like this:
GET test/_search
{
  "query": {
    "multi_match": {
      "query": "orange dog",
      "fields": ["type", "color"],
      "type": "most_fields"
    }
  }
}
Here are the results:
ID | Color | Type | Score |
1 | brown | dog | 0.6 |
2 | brown | dog | 0.6 |
3 | brown | cat | (no match) |
4 | orange | cat | 1.2 |
This example hints at some of the unexpected behavior that can arise when we search across multiple fields. The search engine doesn’t know which field is most important to us. If someone is searching for an orange dog, we might guess they’re more interested in seeing dogs than seeing arbitrary things that happen to be orange. (“Orange dog” would be a very strange query to enter if you meant “show me anything that’s orange.”) In this case, however, the color field is taking priority because “orange” is a rare term within that field (there are 3 browns and only 1 orange). Within the type field, “dog” and “cat” have the same frequency. The orange cat comes up on top because a match for the rare term “orange” is treated as more valuable than a match for “dog.”
If we want to give more weight to the “type” field we can boost it like this:
GET test/_search
{
  "query": {
    "multi_match": {
      "query": "orange dog",
      "fields": ["type^2", "color"],
      "type": "most_fields"
    }
  }
}
With the boost applied, the “brown dog” documents now score 1.3 and come up above the “orange cat.”
The lesson here is that searching across multiple fields can be tricky because the per-field scores are added together without concern for which fields are more important. To rectify this, boost the fields that matter most to your users.
Query 7: dog
Now we’re moving into advanced territory although our query looks simpler than anything we’ve seen before. We’re searching for “dog” and our index has three identical “dog” documents:
Doc 1: "dog"
Doc 2: "dog"
Doc 3: "dog"
Which one is going to come up on top? You might guess that these three documents, being identical, should get the same score. So let’s take a moment to look at how ties are handled. When two documents have the same score, they’ll be sorted by their internal Lucene doc id. This internal id is different from the value in the document’s _id field, and it can differ even for the same document across replicas of a shard. If you really want ties to be broken in the same way regardless of which replica you hit, you can add a sort to your query, where you sort first by _score and then by a designated tiebreaker like _id or date.
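Here’s a sketch of what such a tiebreaking query might look like, assuming the documents carried a created_at date field to break ties with (the field name is hypothetical; our toy documents don’t actually have one):
GET test/_search
{
  "query": {
    "match": {
      "title": "dog"
    }
  },
  "sort": [
    "_score",
    // created_at is a hypothetical tiebreaker field
    { "created_at": "desc" }
  ]
}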
But this point about tiebreaking is only an aside. When I actually ran this query, the documents didn’t come back with identical scores!
ID | Title | Score |
1 | dog | 0.28 |
2 | dog | 0.18 |
3 | dog | 0.18 |
How is it possible that identical documents would get different scores? The lesson in this example is that term statistics are computed per shard, so identical documents sitting on different shards can receive different scores.
Though I didn’t state it explicitly at the beginning of the post, all our previous examples were using one shard. In the current example, however, I set up my index with two shards. Document 1 landed on Shard 1 while Documents 2 and 3 landed on Shard 2. Document 1 got a higher score because within Shard 1, “dog” is a rarer term — it only occurs once. Within Shard 2, “dog” is more common — it occurs twice. Here’s how I set up the example:
PUT /test
{ "settings": { "number_of_shards": 2 } }
PUT /test/_doc/1?routing=0
{ "title" : "dog" }
PUT /test/_doc/2?routing=1
{ "title" : "dog" }
PUT /test/_doc/3?routing=1
{ "title" : "dog" }
If you’re working with multiple shards and you want scores to be consistent regardless of which shard a document lives in, you can do a Distributed Frequency Search by adding the following parameter to your query: search_type=dfs_query_then_fetch. This tells Elasticsearch to retrieve term statistics from all the shards and combine them before computing the scores.
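Applied to our “dog” query, it looks like this:
GET test/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "match": {
      "title": "dog"
    }
  }
}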
But it’s also important to know that deleted documents can go on influencing scores until they are physically removed.
This is a consequence of how deletion is handled in Lucene. Documents that are marked for deletion but not yet physically removed (which happens when their segments are merged) still contribute to term statistics. When a document is deleted, all the replicas will immediately “know” about the deletion, but they might not carry it out physically at the same time, so they might end up with different term statistics. To reduce the impact of this, you can specify a user or session ID in the preference parameter, which controls which shard copies serve the request. This encourages Elasticsearch to route requests from the same user to the same replicas so that, for example, a user will not notice scoring discrepancies when issuing the same query multiple times.
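For example, with user_1234 standing in for whatever session identifier your application generates (the value is a placeholder; any string that doesn’t start with an underscore will do):
GET test/_search?preference=user_1234
{
  "query": {
    "match": {
      "title": "dog"
    }
  }
}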
It’s also important to know that, from a scoring perspective, an update is really a delete plus a re-index, and the old version of the document keeps affecting scores until it is physically removed.
When you update a document in Lucene, a new version of it is written to disk and the old version is marked for deletion. But the old version continues to contribute to term statistics until it is physically deleted.
In the example below, I create a “dog cat” document and then I update its contents to be “dog zebra.” Immediately after the update, if I query for “dog” and look at the explain output, Elasticsearch tells me there are two documents containing the term “dog.” The number goes down to one after I do a _forcemerge. The moral: if you’re doing relevancy tuning in Elasticsearch, looking closely at scores, and also updating documents at the same time, be sure to run a _forcemerge after your updates, or else rebuild the index entirely.
PUT test/_doc/1
{ "title": "dog cat" }

GET test/_search?format=yaml
{
  "query": { "match": { "title": "dog" } },
  "explain": true
}

PUT test/_doc/1?refresh
{ "title": "dog zebra" }

GET test/_search?format=yaml
{
  "query": { "match": { "title": "dog" } },
  "explain": true
}

POST test/_forcemerge

GET test/_search?format=yaml
{
  "query": { "match": { "title": "dog" } },
  "explain": true
}
Query 8: dog cat
Now let’s take another look at Query 4, where we searched for “dog cat” and we found that a document containing both terms did better than documents with lots of instances of one term or the other. In Query 4, we were searching the title field, the only field available. Here, we’ve got two fields, pet1 and pet2:
Doc 1: {"pet1": "dog", "pet2": "dog"}
Doc 2: {"pet1": "dog", "pet2": "cat"}
We’ll do a multi_match including those two fields like this:
GET test/_search
{
  "query": {
    "multi_match": {
      "query": "dog cat",
      "fields": ["pet1", "pet2"],
      "type": "most_fields"
    }
  }
}
Since Document 2 matches both of our query terms we’d certainly hope that it does better than Document 1. That’s the lesson we took away from Query 4, right? Well, here are the results:
ID | Pet 1 | Pet 2 | Score |
1 | dog | dog | 0.87 |
2 | dog | cat | 0.87 |
You can see the documents are tied. It might look like “cat” is a rare term that should help Document 2 rise to the top, but within the Pet 2 field, “cat” and “dog” have the same document frequencies, and the scoring is done on a per-field basis. It also looks like Document 2 should get an advantage for matching more of the query terms, but again, the scoring is done on a per-field basis: when we compute the score in the Pet 1 field, both documents do the same; when we compute the score in the Pet 2 field, both documents again do the same.
Does this contradict what we learned from Query 4? Not quite, but it warrants a refinement of the earlier lesson: matching more of the query terms only helps within a single field. When per-field scores are added up, a document gets no extra credit for covering the query terms across different fields.
If you’re not happy with this situation, there are some things you can do. You can combine the contents of Pet 1 and Pet 2 into a single field. You can also switch from the most_fields to the cross_fields query type to simulate a single field. (Just be aware that cross_fields has some other consequences on scoring, changing the procedure from field-centric to term-centric. We won’t go into details here.)
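As a sketch of the first option, a copy_to mapping can funnel both fields into one combined field that you query instead; all_pets is a name I’ve invented for this example:
PUT test
{
  "mappings": {
    "properties": {
      // all_pets is a made-up name for the combined field
      "pet1": { "type": "text", "copy_to": "all_pets" },
      "pet2": { "type": "text", "copy_to": "all_pets" },
      "all_pets": { "type": "text" }
    }
  }
}

GET test/_search
{
  "query": {
    "match": {
      "all_pets": "dog cat"
    }
  }
}
Now Document 2 matches both query terms within a single field, so by the lesson of Query 4 it should come out on top.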
Query 9: orange dog
Now let’s revisit the lesson from Query 5, where we saw that matches in shorter fields are better. We’re going to search for “orange dog.” We have a “dog” document with a description mentioning that the dog is brown. And we have a “cat” document with a description mentioning that the cat is sometimes orange. Notice that the dog document has a longer description than the cat document, and both descriptions are longer than the contents of the type field.
Doc 1: {"type": "dog", "description": "A sweet and loving pet that is always eager to play. Brown coat. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non nibh sagittis, mollis ex a, scelerisque nisl. Ut vitae pellentesque magna, ut tristique nisi. Maecenas ut urna a elit posuere scelerisque. Suspendisse vel urna turpis. Mauris viverra fermentum ullamcorper. Duis ac lacus nibh. Nulla auctor lacus in purus vulputate, maximus ultricies augue scelerisque."}
Doc 2: {"type": "cat", "description": "Puzzlingly grumpy. Occasionally turns orange."}
We’ll do a multi_match like this:
GET test/_search
{
  "query": {
    "multi_match": {
      "query": "orange dog",
      "fields": ["type", "description"],
      "type": "most_fields"
    }
  }
}
What’s going to take precedence here: the match for “dog” in the type field, which is a really short field, or the match for “orange” in the description field, which is significantly longer? If matches in shorter fields are better, shouldn’t “dog” win here? In fact, the results look like this:
ID | Type | Description | Score |
1 | dog | A sweet… brown… | 0.69 |
2 | cat | Puzzlingly grumpy… orange. | 1.06 |
The match for “orange” in the Description field is sending Document 2 to the top even though that match occurs in a longer field body than the match for “dog” in Document 1. Does this contradict what we learned about matches in short and long fields from Query 5? No, but it points to something we hadn’t mentioned. The lesson is that field length is relative: a field counts as short or long compared to the average length of that same field across the index, not compared to other fields.
Within the type field, “dog” and “cat” don’t get an advantage for being short because, in fact, they’re both of average length for that field. On the other hand, the description for the cat is shorter than average for the description field overall, so it gets a benefit for being “short.”
Query 10: abcd efghijklmnopqrstuvwxyz
Here’s an easy one to close out our set of examples. I’ve divided the alphabet into two query terms. One of them is a short term, “abcd,” and the other is a long term, “efghijklmnopqrstuvwxyz.” I’m going to search for both terms together: “abcd efghijklmnopqrstuvwxyz.” My index has one match for each term:
Doc 1: "abcd"
Doc 2: "efghijklmnopqrstuvwxyz"
Which document is going to do the best? The results look like this:
ID | Title | Score |
1 | abcd | 0.69 |
2 | efghijklmnopqrstuvwxyz | 0.69 |
Why are the documents tied if it’s true that matches in shorter fields are better? The lesson is that field length is measured in tokens, not characters.
When we talk about a short or long field, we’re talking about how many terms the field contains, not how many characters. In this example, the titles for both documents are of length 1.
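If you’re skeptical, the _analyze API will confirm that the long title really is a single token:
GET _analyze
{
  "analyzer": "standard",
  "text": "efghijklmnopqrstuvwxyz"
}
The response contains exactly one token, so both titles have a length of 1.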
That’s it for now. Hopefully these examples have helped build your intuitions about how scoring works in Elasticsearch and Solr. Thanks for following along! If you’d like to go further and understand how the BM25 scoring function actually achieves the behaviors we’ve seen here, check out our companion article on Understanding TF-IDF and BM25.