For the first time ever Solr can now support a multi-sharded, distributed join query. As of Solr 8.6, there is a new join method; the Cross Collection Join Query (SOLR-13749). Currently there are multiple ways to perform a join query in Solr: the Block Join query parser, Join query parser, Graph query parser and Streaming Expressions all can provide some notion of join capabilities. As the name implies, the Cross Collection Join query is specifically built to allow for queries to join documents across two different, distributed collections. This post covers the technical implementation of the Cross Collection Join, we’ll be following this up soon with another post dedicated to sample use cases and benchmarks for using the new Cross Collection Join query.
How it works
The Cross Collection Join is a new join method that executes a query against a remote Solr collection to retrieve the set of join keys that will be used as a filter query against the local Solr collection. The CrossCollectionJoinQuery will first query a remote solr collection and get back a streaming expression result of the join keys. As the join keys are streamed to the node, a bitset of the matching documents in the local index is built up. This avoids keeping the full set of join keys in memory at any given time. This bitset is then inserted into the filter cache upon successful execution as with the normal behavior of the solr filter cache.
Additionally, if the local index is sharded according to the join key field, the Cross Collection query can leverage a secondary query parser called the “hash_range” query parser. The hash_range query parser is responsible for returning only the documents that hash to a given range of values. If the local collection is routed based on the join key used, each shard participating in the Cross Collection Join Query will be able to retrieve only the join keys from the remote collection that potentially match in the local shard. This has the benefit of making sure that network traffic doesn’t increase as the number of shards increases and allows for much greater scalability.
The Cross Collection Join Query works with both String and Point types of fields. The field that is being used for the join key in the remote collection must also have docValues enabled. The field that is being used for the join key in the local collection must be a single valued string field in order to take advantage for the hash range optimization described above. Additionally, the documents in the local collection must be routed to shard based on the join key field for the optimization.
The remote solr collection does not have any specific sharding requirements.
Example usage
{!join method="crossCollection" fromIndex="otherCollection" from="fromField" to="toField"}*:*
or
{!join method="crossCollection" fromIndex="otherCollection" from="fromField" to="toField" v="*:*"} (preferred)
{!join method="crossCollection" fromIndex="otherCollection" from="fromField" to="toField"}*:* or {!join method="crossCollection" fromIndex="otherCollection" from="fromField" to="toField" v="*:*"} (preferred)
glossary of terms
Term | Description |
Local Collection | This is the main collection that is being queried. |
Remote Collection | This is the collection that the Cross Collection Join Query will query to resolve the join keys. |
CrossCollectionJoinQuery | The lucene query that executes a search to get back a set of join keys from a remote collection |
HashRangeQuery | The lucene query that matches only the documents whose hash code on a field falls within a specified range. |
cross collection join query parameters
Param | Required | Description |
All others | Any normal Solr parameter can also be specified as a local param. | |
fromIndex | Required | The name of the external Solr collection to be queried to retrieve the set of join key values (required) |
from | Required | The join key field name in the external collection (required) |
routed | Optional | true / false. If true, the Cross Collection Join query will use each shard’s hash range to determine the set of join keys to retrieve for that shard. This parameter improves the performance of the cross-collection join, but it depends on the local collection being routed by the toField. If this parameter is not specified, the Cross Collection Join query will try to determine the correct value automatically. |
solrUrl | Optional | The URL of the external Solr node to be queried (optional). The solrUrl must be added to the “allowSolrUrls” list in the solrconfig.xml file |
to | Required | The join key field name in the local collection |
ttl | Optional | The length of time that a Cross Collection Join query in the cache will be considered valid, in seconds. Defaults to 3600 (one hour). The Cross Collection Join query will not be aware of changes to the remote collection, so if the remote collection is updated, cached Cross Collection Join queries may give inaccurate results. After the ttl period has expired, the Cross Collection Join query will re-execute the join against the remote collection. |
v | See Note | The query to be executed against the external Solr collection to retrieve the set of join key values. Note: The original query can be passed at the end of the string or as the “v” parameter. It’s recommended to use query parameter substitution with the “v” parameter to ensure no issues arise with the default query parsers. |
zkHost | Optional | The connection string to be used to connect to Zookeeper. zkHost and solrUrl are both optional parameters, and at most one of them should be specified. If neither of zkHost or solrUrl are specified, the local Zookeeper cluster will be used. (optional) |
Installation
In order to use the Hash Range query optimization, the solrconfig.xml for local and remote collections needs to be updated as defined below.
Solrconfig.xml – Local Collection
<queryParser name="join" class="org.apache.solr.search.JoinQParserPlugin">
<str name="routerField">product_id</str>
</queryParser>
Normally, this query parser is automatically registered by Solr. In order to take advantage of the Hash Range optimization the routedField property must be set on the JoinQParser in the Solrconfig .xml. In this example, we’ve specified the “product_id” field to indicate the field name that will support the routed=true parameter optimization.
Solrconfig.xml – Remote Collection
<cache name="hash_product_id"
class="solr.LRUCache"
size="128"
initialSize="0"
regenerator="solr.NoOpRegenerator"/>
Make sure to define the cache for the Hash Range query with a size equal to the maximum number of segments that are expected to be in the index. (typically this number stays well below 100.)
potential issues
- A large number of join keys will take a lot of network bandwidth to transmit to all the shards.
- The hash_product_id cache will not explicitly expunge entries until the max size is reached.
[…] 2, 2022 Solr Payload Inequalities By Kevin Watters June 12, 2021 The Cross Collection Join Query By Dan Fox July 15, 2020 Solr JSON Facets for Reporting and Data Aggregation […]