Introduction
You want to optimize the following SQL command’s performance.
SELECT * FROM <table-name-1>
INNER JOIN <table-name-2> ON <table-name-1>.<column-1> = <table-name-2>.<column-1> OR <table-name-1>.<column-2> = <table-name-2>.<column-2>
Additional context
JOIN ON
uses the PhotonBroadcastNestedLoopJoin
. The nested loop approach in PhotonBroadcastNestedLoopJoin
can be slow for large datasets because it involves a Cartesian product-like operation (pairing every row from one dataset with every row from the other, creating all possible row-to-row combinations).
Instructions
Change your SQL code to use UNION instead.
SELECT * FROM <table-name-1>
INNER JOIN <table-name-2> ON <table-name-1>.<column-1> = <table-name-2>.<column-1>
UNION
SELECT * FROM <table-name-1>
INNER JOIN <table-name-2> ON <table-name-1>.<column-2> = <table-name-2>.<column-2>
UNION
uses the PhotonBroadcastHashJoin
to perform parallel joins, which are more efficient for joining large datasets. PhotonBroadcastHashJoin
uses a hash join algorithm where one side of the join is broadcast to all nodes, and a hash table is built on the broadcast side. The other side of the join then probes this hash table to find matching rows.