Sunday, December 23, 2012

Joins

Equi-Join is when the join columns are related using equality. e.g. T1.C1 = T2.C1
Theta-Join is when the join columns are related with > or < operators e.g. T1.C1 > T2.C1
Certain join methods only work with equi-join while some may work with any type of joins

When joining 2 tables, if there are rows that does not have matching join value, inner join will not return these rows. Outer-join on the other hand will return such rows in the result set. Example of such condition:

T1.CustID = T2.CustID - there are CUSTID in T1 not found in T2 and vice versa.

No-matching rows returned from outerjoin will have some of the fields (those in the table without the join value) empty.

Join Techniques

(1) Nested Loop

Used when the number of rows return is expected to be small and indexes are available to pick up those rows efficiently. Filter the matching row in driving table, look up the matching entry in the driven table via index. Use the ROWID to access the data.
Nested loop is efficient only if the select condition is highly restrictive. To retrieve each row, IO are incurred on both indexes and tables. If the result set is large, the IO cost could be higher than scanning through the table once to find all matching rows.

(2) Sort-Merge

There is no concept of driving or driven table. Both tables are equal. Rows from both tables are first sorted in join column order. The resulting tables are then merged to find the matching rows. This join technique use full table scan. If the sort work area is not large enough to accomodate the full table, additional IO will incur to swap the rows in and out of memory and increase retrieval time. There are cases that the optimizer piggyback the sort required by the ORDER BY or GROUP BY processing.

(3) Hash Join

It uses 2-phases processing as well. In the BUILD phase, picks the smaller (driving) table and places the rows into different bin by hashing the join value. In the PROBE phase, scan through the driven table, hash the join value to compute the slot, then compare with the join value already in the slot for a match.

One pit fall of hash join is when the join column has a skew distribution such that some slots contain no entry while some contain a very large number of entries. This makes the comparison in PROBE phase sub-optimal. The performance is also affected by the size of hash work area that holds the driving table.

(4) Cluster Join

It is a rare type of joins. The idea is to store the data chunks most likely to be needed at the same time in the same data block, thus speed up retrieval. The procesing is similar to nested loop. For each row in the driving row, use the cluster index to find the atching row in the driven table. These rows will probvably be in the same data block.
  Cluster join can be used when the join condition corresponding to a clustering key in the driven table.