Sunday, May 27, 2012

Join Strategies

Join strategies depends on indexes, table sizes and selectivity (measure of distinct values in an index). 

Nested-loop Join

for each row in outer table {
    for each row in the inner table {
        if outer table join column matches inner table join column then add to temp table
        else fail
    }
}

The smaller size table should be the inner table so that it can be cached and reduce the number of page read.  If cache is too small, the DBMS needs to look for index to avoid scanning every row in the inner table.  Thus the table with good index should be the inner table.  Having good indexes in both tables is not necessary useful as the outer table only goes through once.  If there are multiple join columns, a compund index (c1, c2) is needed.

If there is a restrictive expression in the SQL, for example

select * from table1, table2
    where table1.c1 = table2.c1
    and table1.c2 = "abc"

the table with the restrictive expression should be the outer table.  If table1 is the inner table, the restrictive expression will need to be applied every time through the inner loop.

To speed up the inner loop, IBM hybrid join sort data so outer table access will be in the order of the join column on the inner table.  Scan the inner table and apply the join column, build a temporary table by appending the RID of the inner table to the outer table row Soprt the temporary table by RID order (in other words, the outer table is now in RID order of the inner table).  Retrieve the inner row based on the RID in the temporary table and form the result set.  Because the retrieval is in RID order, performance increase with prefetch.  Hybrid join is essentailly a nested join that requires the outer table be in sequence.

Sort-merge Join

/*sort phase*/
sort table1
sort table2

get first row of table1
get first row of table2

/*merge phase*/
for (;; until no more rows in tables) {
    if (join coumn in table1 < join column in table2) then get next row in table1
    else if (join column in table1 > join column in table2) then get next row in table2
    else if  (join column in table1 = join column in table2) then {
        pass
        get next row in table1
        get next row in table2
    }
}

This method is always going forward.  Compares to nested loop, sort-merge only scan the row once.  The down side is the cost of sorting.  If both tables are large, merge sort join is faster generally.  However, DBMS prefers nested loop because it requires less RAM and no start up overhead.  Sort merge join is favourable especially if both table are already in the right sequence (cluster index sequence).

Hash Join

for each row in inner, compute a hash
for each row in outer, compute a hash
    if hash value is in inner table hash bucket then pass
    else fail

RAM is required to hold the temporary table of hash.  Therefore enough RAM is a pre-requisite condition for this join.  The join must be a equal join.  Also the inner table join column must be unique and can be uniquely hashed.  Hash join is useful when the outer table is very large which means the inner table will be searched many time.

When more than 2 tables are joined, each join is processed separatedly.  There is no such thing as nested loop join within a nested loop.  One reason is that the joins could be processed in parallel.  Another reason is that nested loop within nested loop could keep the inner table out of cache.

No comments: