Monday, November 15, 2010

SQL Query optimization - Part-6 (How NATURAL JOINs work)

(The devil lies in the details edition)

To show how JOINing in databases works, we'll take up a very simple example and make some simplifying assumptions about databases & indexes (assumptions that do hold in practice) so that we can reason clearly and still grasp the key insights behind how NATURAL JOINs work.

Let's assume that we have 2 tables X & Y that we want to join on a column named 'c'.
SELECT * FROM X JOIN Y ON X.c=Y.c WHERE <condition>

For the sake of simplicity, we shall draw parallels between the tables X & Y and 2 in-memory arrays AX & AY. Also, let the sizes of tables X & Y be sX & sY respectively. We shall consider a few cases and try to analyze how JOINing would work in each of these scenarios:
  1. No index on 'c' in either table: The DB is forced to perform an O(sX.sY) operation by running code that looks like a nested loop (see the first sketch after this list). Hence, this type of join is called a nested loop join. This is the most expensive type of join and must be avoided if the tables X & Y are very large.
  2. B-Tree index on 'c' only in table X: (This case looks symmetrical to a B-Tree index being present on 'c' only in table Y -- but not quite). This case is akin to one of the arrays being sorted and the other being unsorted. Here, the execution engine can loop over the table that doesn't have an index (the unsorted array in our analogy) and, for each element, look up the corresponding value in the table with the index (a sorted-array lookup using binary search). The cost of joining is O(sY.log sX) if the index is on X, or O(sX.log sY) if it is on Y -- which is why it matters which table has the index and why the 2 cases aren't exactly symmetrical. It is advisable to have the index on the larger table, to keep the linear component of the complexity as low as possible (see the lookup-join sketch after this list).
  3. Hash index on 'c' only in table X: (This case is symmetrical to a Hash index being present on 'c' only in table Y -- but again, not quite). This case is similar to the one above: depending upon whether table X or table Y has the index, the JOIN complexity will be O(sY) or O(sX) respectively. Again, it is advisable to have the index on the larger table.
  4. Hash index on 'c' in both tables: This case isn't much different from the previous one, except that the execution planner can decide whether to iterate over table X or table Y depending upon which table has fewer rows.
  5. B-Tree index on 'c' in both tables: This is a very interesting and common case, and is similar to the problem of finding the intersection of 2 sorted arrays AX & AY. It should be clear that the complexity of doing so is O(sX + sY). This bound is NOT tight, though: if one table is much smaller than the other, the work can be closer to min(sX, sY), so some of the join strategies above can (in theory) be faster than this one. However, the execution planner may not always choose to use both indexes if it has heuristic data that suggests that iterating over one of the tables would be much cheaper (say because it contains only 2 rows). This join is most commonly referred to as the Merge Join because it looks like you are merging 2 arrays (see the merge-join sketch after this list).
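
To make case 1 concrete, here is a minimal Python sketch of a nested loop join over the two in-memory arrays from the analogy. The row representation (dicts keyed by column name) and the column name 'c' are assumptions made purely for illustration, not how a database engine actually lays out rows.

# Nested loop join (case 1): no index on 'c' in either table.
# AX and AY stand in for tables X and Y; each row is a dict.
def nested_loop_join(AX, AY):
    result = []
    for rx in AX:                # O(sX) outer loop
        for ry in AY:            # O(sY) inner loop => O(sX.sY) in total
            if rx['c'] == ry['c']:
                result.append((rx, ry))
    return result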
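
Cases 2, 3 & 4 all have the same shape: scan the table without an index and probe the other table's index once per row. Here is a sketch of both variants, assuming the index is on Y; a sorted list stands in for the B-Tree index and a dict stands in for the hash index, which is a modelling assumption rather than what a real engine stores.

import bisect

def btree_lookup_join(AX, AY_sorted):
    # AY_sorted models a B-Tree index on Y.c: rows of Y kept sorted on 'c'.
    # Each probe is a binary search, so the join costs O(sX.log sY).
    # (Extracting the key list below is part of the sketch; a real
    # B-Tree already has its keys materialized.)
    keys = [ry['c'] for ry in AY_sorted]
    result = []
    for rx in AX:
        i = bisect.bisect_left(keys, rx['c'])
        while i < len(keys) and keys[i] == rx['c']:
            result.append((rx, AY_sorted[i]))
            i += 1
    return result

def hash_lookup_join(AX, AY_hash):
    # AY_hash models a hash index on Y.c: a dict mapping each 'c' value
    # to the list of rows of Y with that value. Each probe is O(1)
    # expected, so the join costs O(sX) expected.
    result = []
    for rx in AX:
        for ry in AY_hash.get(rx['c'], ()):
            result.append((rx, ry))
    return result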
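
And case 5 in Python: a merge join that walks the two sorted arrays in lock-step. The duplicate-key handling (re-scanning the matching run on the other side) is included because values in 'c' need not be unique; again, the dict-per-row format is just an assumption of the sketch.

# Merge join (case 5): both inputs sorted on 'c' (B-Tree indexes on both).
# Costs O(sX + sY) comparisons, plus the work of emitting matches when
# keys repeat on both sides.
def merge_join(AX_sorted, AY_sorted):
    result = []
    i = j = 0
    while i < len(AX_sorted) and j < len(AY_sorted):
        cx, cy = AX_sorted[i]['c'], AY_sorted[j]['c']
        if cx < cy:
            i += 1
        elif cx > cy:
            j += 1
        else:
            # Equal keys: emit every pairing of the two matching runs.
            j_start = j
            while i < len(AX_sorted) and AX_sorted[i]['c'] == cx:
                j = j_start
                while j < len(AY_sorted) and AY_sorted[j]['c'] == cx:
                    result.append((AX_sorted[i], AY_sorted[j]))
                    j += 1
                i += 1
    return result

# Tiny smoke test (both arrays already sorted on 'c'):
AX = [{'c': 1}, {'c': 2}, {'c': 2}]
AY = [{'c': 2}, {'c': 3}]
print(merge_join(AX, AY))  # two pairs, both joining on c == 2

All three sketches produce the same set of (row from X, row from Y) pairs for a given input; only the amount of work differs, which is exactly the trade-off the execution planner is reasoning about.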
