This is a short note to help understand the GenericJoin algorithm from Skew Strikes Back. This algorithm computes the (equi)join across several relations, and its performance is optimal in a certain sense: if we fix the schema and size of each input relation but don’t otherwise constrain them, there is a maximal size \(S\) that their join can have; then on any problem instance, the algorithm enumerates the output tuples in at most \(O(S)\) time (the derivation of \(S\) is also interesting; see my other note).
A one-dimensional join problem is one where each relation is a predicate (it has one column). Equivalently, each relation is a certain subset of values drawn from that column’s value type. The join problem then reduces to set intersection (of two or more sets). To meet the bounds given in the paper, we need to compute set intersections sufficiently quickly, roughly in time proportional to the size of the smallest set involved. There are two good ways to do this:

1. Keep each set in a hash table; iterate over the smallest set and probe the others, discarding any element that is absent from some set.
2. Keep each set sorted; iterate over the smallest set and look up each element in the others with binary (or galloping) search.

Using the second method introduces a log-factor into the overall runtime, but could be faster in practice in some cases.
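Here is a Python sketch of both intersection strategies (the function names and exact structure are my own; this is illustrative, not tuned):

```python
from bisect import bisect_left

def intersect_hash(sets):
    """Multiway intersection: iterate the smallest set, probe the others.
    Each probe is an (expected) constant-time hash lookup."""
    smallest = min(sets, key=len)
    others = [s for s in sets if s is not smallest]
    return {x for x in smallest if all(x in s for s in others)}

def _contains(sorted_list, x):
    # Binary search membership test: the source of the extra log factor.
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x

def intersect_sorted(arrays):
    """Multiway intersection over sorted lists via binary search.
    Costs a log factor per probe, but avoids hashing entirely."""
    smallest = min(arrays, key=len)
    others = [a for a in arrays if a is not smallest]
    return [x for x in smallest if all(_contains(a, x) for a in others)]
```

For example, `intersect_hash([{1, 2, 3}, {2, 3, 4}, {3, 5}])` gives `{3}`.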
This is called a multiway algorithm because it operates on two or more sets simultaneously without relying on intermediate results. In contrast, a binary algorithm would join relations two at a time.
Example: say we have relations \(R \subseteq A\times B\) and \(S \subseteq B\times C\) and \(T \subseteq A\times B\times C\). This problem is three-dimensional, because jointly the relations have three distinct columns (also called attributes).
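To make the join semantics concrete, here is a made-up toy instance of these three relations, joined by brute force (the data is hypothetical, and brute force is just for reference, not the algorithm):

```python
# Hypothetical toy instance of the relations above.
R = {(1, 2), (1, 3)}        # subset of A x B
S = {(2, 4), (3, 5)}        # subset of B x C
T = {(1, 2, 4), (1, 3, 9)}  # subset of A x B x C

# The join is every (a, b, c) consistent with all three relations at once.
join = {(a, b, c)
        for (a, b, c) in T
        if (a, b) in R and (b, c) in S}
print(sorted(join))  # [(1, 2, 4)]
```

Only \((1, 2, 4)\) survives: \((1, 3, 9)\) is in \(T\), but \((3, 9)\) is not in \(S\).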
If we have an \(n+2\)-dimensional problem, we decompose it into two strictly smaller problems of dimension \(a\) and \(b\) so that \(a+b = n+2\). To compute the join, we

1. solve the \(a\)-dimensional join problem over the first \(a\) columns, using each relation’s projection onto those columns;
2. for each tuple in that result, solve the \(b\)-dimensional join problem over the remaining columns, using each relation’s slice by that tuple.

Note that each time we start step (2), we have a different partial join result coming from step (1), so the relations being joined change each time.
A particularly easy way to decompose is \(n+2 = 1 + (n+1)\). We solve a one-dimensional problem on the first column and then recurse on the rest.
In the example above, find the values of \(A\) that are shared by \(R\) and \(T\). For each of those values \(a \in A\), we have a single “row” \(R_a\) that is a subset of \(B\). Similarly, we have a slice of \(T\), \(T_a\), that is a subset of \(B\times C\). So, for each value of \(A\), we compute a join over \(B,C\): we handle this recursively (and \(2 = 1 + 1\), so we reach the base case after that).
So, we could compute this particular join with a nested set of loops, one for each column:
for a in R /\ T:
    for b in R_a /\ S /\ T_a:
        for c in S_b /\ T_ab:
            yield (a, b, c)
note: I’m being a little sloppy with notation; for example, R /\ T really means intersecting the projections of \(R\) and \(T\) onto their shared column \(A\), and subscripts denote slices by the current partial tuple.
note: typically we expect that all the relations are prepared into some sort of trie data structure so that their projections can be iterated quickly (with constant delay between successive tuples) and their slices can be accessed in constant time given the current partial join.