This is a short note to help understand the GenericJoin algorithm from Skew Strikes Back. This algorithm computes the (equi)join across several relations, and its performance is optimal in a certain sense: if we fix the schema and size of each input relation but don’t otherwise constrain them, there is a maximal size \(S\) that their join can have; then on any problem instance, the algorithm enumerates the output tuples in at most \(O(S)\) time (the derivation of \(S\) is also interesting; see my other note).
A one-dimensional join problem is one where each relation is a predicate (it has one column). Equivalently, each relation is a certain subset of values drawn from that column’s value type. The join problem then reduces to set intersection (of two or more sets). To meet the bounds given in the paper, we need to compute set intersections sufficiently quickly, roughly in time proportional to the size of the smallest set involved. There are two good ways to do this:

1. Keep each set in a hash table; iterate over the smallest set and probe the others, discarding any element that is absent from some set.
2. Keep each set sorted; iterate over the smallest set and look up each element in the others with binary (or galloping) search.

Using the second method introduces a log-factor into the overall runtime, but could be faster in practice in some cases.
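Here is a Python sketch of both intersection strategies (the function names and exact structure are my own; this is illustrative, not tuned):

```python
from bisect import bisect_left

def intersect_hash(sets):
    """Multiway intersection: iterate the smallest set, probe the others.
    Each probe is an (expected) constant-time hash lookup."""
    smallest = min(sets, key=len)
    others = [s for s in sets if s is not smallest]
    return {x for x in smallest if all(x in s for s in others)}

def _contains(sorted_list, x):
    # Binary search membership test: the source of the extra log factor.
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x

def intersect_sorted(arrays):
    """Multiway intersection over sorted lists via binary search.
    Costs a log factor per probe, but avoids hashing entirely."""
    smallest = min(arrays, key=len)
    others = [a for a in arrays if a is not smallest]
    return [x for x in smallest if all(_contains(a, x) for a in others)]
```

For example, `intersect_hash([{1, 2, 3}, {2, 3, 4}, {3, 5}])` gives `{3}`.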
This is called a multiway algorithm because it operates on two or more sets simultaneously without relying on intermediate results. In contrast, a binary algorithm would join relations two at a time.
Example: say we have relations \(R \subseteq A\times B\) and \(S \subseteq B\times C\) and \(T \subseteq A\times B\times C\). This problem is three-dimensional, because jointly the relations have three distinct columns (also called attributes).
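To make the join semantics concrete, here is a made-up toy instance of these three relations, joined by brute force (the data is hypothetical, and brute force is just for reference, not the algorithm):

```python
# Hypothetical toy instance of the relations above.
R = {(1, 2), (1, 3)}        # subset of A x B
S = {(2, 4), (3, 5)}        # subset of B x C
T = {(1, 2, 4), (1, 3, 9)}  # subset of A x B x C

# The join is every (a, b, c) consistent with all three relations at once.
join = {(a, b, c)
        for (a, b, c) in T
        if (a, b) in R and (b, c) in S}
print(sorted(join))  # [(1, 2, 4)]
```

Only \((1, 2, 4)\) survives: \((1, 3, 9)\) is in \(T\), but \((3, 9)\) is not in \(S\).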
If we have an \(n+2\)-dimensional problem, we decompose it into two strictly smaller problems of dimension \(a\) and \(b\) so that \(a+b = n+2\). To compute the join, we

1. solve the \(a\)-dimensional join problem over the first \(a\) columns, using each relation’s projection onto those columns;
2. for each tuple in that result, solve the \(b\)-dimensional join problem over the remaining columns, using each relation’s slice by that tuple.

Note that each time we start step (2), we have a different partial join result coming from step (1), so the relations being joined change each time.
A particularly easy way to decompose is \(n+2 = 1 + (n+1)\). We solve a one-dimensional problem on the first column and then recurse on the rest.
In the example above, find the values of \(A\) that are shared by \(R\) and \(T\). For each of those values \(a \in A\), we have a single “row” \(R_a\) that is a subset of \(B\). Similarly, we have a slice of \(T\), \(T_a\), that is a subset of \(B\times C\). So, for each value of \(A\), we compute a join over \(B,C\): we handle this recursively (and \(2 = 1 + 1\), so we reach the base case after that).
So, we could compute this particular join with a nested set of loops, one for each column:
for a in R /\ T:
    for b in R_a /\ S /\ T_a:
        for c in S_b /\ T_ab:
            yield (a, b, c)
note: I’m being a little sloppy with notation; for example, R /\ T really means intersecting the projections of \(R\) and \(T\) onto their shared column \(A\), and subscripts denote slices by the current partial tuple.
note: typically we expect that all the relations are prepared into some sort of trie data structure so that their projections can be iterated quickly (with constant delay between successive tuples) and their slices can be accessed in constant time given the current partial join.