Consider the following extended relational-algebra operators. Describe how to implement each operation using sorting and using hashing.
a. Semijoin (): The multiset semijoin operator is defined as follows: if a tuple appears times in , it appears times in the result of if there is at least one tuple such that and satisfy predicate ; otherwise does not appear in the result.
b. Anti-semijoin (): The multiset anti-semijoin operator is defined as follows: if a tuple appears times in , it appears times in the result of if there does not exist any tuple in such that and satisfy predicate ; otherwise does not appear in the result.
As in the case of join algorithms, semijoin and anti-semijoin can be done efficiently if the join conditions are equijoin conditions. We describe below how to efficiently handle the case of equijoin conditions using sorting and hashing. With arbitrary join conditions, sorting and hashing cannot be used; (block) nested loop join needs to be used instead.
a. Semijoin:
- Semijoin using sorting: Sort both and on the join attributes in . Perform a scan of both and similar to the Merge-Join Algorithm (read section 15.5.4.1) and add tuples of to the result whenever the join attributes of the current tuples of and match. The following modifies pseudocode of the Merge-Join Algorithm given in the book to work for Semijoin:
- Semijoin using hashing: Create a hash index in on the join attributes in . Iterate over , and perform a hash lookup in . If the hash lookup returns a value, add the current tuple of r to the result. Note that if and are large, they can be partitioned on the join attributes first and the above procedure applied on each partition. If is small but is large, a hash index can be built on and probed using ; and if an tuple matches an tuple, the tuple and all of its copies can be output and deleted from the hash index.
b. Anti-semijoin:
-
Anti-semijoin using sorting: Sort both and on the join attributes in . Perform a scan of both and similar to the merge algorithm and add tuples of to the result if no tuple of satisfies the join predicate for the corresponding tuple of .
-
Anti-semijoin using hashing: Create a hash index in on the join attributes in . Iterate over , and for each distinct value of the join attributes, perform a hash lookup in . If the hash lookup returns a null value, add the current tuple of to the result. As for semijoin, partitioning can be used if and are large. An index on can be used instead of an index on , but then when an tuple matches an tuple, the tuple is deleted from the index. After processing all tuples, all remaining tuples in the index are output as the result of the anti-semijoin operation.