15.13

Suppose you want to find documents that contain at least $k$ of a given set of $n$ keywords. Suppose also you have a keyword index that gives you a (sorted) list of identifiers of documents that contain a specified keyword. Give an efficient algorithm to find the desired set of documents.

Let $S$ be a set of $n$ keywords that we are given. An algorithm to find all documents that contain at least $k$ of these keywords is given below:

This algorithm calculates a reference count for each document identifier. A reference count of $i$ for a document identifier $d$ means that at least $i$ of the keywords in $S$ occur in the document identified by $d$ . The algorithm maintains a list of records, each having two fields - a document identifier, and the reference count for this identifier. This list is maintained sorted on the document identifier field.

Note that execution of the second for statement causes the list $D$ to “merge” with the list $L$ . Since the lists $L$ and $D$ are sorted, the time taken for this merge is proportional to the sum of the lengths of the two lists. Thus the algorithm runs in time (at most) proportional to $n$ times the sum total of the number of document identifiers corresponding to each keyword in $S$ .

2ndSem

Explorer

15.13

Graph View