Suppose you want to find documents that contain at least of a given set of keywords. Suppose also you have a keyword index that gives you a (sorted) list of identifiers of documents that contain a specified keyword. Give an efficient algorithm to find the desired set of documents.


Let be a set of keywords that we are given. An algorithm to find all documents that contain at least of these keywords is given below:

This algorithm calculates a reference count for each document identifier. A reference count of for a document identifier means that at least of the keywords in occur in the document identified by . The algorithm maintains a list of records, each having two fields - a document identifier, and the reference count for this identifier. This list is maintained sorted on the document identifier field.

Note that execution of the second for statement causes the list to “merge” with the list . Since the lists and are sorted, the time taken for this merge is proportional to the sum of the lengths of the two lists. Thus the algorithm runs in time (at most) proportional to times the sum total of the number of document identifiers corresponding to each keyword in .