19.7 Distributed Query Processing

Distributed query processing deals with optimizing queries that access data from multiple sites in a distributed database. Unlike centralized systems, which primarily focus on minimizing disk accesses, distributed systems must also consider:

  • Network transmission costs: The cost of sending data between sites.
  • Parallel processing benefits: The performance gains from executing parts of a query concurrently at different sites.

The relative costs of network transmission and disk access vary, so a good trade-off is essential.

19.7.1 Query Transformation

Goal: Convert a user’s query into an equivalent form that can be executed efficiently across distributed sites. The presence of data fragmentation and replication significantly complicates this.

Example Scenario

Consider the query “Find all tuples in the account relation.”

  1. Replication: If account is replicated, choose the replica with the lowest transmission cost.

  2. Fragmentation: If account is fragmented, the optimization is more complex because multiple joins or unions might be needed to reconstruct the relation.

Fragmentation Transparency Example

The user should be able to use the simple query:

SELECT *
FROM account
WHERE branch_name = "Hillside";

If, the account relation is horizontally fragmented, the query optimizer needs to make use of that. For Example: account is defined as: .

The system transforms this into:

Using the query-optimization techniques: .

Since contains tuples for “Hillside” and for “Valleyview”, the second part of the union is empty.

Final Optimized Strategy: Simply return .

19.7.2 Simple Join Processing

Goal: Efficiently compute joins across relations stored at different sites.

Example: Three-Relation Join

Consider:

Assume:

  • account is at site .
  • depositor is at site .
  • branch is at .
  • The result is needed at .

Possible Strategies

  1. Ship all to : Send copies of all three relations to and perform the join locally. This might require recreating indexes if they exist at other sites, which is expensive.

  2. Ship and Join Incrementally:

    • Ship account to .
    • Compute at .
    • Ship to .
    • Compute at .
    • Ship to .
  3. Other permutations of the roles of S1, S2, and S3 in the above strategies.

Choosing the Best Strategy

There’s no single best strategy. Factors to consider include:

  • Data volumes: How much data needs to be shipped between sites?
  • Transmission costs: Cost per block transferred between sites.
  • Processing speeds: Relative processing power of each site.
  • Presence of Indexes: If present, can improve the computation, if not, will require creation of indexes.

19.7.3 Semijoin Strategy

Goal: Reduce data transmission costs by eliminating tuples that don’t contribute to the final join result before shipping.

The Semijoin Operator ()

The semijoin of with , denoted as , is defined as:

Where:

  • is the schema (set of attributes) of .
  • The semijoin selects the tuples in that participate in the join with .

Semijoin Strategy Example ()

is stored at site , and is stored at site . The result should be at .

  1. Compute temp1: at . (Project the common attributes of ).
  2. Ship temp1: Send from to .
  3. Compute temp2: at . (Semijoin of with ).
  4. Ship temp2: Send from to .
  5. Compute Final Result: at .

Correctness

The final result () is equivalent to because join is associative and commutative, and

Efficiency

The strategy to find is good if a small fraction of tuples in contribute to the join. In this case, is much smaller than , reducing network transmission costs. The cost of transmitting is usually outweighed by these savings. The saving is maximum when is the result of the selection.

19.7.4 Join Strategies that Exploit Parallelism

Goal: Perform join operations concurrently at multiple sites to reduce overall query execution time.

Example: Four-Relation Join

Consider:

  • is stored at site .
  • Result is needed at .

Parallel Join Strategy

  1. Parallel Joins:
    • Ship to and compute at .
    • Simultaneously, ship to and compute at .
  2. Pipelining:
    • sends tuples of to as they are produced (pipelining).
    • sends tuples of to as they are produced.
  3. Final Join: starts computing as soon as it receives tuples from and .

Advantages

  • Concurrency: Multiple joins are computed in parallel.
  • Reduced Latency: The final join at can begin before the intermediate joins at and are fully completed, thanks to pipelining.