19.7 Distributed Query Processing
Distributed query processing deals with optimizing queries that access data from multiple sites in a distributed database. Unlike centralized systems, which primarily focus on minimizing disk accesses, distributed systems must also consider:
- Network transmission costs: The cost of sending data between sites.
- Parallel processing benefits: The performance gains from executing parts of a query concurrently at different sites.
The relative costs of network transmission and disk access vary, so a good trade-off is essential.
19.7.1 Query Transformation
Goal: Convert a user’s query into an equivalent form that can be executed efficiently across distributed sites. The presence of data fragmentation and replication significantly complicates this.
Example Scenario
Consider the query “Find all tuples in the account relation.”
-
Replication: If
accountis replicated, choose the replica with the lowest transmission cost. -
Fragmentation: If
accountis fragmented, the optimization is more complex because multiple joins or unions might be needed to reconstruct the relation.
Fragmentation Transparency Example
The user should be able to use the simple query:
SELECT *
FROM account
WHERE branch_name = "Hillside";If, the account relation is horizontally fragmented, the query optimizer needs to make use of that. For Example:
account is defined as:
.
The system transforms this into:
Using the query-optimization techniques: .
Since contains tuples for “Hillside” and for “Valleyview”, the second part of the union is empty.
Final Optimized Strategy: Simply return .
19.7.2 Simple Join Processing
Goal: Efficiently compute joins across relations stored at different sites.
Example: Three-Relation Join
Consider:
Assume:
accountis at site .depositoris at site .branchis at .- The result is needed at .
Possible Strategies
-
Ship all to : Send copies of all three relations to and perform the join locally. This might require recreating indexes if they exist at other sites, which is expensive.
-
Ship and Join Incrementally:
- Ship
accountto . - Compute at .
- Ship to .
- Compute at .
- Ship to .
- Ship
-
Other permutations of the roles of S1, S2, and S3 in the above strategies.
Choosing the Best Strategy
There’s no single best strategy. Factors to consider include:
- Data volumes: How much data needs to be shipped between sites?
- Transmission costs: Cost per block transferred between sites.
- Processing speeds: Relative processing power of each site.
- Presence of Indexes: If present, can improve the computation, if not, will require creation of indexes.
19.7.3 Semijoin Strategy
Goal: Reduce data transmission costs by eliminating tuples that don’t contribute to the final join result before shipping.
The Semijoin Operator ()
The semijoin of with , denoted as , is defined as:
Where:
- is the schema (set of attributes) of .
- The semijoin selects the tuples in that participate in the join with .
Semijoin Strategy Example ()
is stored at site , and is stored at site . The result should be at .
- Compute
temp1: at . (Project the common attributes of ). - Ship
temp1: Send from to . - Compute
temp2: at . (Semijoin of with ). - Ship
temp2: Send from to . - Compute Final Result: at .
Correctness
The final result () is equivalent to because join is associative and commutative, and
Efficiency
The strategy to find is good if a small fraction of tuples in contribute to the join. In this case, is much smaller than , reducing network transmission costs. The cost of transmitting is usually outweighed by these savings. The saving is maximum when is the result of the selection.
19.7.4 Join Strategies that Exploit Parallelism
Goal: Perform join operations concurrently at multiple sites to reduce overall query execution time.
Example: Four-Relation Join
Consider:
- is stored at site .
- Result is needed at .
Parallel Join Strategy
- Parallel Joins:
- Ship to and compute at .
- Simultaneously, ship to and compute at .
- Pipelining:
- sends tuples of to as they are produced (pipelining).
- sends tuples of to as they are produced.
- Final Join: starts computing as soon as it receives tuples from and .
Advantages
- Concurrency: Multiple joins are computed in parallel.
- Reduced Latency: The final join at can begin before the intermediate joins at and are fully completed, thanks to pipelining.