19.7 Distributed Query Processing

Distributed query processing deals with optimizing queries that access data from multiple sites in a distributed database. Unlike centralized systems, which primarily focus on minimizing disk accesses, distributed systems must also consider:

Network transmission costs: The cost of sending data between sites.
Parallel processing benefits: The performance gains from executing parts of a query concurrently at different sites.

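The relative cost of network transmission and disk access varies widely with the type of network and the speed of the disks, so the optimizer cannot minimize either one alone; it must find a good trade-off between the two.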

19.7.1 Query Transformation

Goal: Convert a user’s query into an equivalent form that can be executed efficiently across distributed sites. The presence of data fragmentation and replication significantly complicates this.

Example Scenario

Consider the query “Find all tuples in the account relation.”

Replication: If account is replicated, choose the replica with the lowest transmission cost.
Fragmentation: If account is fragmented, the optimization is more complex because multiple joins or unions might be needed to reconstruct the relation.

Fragmentation Transparency Example

The user should be able to use the simple query:

SELECT *
FROM account
WHERE branch_name = 'Hillside';

If the account relation is horizontally fragmented, the query optimizer should exploit that fact. For example, suppose account is defined as $\mathit{account}_{1} \cup \mathit{account}_{2}$.

The system transforms this into:

$σ_{\mathit{branch\_name} = \text{"Hillside"}} (\mathit{account}_{1} \cup \mathit{account}_{2})$

Pushing the selection through the union with standard query-optimization techniques gives: $σ_{\mathit{branch\_name} = \text{"Hillside"}} (\mathit{account}_{1}) \cup σ_{\mathit{branch\_name} = \text{"Hillside"}} (\mathit{account}_{2})$.

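Since $\mathit{account}_{1}$ contains only tuples for “Hillside” and $\mathit{account}_{2}$ contains only tuples for “Valleyview”, the second operand of the union is empty, and the selection on $\mathit{account}_{1}$ removes nothing.

Final Optimized Strategy: Simply return $\mathit{account}_{1}$.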
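The pruning step above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `FRAGMENTS` catalog and the `prune_fragments` helper are hypothetical names, and each fragment is assumed to be described by a single equality predicate, whereas real fragment predicates can be arbitrary.

```python
# Hypothetical catalog: each horizontal fragment is described by the single
# branch_name value that all of its tuples carry (an assumption of this sketch).
FRAGMENTS = {
    "account_1": {"branch_name": "Hillside"},
    "account_2": {"branch_name": "Valleyview"},
}


def prune_fragments(fragments, attribute, value):
    """Push an equality selection through the union of fragments.

    Returns (kept, needs_filter): the fragments that can still contain
    matching tuples, and whether the selection must still be applied to them.
    """
    kept = []
    needs_filter = False
    for name, predicate in fragments.items():
        if attribute not in predicate:
            # The fragment says nothing about this attribute: keep it and filter.
            kept.append(name)
            needs_filter = True
        elif predicate[attribute] == value:
            # Every tuple already satisfies the selection: no filter needed.
            kept.append(name)
        # Otherwise the two predicates contradict each other, so this
        # operand of the union is empty and is dropped.
    return kept, needs_filter


plan = prune_fragments(FRAGMENTS, "branch_name", "Hillside")
print(plan)  # (['account_1'], False)
```

For the Hillside query the sketch keeps only `account_1` and reports that no further filtering is needed, which matches the final strategy above.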

19.7.2 Simple Join Processing

Goal: Efficiently compute joins across relations stored at different sites.

Example: Three-Relation Join

Consider: $\mathit{account} ⋈ \mathit{depositor} ⋈ \mathit{branch}$

Assume:

account is at site $S_{1}$ .
depositor is at site $S_{2}$ .
branch is at $S_{3}$ .
The result is needed at $S_{1}$ .

Possible Strategies

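Ship all to $S_{1}$: Send copies of the remote relations (depositor and branch) to $S_{1}$, where account already resides, and perform the join locally. Indexes are not shipped with a relation, so any index that was useful at $S_{2}$ or $S_{3}$ must be re-created at $S_{1}$, which costs extra processing and disk accesses.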
Ship and Join Incrementally:
- Ship account to $S_{2}$ .
- Compute $\mathit{temp}_{1} = \mathit{account} ⋈ \mathit{depositor}$ at $S_{2}$.
- Ship $\mathit{temp}_{1}$ to $S_{3}$.
- Compute $\mathit{temp}_{2} = \mathit{temp}_{1} ⋈ \mathit{branch}$ at $S_{3}$.
- Ship $\mathit{temp}_{2}$ to $S_{1}$.
Other permutations of the roles of $S_{1}$, $S_{2}$, and $S_{3}$ in the above strategies.

Choosing the Best Strategy

There’s no single best strategy. Factors to consider include:

Data volumes: How much data needs to be shipped between sites?
Transmission costs: Cost per block transferred between sites.
Processing speeds: Relative processing power of each site.
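Presence of indexes: An index at a relation's home site can speed up a join computed there. A site that receives a shipped relation must either re-create the index or fall back on a different, possibly more expensive, join method.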
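To see how data volume alone can tip the choice, the following sketch compares the transmission cost of the two strategies. All sizes and per-block costs are made up for illustration, the helper names `ship_all_cost` and `chain_cost` are hypothetical, and local processing and index costs are deliberately ignored.

```python
def ship_all_cost(placement, cost_per_block, result_site):
    """Cost of shipping every remote relation straight to result_site.

    placement maps relation name -> (site, size in blocks).
    cost_per_block maps (from_site, to_site) -> cost of moving one block.
    """
    return sum(
        blocks * cost_per_block[(site, result_site)]
        for site, blocks in placement.values()
        if site != result_site
    )


def chain_cost(hops, cost_per_block):
    """Cost of a sequence of shipments given as (from_site, to_site, blocks)."""
    return sum(blocks * cost_per_block[(src, dst)] for src, dst, blocks in hops)


# All numbers below are invented for illustration.
sites = ["S1", "S2", "S3"]
uniform = {(a, b): 1 for a in sites for b in sites if a != b}
placement = {"account": ("S1", 1000), "depositor": ("S2", 1200), "branch": ("S3", 50)}

ship_all = ship_all_cost(placement, uniform, "S1")

# Incremental strategy: account -> S2, temp1 -> S3, temp2 -> S1.
# The intermediate sizes are guesses; a real optimizer would estimate them.
large_intermediates = chain_cost(
    [("S1", "S2", 1000), ("S2", "S3", 1500), ("S3", "S1", 1500)], uniform
)
small_intermediates = chain_cost(
    [("S1", "S2", 1000), ("S2", "S3", 10), ("S3", "S1", 10)], uniform
)

print(ship_all, large_intermediates, small_intermediates)  # 1250 4000 1020
```

With large intermediate results the incremental strategy ships far more data than shipping everything to $S_{1}$; with small ones it ships less. Neither strategy wins in general.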

19.7.3 Semijoin Strategy

Goal: Reduce data transmission costs by eliminating tuples that don’t contribute to the final join result before shipping.

The Semijoin Operator ( $⋉$ )

The semijoin of $r_{1}$ with $r_{2}$ , denoted as $r_{1} ⋉ r_{2}$ , is defined as:

$r_{1} ⋉ r_{2} = Π_{R_{1}} (r_{1} ⋈ r_{2})$

Where:

$R_{1}$ is the schema (set of attributes) of $r_{1}$ .
The semijoin selects the tuples in $r_{1}$ that participate in the join with $r_{2}$ .
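As a minimal sketch, the definition can be executed literally on small in-memory relations. Relations are modeled here as lists of dictionaries; that representation and the helper names `natural_join`, `project`, and `semijoin` are assumptions of this example, not part of the original notes.

```python
def natural_join(r1, r2):
    """Natural join of two relations given as lists of dicts (nested loops)."""
    result = []
    for t1 in r1:
        for t2 in r2:
            common = t1.keys() & t2.keys()
            if all(t1[a] == t2[a] for a in common):
                result.append({**t1, **t2})
    return result


def project(r, attributes):
    """Projection with duplicate elimination; keeps first-seen order."""
    attrs = sorted(attributes)
    seen, result = set(), []
    for t in r:
        key = tuple((a, t[a]) for a in attrs)
        if key not in seen:
            seen.add(key)
            result.append(dict(key))
    return result


def semijoin(r1, r2):
    """r1 semijoin r2, computed as the projection of (r1 join r2) onto R1."""
    if not r1:
        return []
    schema_r1 = r1[0].keys()
    return project(natural_join(r1, r2), schema_r1)


r1 = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}, {"a": 3, "b": "z"}]
r2 = [{"a": 1, "c": 10}, {"a": 1, "c": 11}, {"a": 3, "c": 30}]

print(semijoin(r1, r2))  # [{'a': 1, 'b': 'x'}, {'a': 3, 'b': 'z'}]
```

The tuple with `a = 2` has no partner in `r2` and is dropped, and the tuple with `a = 1` appears once even though it matches two tuples of `r2`.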

Semijoin Strategy Example ( $r_{1} ⋈ r_{2}$ )

$r_{1}$ is stored at site $S_{1}$ , and $r_{2}$ is stored at site $S_{2}$ . The result should be at $S_{1}$ .

Compute temp1: $\mathit{temp}_{1} \leftarrow Π_{R_{1} \cap R_{2}} (r_{1})$ at $S_{1}$ (project $r_{1}$ onto the attributes it shares with $r_{2}$).
Ship temp1: Send $\mathit{temp}_{1}$ from $S_{1}$ to $S_{2}$.
Compute temp2: $\mathit{temp}_{2} \leftarrow r_{2} ⋉ \mathit{temp}_{1}$ at $S_{2}$ (semijoin of $r_{2}$ with $\mathit{temp}_{1}$).
Ship temp2: Send $\mathit{temp}_{2}$ from $S_{2}$ to $S_{1}$.
Compute Final Result: $r_{1} ⋈ \mathit{temp}_{2}$ at $S_{1}$.
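The five steps above can be traced on toy data. The sketch below simulates both sites in one process and counts the tuples that cross the network; the list-of-dicts representation, the helper names, and the sample relations are assumptions made for illustration, and both relations are assumed to be non-empty.

```python
def natural_join(r1, r2):
    """Natural join of two relations given as lists of dicts (nested loops)."""
    return [
        {**t1, **t2}
        for t1 in r1
        for t2 in r2
        if all(t1[a] == t2[a] for a in t1.keys() & t2.keys())
    ]


def project(r, attributes):
    """Projection with duplicate elimination; keeps first-seen order."""
    attrs = sorted(attributes)
    seen, result = set(), []
    for t in r:
        key = tuple((a, t[a]) for a in attrs)
        if key not in seen:
            seen.add(key)
            result.append(dict(key))
    return result


def semijoin_strategy(r1, r2):
    """Follow the five steps; return (result, tuples sent to S2, tuples sent to S1)."""
    common = r1[0].keys() & r2[0].keys()
    temp1 = project(r1, common)                              # step 1, at S1
    sent_to_s2 = len(temp1)                                  # step 2, S1 -> S2
    temp2 = project(natural_join(r2, temp1), r2[0].keys())   # step 3, at S2
    sent_to_s1 = len(temp2)                                  # step 4, S2 -> S1
    result = natural_join(r1, temp2)                         # step 5, at S1
    return result, sent_to_s2, sent_to_s1


r1 = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
r2 = [
    {"a": 1, "c": "p"},
    {"a": 3, "c": "q"},
    {"a": 4, "c": "r"},
    {"a": 5, "c": "s"},
    {"a": 6, "c": "t"},
]

result, sent_to_s2, sent_to_s1 = semijoin_strategy(r1, r2)
print(result)                  # [{'a': 1, 'b': 'x', 'c': 'p'}]
print(sent_to_s2, sent_to_s1)  # 2 1
```

Here three tuples cross the network in total, compared with five if $r_{2}$ were shipped whole, and the result equals the direct join of the two relations.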

Correctness

The final result ( $r_{1} ⋈ \mathit{temp}_{2}$ ) is equivalent to $r_{1} ⋈ r_{2}$ because join is associative and commutative, and $Π_{R_{1} \cap R_{2}} (r_{1}) ⋈ r_{1} = r_{1}$.
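Spelled out step by step, the argument can be sketched as follows, using only the definition of the semijoin and the properties just named:

$$
\begin{aligned}
r_{1} ⋈ \mathit{temp}_{2}
  &= r_{1} ⋈ (r_{2} ⋉ \mathit{temp}_{1}) \\
  &= r_{1} ⋈ Π_{R_{2}} (r_{2} ⋈ \mathit{temp}_{1}) \\
  &= r_{1} ⋈ (r_{2} ⋈ Π_{R_{1} \cap R_{2}} (r_{1})) \\
  &= (Π_{R_{1} \cap R_{2}} (r_{1}) ⋈ r_{1}) ⋈ r_{2} \\
  &= r_{1} ⋈ r_{2}
\end{aligned}
$$

The second line applies the definition of the semijoin. The third holds because every attribute of $\mathit{temp}_{1}$ already belongs to $R_{2}$, so $r_{2} ⋈ \mathit{temp}_{1}$ has schema $R_{2}$ and the projection onto $R_{2}$ changes nothing. The fourth regroups the joins by associativity and commutativity, and the last uses the identity $Π_{R_{1} \cap R_{2}} (r_{1}) ⋈ r_{1} = r_{1}$.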

Efficiency

This strategy pays off when only a small fraction of the tuples in $r_{2}$ contribute to the join. In that case $\mathit{temp}_{2}$ is much smaller than $r_{2}$, and the saving from shipping $\mathit{temp}_{2}$ instead of $r_{2}$ outweighs the extra cost of shipping $\mathit{temp}_{1}$ to $S_{2}$. This situation is most likely when $r_{1}$ is itself the result of a selection, because few join-attribute values then survive into $\mathit{temp}_{1}$.
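As a rough rule of thumb, assume that transmission cost is proportional to the amount of data shipped and ignore local processing. Writing $|r|$ for the size of relation $r$ (a notation introduced here for illustration), the semijoin strategy then ships less data than sending $r_{2}$ whole to $S_{1}$ exactly when:

$$
|\mathit{temp}_{1}| + |\mathit{temp}_{2}| < |r_{2}|
$$

The left-hand side is the traffic of the two shipping steps of the semijoin strategy; the right-hand side is the traffic of the plain approach of shipping $r_{2}$ to $S_{1}$.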

19.7.4 Join Strategies that Exploit Parallelism

Goal: Perform join operations concurrently at multiple sites to reduce overall query execution time.

Example: Four-Relation Join

Consider: $r_{1} ⋈ r_{2} ⋈ r_{3} ⋈ r_{4}$

$r_{i}$ is stored at site $S_{i}$ .
Result is needed at $S_{1}$ .

Parallel Join Strategy

Parallel Joins:
- Ship $r_{1}$ to $S_{2}$ and compute $r_{1} ⋈ r_{2}$ at $S_{2}$ .
- Simultaneously, ship $r_{3}$ to $S_{4}$ and compute $r_{3} ⋈ r_{4}$ at $S_{4}$ .
Pipelining:
- $S_{2}$ sends tuples of $(r_{1} ⋈ r_{2})$ to $S_{1}$ as they are produced (pipelining).
- $S_{4}$ sends tuples of $(r_{3} ⋈ r_{4})$ to $S_{1}$ as they are produced.
Final Join: $S_{1}$ starts computing $(r_{1} ⋈ r_{2}) ⋈ (r_{3} ⋈ r_{4})$ as soon as it receives tuples from $S_{2}$ and $S_{4}$ .

Advantages

Concurrency: Multiple joins are computed in parallel.
Reduced Latency: The final join at $S_{1}$ can begin before the intermediate joins at $S_{2}$ and $S_{4}$ are fully completed, thanks to pipelining.
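The strategy can be imitated in a single process, with threads standing in for sites $S_{2}$ and $S_{4}$ and a queue standing in for the network. This is only a sketch under those assumptions: the function names, the tagged-tuple protocol, and the sample relations are invented for illustration, and Python threads show the pipelined structure rather than real multi-site speedup.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of one input stream


def natural_join_stream(r_left, r_right):
    """Yield the tuples of the natural join one at a time (nested loops)."""
    for t1 in r_left:
        for t2 in r_right:
            if all(t1[a] == t2[a] for a in t1.keys() & t2.keys()):
                yield {**t1, **t2}


def site_worker(side, shipped, local, channel):
    """Stand-in for S2 or S4: join, and send each tuple as soon as it is produced."""
    for t in natural_join_stream(shipped, local):
        channel.put((side, t))
    channel.put((side, DONE))


def pipelined_join_at_s1(channel):
    """Symmetric join at S1 that consumes tuples from both sites as they arrive."""
    seen = {"left": [], "right": []}
    result, finished = [], 0
    while finished < 2:
        side, t = channel.get()
        if t is DONE:
            finished += 1
            continue
        other = "right" if side == "left" else "left"
        # Join the new tuple with everything already received from the other site.
        for u in seen[other]:
            if all(t[a] == u[a] for a in t.keys() & u.keys()):
                result.append({**t, **u})
        seen[side].append(t)
    return result


r1 = [{"a": 1, "b": 1}, {"a": 2, "b": 2}]
r2 = [{"b": 1, "c": 1}, {"b": 2, "c": 2}]
r3 = [{"c": 1, "d": 1}, {"c": 2, "d": 2}]
r4 = [{"d": 1, "e": 1}, {"d": 3, "e": 3}]

channel = queue.Queue()
workers = [
    threading.Thread(target=site_worker, args=("left", r1, r2, channel)),   # S2
    threading.Thread(target=site_worker, args=("right", r3, r4, channel)),  # S4
]
for w in workers:
    w.start()
result = pipelined_join_at_s1(channel)
for w in workers:
    w.join()

print(result)
```

Each pair of matching tuples is combined exactly once, at the moment the later of the two arrives, so $S_{1}$ never has to wait for either intermediate join to finish before producing output.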
