1. PARALLEL AND DISTRIBUTED DATABASES

  • Parallel Database: Improves performance by parallelizing operations (loading data, building indexes, evaluating queries).
  • Distributed Database: Data stored across multiple independent sites.

1.1 ARCHITECTURES FOR PARALLEL DATABASES

  • Shared Memory: CPUs share common memory.
  • Shared Disk: CPUs have private memory but shared access to disks.
  • Shared Nothing: CPUs have independent memory and disk; communicate via network. Most scalable.
    • Advantages: Linear Speed-Up (the same problem finishes proportionally faster as resources are added) and Linear Scale-Up (a proportionally larger problem finishes in the same time on proportionally more resources).
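
  To make the two metrics concrete, here is a minimal sketch of how they are usually computed; the timing numbers are hypothetical, not taken from these notes.

    def speed_up(time_small_system: float, time_big_system: float) -> float:
        """Same problem, more resources: linear speed-up means doubling the
        resources roughly halves the elapsed time."""
        return time_small_system / time_big_system

    def scale_up(time_small_problem_small_system: float,
                 time_big_problem_big_system: float) -> float:
        """Problem size and resources grow together: linear scale-up means
        this ratio stays close to 1 (elapsed time stays roughly constant)."""
        return time_small_problem_small_system / time_big_problem_big_system

    print(speed_up(100.0, 26.0))   # e.g. 100 s on 1 node vs. 26 s on 4 nodes -> ~3.8 (near linear)
    print(scale_up(100.0, 108.0))  # e.g. 1x data on 1 node vs. 4x data on 4 nodes -> ~0.93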

1.2 PARALLEL QUERY EVALUATION

  • Relational Query Execution Plan: Graph of relational algebra operators.
  • Pipelined Parallelism: One operator consumes the output of another as it is produced, so producer and consumer run concurrently.
  • Data Partitioning: Divide data horizontally across disks for parallel processing.
    • Round Robin: Assign tuples to processors in a round-robin fashion.
    • Hash: Use a hash function to assign tuples to processors.
    • Range: Partition data based on ranges of sort key values.
      • Data Skew: Uneven distribution of tuples across partitions (e.g., from poorly chosen range boundaries), so some processors become bottlenecks.
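
  The three partitioning schemes can be sketched in a few lines of Python; the tuples, partition counts, and range boundaries below are made up for illustration.

    def round_robin(tuples, n):
        parts = [[] for _ in range(n)]
        for i, t in enumerate(tuples):
            parts[i % n].append(t)              # i-th tuple goes to partition i mod n
        return parts

    def hash_partition(tuples, n, key):
        parts = [[] for _ in range(n)]
        for t in tuples:
            parts[hash(t[key]) % n].append(t)   # equal key values land in the same partition
        return parts

    def range_partition(tuples, boundaries, key):
        # boundaries like [30, 70] split the key domain into 3 ranges; skewed data
        # or badly chosen boundaries overload some partitions (data skew).
        parts = [[] for _ in range(len(boundaries) + 1)]
        for t in tuples:
            parts[sum(1 for b in boundaries if t[key] > b)].append(t)
        return parts

    emps = [{"eid": i, "sal": 10 * i} for i in range(1, 11)]
    print(round_robin(emps, 3))
    print(hash_partition(emps, 3, "eid"))
    print(range_partition(emps, [30, 70], "sal"))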

1.3 PARALLELIZING INDIVIDUAL OPERATIONS

  • Bulk Loading and Scanning: Read pages in parallel; merge results.
  • Sorting: Redistribute tuples using range partitioning, then sort each partition locally; reading the sorted partitions in range order yields the fully sorted result.
  • Joins: Partition both relations on the join column with the same hash or range function, so matching tuples land at the same processor; then perform the smaller joins in parallel (see the sketch below).
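
  A minimal sketch of a partitioned parallel join, assuming two small in-memory relations and a common join column named "eid" (all names invented for illustration). Because both inputs are split with the same hash function on the join column, matching tuples always end up in the same partition, so each partition pair can be joined independently.

    def hash_partition(tuples, n, key):
        parts = [[] for _ in range(n)]
        for t in tuples:
            parts[hash(t[key]) % n].append(t)
        return parts

    def local_join(r_part, s_part, key):
        # Ordinary in-memory hash join on one pair of partitions.
        index = {}
        for r in r_part:
            index.setdefault(r[key], []).append(r)
        return [{**r, **s} for s in s_part for r in index.get(s[key], [])]

    def parallel_join(r, s, key, n=3):
        r_parts = hash_partition(r, n, key)
        s_parts = hash_partition(s, n, key)
        out = []
        for i in range(n):                      # conceptually, one processor per i
            out.extend(local_join(r_parts[i], s_parts[i], key))
        return out

    emp    = [{"eid": 1, "name": "a"}, {"eid": 2, "name": "b"}]
    rating = [{"eid": 1, "score": 7},  {"eid": 2, "score": 9}]
    print(parallel_join(emp, rating, "eid"))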

1.4 DISTRIBUTED DATABASES

  • Goal: Data physically distributed but logically integrated (transparent to user).
  • Properties:
    • Distributed Data Independence: Access data without knowing location.
    • Distributed Transaction Atomicity: Transactions atomic across sites.
  • Types:
    • Homogeneous: Same DBMS software at all sites.
    • Heterogeneous: Different DBMS software at different sites.
  • Architectures:
    • Client-Server: Clients handle UI; servers manage data and transactions.
      • Advantages: Relatively simple to implement, makes good use of client machines' resources (e.g., for the GUI), and gives users a better interface.
    • Collaborating Server: Servers cooperate to execute transactions spanning multiple sites.
    • Middleware: Software layer coordinating queries/transactions across independent servers.

1.5 STORING DATA IN DDBS

  • Fragmentation: Break a relation into smaller fragments stored at different sites (see the sketch after this list).
    • Horizontal: Fragments are subsets of rows.
    • Vertical: Fragments are subsets of columns.
  • Replication: Store copies of relations/fragments at multiple sites.
    • Advantages: Increased data availability, faster query evaluation.
  • Distributed Catalog Management:
    • Naming Objects: Combine a local name with a birth-site field so that every relation (and each of its fragments and replicas) has a globally unique name.
    • Catalog Structure: Either centralized (one site holds the entire catalog) or fully distributed (e.g., each object's birth site tracks where its fragments and replicas live).
    • Distributed Data Independence: User should be able to query without knowing the location.
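
  A minimal sketch of the two kinds of fragmentation, using an Employees relation and fragmentation criteria invented purely for illustration.

    employees = [
        {"eid": 1, "name": "a", "city": "Madras", "sal": 50},
        {"eid": 2, "name": "b", "city": "Pune",   "sal": 60},
        {"eid": 3, "name": "c", "city": "Madras", "sal": 70},
    ]

    # Horizontal fragmentation: each fragment is a subset of ROWS (a selection),
    # typically stored at the site that uses those rows most.
    frag_madras = [t for t in employees if t["city"] == "Madras"]
    frag_pune   = [t for t in employees if t["city"] == "Pune"]

    # Vertical fragmentation: each fragment is a subset of COLUMNS (a projection);
    # the key (eid) is kept in every fragment so the relation can be rebuilt by a join.
    frag_public  = [{"eid": t["eid"], "name": t["name"]} for t in employees]
    frag_payroll = [{"eid": t["eid"], "sal": t["sal"]}   for t in employees]

    # Reconstruction check: the union of the horizontal fragments gives back all rows.
    assert sorted(t["eid"] for t in frag_madras + frag_pune) == [1, 2, 3]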

1.5.4 Distributed Query Processing:

  • Factors: Besides local disk I/O and CPU costs, the cost of shipping data over the network is significant.
  • Optimization Goal: Reduce data transfer.
  • Nonjoin Queries: Selections and projections are evaluated at the sites holding the relevant fragments; only qualifying tuples are shipped to the query site.
  • Joins: Potentially very expensive when the operand relations are at different sites, so the join site and technique must be chosen carefully.

1.5.4.3 Ship to One Site:

  • Compute the join at a single site by shipping one relation (usually the smaller) to the other's site, or both to the site that needs the result; the best choice depends on relation sizes and where the answer is needed.
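
  A rough cost comparison for shipping to one site; the page counts, per-page shipping cost, and local join cost below are illustrative numbers, not figures from these notes.

    def ship_to_one_site_cost(pages_shipped, per_page_ship_cost, local_join_cost):
        # Total cost = cost of moving one relation across the network
        #            + cost of joining the two relations at the receiving site.
        return pages_shipped * per_page_ship_cost + local_join_cost

    # Hypothetical sizes: R has 500 pages at site 1, S has 1000 pages at site 2.
    ship_R_to_site2 = ship_to_one_site_cost(500,  10, 4500)
    ship_S_to_site1 = ship_to_one_site_cost(1000, 10, 4500)
    print(ship_R_to_site2, ship_S_to_site1)   # shipping the smaller relation is cheaper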

1.5.4.4 Semijoins and Bloomjoins:

  • Semijoin: Reduce the data shipped by sending only the tuples that can actually join.
    • Steps (joining R at site 1 with S at site 2): 1. Project R onto the join columns and ship the projection to site 2. 2. Join the projection with S at site 2, keeping only the S tuples that can match. 3. Ship the reduced S back to site 1 and complete the join there.
  • Bloomjoin: Like a semijoin, but site 1 ships a compact bit-vector built by hashing the join-column values; site 2 discards tuples whose hash bit is not set (a few false positives may slip through and are removed by the final join).
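
  The two reductions can be sketched as follows, for a join of R (at site 1) with S (at site 2) on a column "k"; the relations, the column name, and the tiny 8-bit vector are invented for illustration.

    R = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}, {"k": 3, "a": "z"}]   # at site 1
    S = [{"k": 2, "b": "p"}, {"k": 3, "b": "q"}, {"k": 9, "b": "r"}]   # at site 2

    def local_join(r, s, key):
        return [{**t, **u} for t in r for u in s if t[key] == u[key]]

    # Semijoin: ship only the projection of R on the join column.
    proj = {t["k"] for t in R}                     # 1. projection shipped to site 2
    reduced_S = [u for u in S if u["k"] in proj]   # 2. keep the S tuples that can match
    print(local_join(R, reduced_S, "k"))           # 3. reduced S shipped back, joined at site 1

    # Bloomjoin: ship a bit-vector instead of the projection.
    BITS = 8
    bitvec = [False] * BITS
    for t in R:
        bitvec[hash(t["k"]) % BITS] = True                       # 1. bit-vector shipped
    candidate_S = [u for u in S if bitvec[hash(u["k"]) % BITS]]  # 2. may keep false positives
    print(local_join(R, candidate_S, "k"))                       # 3. candidates shipped back, joined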

1.5.4.5 Cost-Based Query Optimization:

  • Challenges: Communication cost, data replication, site autonomy.

1.6 DISTRIBUTED CONCURRENCY CONTROL AND RECOVERY

  • Issues:
    • Distributed Concurrency Control: Manage locks, detect deadlocks across sites.
    • Distributed Recovery: Ensure atomicity of transactions across sites.
  • Problems: Multiple data copies, site failures, communication link failures, distributed commit, distributed deadlock.
  • Lock Management:
    • Centralized: Single site manages all locks.
    • Primary Copy: Lock managed at site with primary copy.
    • Fully Distributed: Lock managed at site with the copy being accessed.
  • Distributed Deadlock Detection:
    • Centralized: Periodically ship each site's local waits-for graph to a central site, which unions them and checks for a cycle (see the sketch after this list).
    • Hierarchical: Group sites hierarchically; construct waits-for graphs at each level.
    • Timeout (simple alternative): Abort a transaction if it has waited longer than a chosen timeout period, on the assumption that it is deadlocked.
  • Distributed Recovery:
    • Goal: All sub-transactions commit or none commit.
    • Two-Phase Commit (2PC) Protocol:
      • Phase 1 (Prepare/Voting): The coordinator sends a prepare message; each subordinate force-writes a prepare (or abort) log record and votes yes or no.
      • Phase 2 (Commit/Abort): If every vote is yes the coordinator commits, otherwise it aborts; it logs the decision, notifies the subordinates, and they force-write the decision and acknowledge.
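
  A minimal sketch of centralized distributed deadlock detection, as referenced above: each site's local waits-for edges are unioned into a global graph and checked for a cycle. Site names, transaction ids, and edges are invented for illustration.

    def has_cycle(edges):
        """Depth-first search for a cycle in a waits-for graph {txn: set(txns)}."""
        WHITE, GREY, BLACK = 0, 1, 2
        colour = {}
        def visit(n):
            colour[n] = GREY
            for m in edges.get(n, ()):
                if colour.get(m, WHITE) == GREY:
                    return True                      # back edge -> deadlock
                if colour.get(m, WHITE) == WHITE and visit(m):
                    return True
            colour[n] = BLACK
            return False
        return any(colour.get(n, WHITE) == WHITE and visit(n) for n in list(edges))

    # Local waits-for graphs periodically shipped to the coordinator site:
    site_A = {"T1": {"T2"}}      # at site A, T1 waits for T2
    site_B = {"T2": {"T1"}}      # at site B, T2 waits for T1

    global_graph = {}
    for local in (site_A, site_B):
        for txn, waits in local.items():
            global_graph.setdefault(txn, set()).update(waits)

    print(has_cycle(global_graph))   # True: a deadlock that neither site can see locally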
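
  And a minimal sketch of the 2PC message flow just described, with a Python list standing in for the force-written log at each site; the class, method names, and "can_commit" flag are invented for illustration, and failure handling is omitted.

    class Subordinate:
        def __init__(self, name, can_commit=True):
            self.name, self.can_commit, self.log = name, can_commit, []

        def prepare(self):
            # Phase 1: force-write a prepare (or abort) record, then vote.
            self.log.append("prepare" if self.can_commit else "abort")
            return "yes" if self.can_commit else "no"

        def decide(self, decision):
            # Phase 2: force-write the coordinator's decision and acknowledge.
            self.log.append(decision)
            return "ack"

    def two_phase_commit(coordinator_log, subordinates):
        votes = [s.prepare() for s in subordinates]             # Phase 1 (voting)
        decision = "commit" if all(v == "yes" for v in votes) else "abort"
        coordinator_log.append(decision)                        # decision logged before notifying
        acks = [s.decide(decision) for s in subordinates]       # Phase 2 (decision)
        if all(a == "ack" for a in acks):
            coordinator_log.append("end")                       # transaction is fully complete
        return decision

    log = []
    subs = [Subordinate("site1"), Subordinate("site2", can_commit=False)]
    print(two_phase_commit(log, subs), log)                     # one "no" vote -> abort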