Topic: 19.2 Distributed Data Storage

Core Idea: How to store data across multiple sites (machines) in a distributed database system.

Key Goals:

Transparency: Users should not need to know where or how data is stored (ideally).
Availability: Data should be accessible even if some sites fail.
Performance: Minimize data access time (consider network latency).
Scalability: Handle increasing amounts of data and users.

Main Techniques:

Replication:
- Concept: Store copies (replicas) of data at multiple sites.
- Advantages:
  - Increased availability: If one site fails, data can be accessed from another.
  - Improved performance: Reads can be served from a closer/faster replica.
- Disadvantages:
  - Increased storage cost.
  - Update complexity: Maintaining consistency across replicas requires careful handling (e.g., using commit protocols or lazy propagation).
- Types:
  - Full replication: Every site has a copy of all data.
  - Partial replication: Only some sites have copies of certain data.
- Primary Copy: One replica is designated as the primary, and all updates are directed there first.
Fragmentation:
- Concept: Divide a relation (table) into smaller fragments and store them at different sites.
- Advantages:
  - Improved performance: Data can be stored closer to where it’s frequently accessed.
  - Reduced storage overhead at each site (compared to full replication).
  - Increased parallelism: Queries can be processed in parallel on different fragments.
- Disadvantages:
  - More complex queries: Retrieving data might require joining fragments from multiple sites.
  - Increased complexity of transaction management.
- Types:
  - Horizontal Fragmentation:
    - Divide a relation based on rows (tuples).
    - Example: Split a Customers table into Customers_North, Customers_South, etc., based on location.
    - Each fragment has the same schema as the original relation.
    - Usually, selection conditions are used to perform horizontal fragmentation.
    - The original relation can be reconstructed using the union operation.
  - Vertical Fragmentation:
    - Divide a relation based on columns (attributes).
    - Example: Split an Employee table into Employee_Names (ID, Name), Employee_Salaries (ID, Salary).
    - Each fragment has a subset of the original relation’s schema.
    - Usually, a common candidate key (or tuple-ids) is included in all fragments to enable reconstruction.
    - The original relation can be reconstructed using the join operation.

Transparency (ideally, users don’t need to know):

Fragmentation Transparency: Users query the original relation, not individual fragments.
Replication Transparency: Users are unaware of data replication.
Location Transparency: Users don’t need to know the physical location of data.

Naming:

Challenge: Ensuring unique names for data items (relations, fragments, replicas) across the distributed system.
Solutions:
- Central name server: (Not recommended - single point of failure, bottleneck).
- Site prefixes: Each site prefixes its generated names with a unique site identifier (e.g., site1.account).
- Aliases: User-friendly names mapped to the actual (prefixed) names.

Important Considerations:

Data Distribution Strategy: How to decide which data to replicate and/or fragment, and where to store it. This depends on factors like access patterns, update frequency, site characteristics, and network topology.
Directory/Catalog: The system needs a way to keep track of where data is stored (fragments, replicas) and how to access it. This information is typically stored in a distributed catalog or directory.
Trade-offs: There are trade-offs between different approaches. For example, replication improves availability but increases update complexity. Fragmentation can improve performance but makes queries more complex.

2ndSem

Explorer

19.2 Distributed Data Storage

Topic: 19.2 Distributed Data Storage

Graph View

Backlinks