Topic: 19.2 Distributed Data Storage

Core Idea: How to store data across multiple sites (machines) in a distributed database system.

Key Goals:

  • Transparency: Users should not need to know where or how data is stored (ideally).
  • Availability: Data should be accessible even if some sites fail.
  • Performance: Minimize data access time (consider network latency).
  • Scalability: Handle increasing amounts of data and users.

Main Techniques:

  1. Replication:

    • Concept: Store copies (replicas) of data at multiple sites.
    • Advantages:
      • Increased availability: If one site fails, data can be accessed from another.
      • Improved performance: Reads can be served from a closer/faster replica.
    • Disadvantages:
      • Increased storage cost.
      • Update complexity: Maintaining consistency across replicas requires careful handling (e.g., using commit protocols or lazy propagation).
    • Types:
      • Full replication: Every site has a copy of all data.
      • Partial replication: Only some sites have copies of certain data.
    • Primary Copy: One replica is designated as the primary, and all updates are directed there first.
  2. Fragmentation:

    • Concept: Divide a relation (table) into smaller fragments and store them at different sites.
    • Advantages:
      • Improved performance: Data can be stored closer to where it’s frequently accessed.
      • Reduced storage overhead at each site (compared to full replication).
      • Increased parallelism: Queries can be processed in parallel on different fragments.
    • Disadvantages:
      • More complex queries: Retrieving data might require joining fragments from multiple sites.
      • Increased complexity of transaction management.
    • Types:
      • Horizontal Fragmentation:
        • Divide a relation based on rows (tuples).
        • Example: Split a Customers table into Customers_North, Customers_South, etc., based on location.
        • Each fragment has the same schema as the original relation.
        • Usually, selection conditions are used to perform horizontal fragmentation.
        • The original relation can be reconstructed using the union operation.
      • Vertical Fragmentation:
        • Divide a relation based on columns (attributes).
        • Example: Split an Employee table into Employee_Names (ID, Name), Employee_Salaries (ID, Salary).
        • Each fragment has a subset of the original relation’s schema.
        • Usually, a common candidate key (or tuple-ids) is included in all fragments to enable reconstruction.
        • The original relation can be reconstructed using the join operation.

Transparency (ideally, users don’t need to know):

  • Fragmentation Transparency: Users query the original relation, not individual fragments.
  • Replication Transparency: Users are unaware of data replication.
  • Location Transparency: Users don’t need to know the physical location of data.

Naming:

  • Challenge: Ensuring unique names for data items (relations, fragments, replicas) across the distributed system.
  • Solutions:
    • Central name server: (Not recommended - single point of failure, bottleneck).
    • Site prefixes: Each site prefixes its generated names with a unique site identifier (e.g., site1.account).
    • Aliases: User-friendly names mapped to the actual (prefixed) names.

Important Considerations:

  • Data Distribution Strategy: How to decide which data to replicate and/or fragment, and where to store it. This depends on factors like access patterns, update frequency, site characteristics, and network topology.
  • Directory/Catalog: The system needs a way to keep track of where data is stored (fragments, replicas) and how to access it. This information is typically stored in a distributed catalog or directory.
  • Trade-offs: There are trade-offs between different approaches. For example, replication improves availability but increases update complexity. Fragmentation can improve performance but makes queries more complex.