Topic: 19.2 Distributed Data Storage
Core Idea: How to store data across multiple sites (machines) in a distributed database system.
Key Goals:
- Transparency: Users should not need to know where or how data is stored (ideally).
- Availability: Data should be accessible even if some sites fail.
- Performance: Minimize data access time (consider network latency).
- Scalability: Handle increasing amounts of data and users.
Main Techniques:
-
Replication:
- Concept: Store copies (replicas) of data at multiple sites.
- Advantages:
- Increased availability: If one site fails, data can be accessed from another.
- Improved performance: Reads can be served from a closer/faster replica.
- Disadvantages:
- Increased storage cost.
- Update complexity: Maintaining consistency across replicas requires careful handling (e.g., using commit protocols or lazy propagation).
- Types:
- Full replication: Every site has a copy of all data.
- Partial replication: Only some sites have copies of certain data.
- Primary Copy: One replica is designated as the primary, and all updates are directed there first.
-
Fragmentation:
- Concept: Divide a relation (table) into smaller fragments and store them at different sites.
- Advantages:
- Improved performance: Data can be stored closer to where it’s frequently accessed.
- Reduced storage overhead at each site (compared to full replication).
- Increased parallelism: Queries can be processed in parallel on different fragments.
- Disadvantages:
- More complex queries: Retrieving data might require joining fragments from multiple sites.
- Increased complexity of transaction management.
- Types:
- Horizontal Fragmentation:
- Divide a relation based on rows (tuples).
- Example: Split a
Customerstable intoCustomers_North,Customers_South, etc., based on location. - Each fragment has the same schema as the original relation.
- Usually, selection conditions are used to perform horizontal fragmentation.
- The original relation can be reconstructed using the union operation.
- Vertical Fragmentation:
- Divide a relation based on columns (attributes).
- Example: Split an
Employeetable intoEmployee_Names(ID, Name),Employee_Salaries(ID, Salary). - Each fragment has a subset of the original relation’s schema.
- Usually, a common candidate key (or tuple-ids) is included in all fragments to enable reconstruction.
- The original relation can be reconstructed using the join operation.
- Horizontal Fragmentation:
Transparency (ideally, users don’t need to know):
- Fragmentation Transparency: Users query the original relation, not individual fragments.
- Replication Transparency: Users are unaware of data replication.
- Location Transparency: Users don’t need to know the physical location of data.
Naming:
- Challenge: Ensuring unique names for data items (relations, fragments, replicas) across the distributed system.
- Solutions:
- Central name server: (Not recommended - single point of failure, bottleneck).
- Site prefixes: Each site prefixes its generated names with a unique site identifier (e.g.,
site1.account). - Aliases: User-friendly names mapped to the actual (prefixed) names.
Important Considerations:
- Data Distribution Strategy: How to decide which data to replicate and/or fragment, and where to store it. This depends on factors like access patterns, update frequency, site characteristics, and network topology.
- Directory/Catalog: The system needs a way to keep track of where data is stored (fragments, replicas) and how to access it. This information is typically stored in a distributed catalog or directory.
- Trade-offs: There are trade-offs between different approaches. For example, replication improves availability but increases update complexity. Fragmentation can improve performance but makes queries more complex.