Database Sharding and Consistent Hashing Explained

Database Sharding and Consistent Hashing

Introduction

Pain Point: Imagine your database queries slowing to a crawl as your table grows to millions of rows. Indexes bloat, writes bottleneck, and scaling vertically becomes prohibitively expensive. This is where sharding steps in as a last-resort solution.

Overview:
This lecture covers database sharding, a technique to horizontally split data across multiple servers. Key topics include consistent hashing, differences between horizontal partitioning and sharding, and a hands-on example using Postgres and NodeJS.

Real-World Failure Story: YouTube initially relied on a single MySQL server but faced crippling write loads as it scaled. They adopted application-level sharding, later transitioning to Vitess, a middleware that automates sharding logic.

Key Revelation: Sharding introduces significant complexity and should only be considered after exhausting other optimizations like caching, replication, and partitioning.

Core Concepts

What is Sharding?

Sharding splits a large table into smaller chunks (shards) distributed across multiple database instances.
Each shard has the same schema but holds a subset of data.
Partition key (e.g., URL ID) determines which shard stores a record.

Consistent Hashing

Maps data to servers using a hash function (e.g., hash(input) % number_of_shards).
Ensures the same input always routes to the same server, enabling predictable scaling.
Example: hash("input1") → Server 5432, hash("input2") → Server 5433.

Horizontal Partitioning vs Sharding

Horizontal Partitioning	Sharding
Splits data into tables within the same database.	Splits data across multiple databases.
Client knows partition names (e.g., `customers_west`).	Client uses consistent hashing to locate shards.

Key Characteristics

Scalability

Distributes data and load (CPU, memory) across servers.
Enables horizontal scaling for high-traffic systems.

Security

Restrict access to specific shards (e.g., VIP customer data on a secure shard).

Index Optimization

Smaller indexes per shard → faster queries.

Advantages & Disadvantages

Pros

Scalability: Handle larger datasets and higher traffic.
Fault Isolation: A failing shard doesn’t crash the entire system.
Security: Fine-grained access control per shard.

Cons

Complex Clients: Clients must know sharding logic.
No Cross-Shard Transactions: Atomic commits across shards are nearly impossible.
Schema Changes: Altering tables requires updating all shards.
Joins and Queries: Queries without partition keys must scan all shards.

Practical Implementation

Spin Up Postgres Shards with Docker

Create an init.sql script to define the table schema.
Build a custom Docker image to auto-run the script on startup.
Launch three Postgres instances on ports 5432, 5433, and 5434.

NodeJS Example (URL Shortener)

Consistent Hashing: Use hashring library to map URLs to shards.

Write Flow:

const hash = crypto.createHash("sha256").update(url).digest("base64");
const urlId = hash.substring(0, 5);
const serverPort = hashring.get(urlId);
await clients[serverPort].query("INSERT INTO url_table VALUES ($1, $2)", [
  url,
  urlId,
]);

Read Flow: Hash the URL ID to find the correct shard, then query it.

Conclusion

Sharding is powerful but complex. Use it only after optimizing caching, replication, and partitioning.
Consistent hashing ensures data is evenly and predictably distributed.
Transactions and joins remain major challenges in sharded systems.

Lecture Highlight:
“Sharding makes you sound smart, but it’s the last thing you want to do. Exhaust all other options first.”

Last updated Apr 28, 2025

This work is licensed under a Attribution-NonCommercial 4.0 International license.

Venmo	Paypal

PREVIOUSPostgreSQL Partitioning – Real-World Lessons from Hussein's Database Lectures

NEXTWhy SQL Pagination With Offset is Slow and How to Optimize It