Database Performance Deep Dive: UUIDs, Bloom Filters, Indexes & Scaling

Database Performance Deep Dive

Bloom Filters

Introduction

A memory-efficient probabilistic structure for checking element existence.

Core Concepts

Bit Array: Stores hashed values (e.g., 64-bit).
Hash Functions: Map elements to bit positions (e.g., hash("Paul") % 64 = 3).

Key Characteristics

Guarantees:
- No false negatives: Unset bit = element definitely absent.
- Possible false positives: Set bit = element might exist.
Use Cases:
- Avoid expensive DB checks (e.g., “Does username exist?”).
- Used in Cassandra for consistent hashing.

Implementation Steps

Initialize a bit array (e.g., 64 bits).
For insertion: Hash the element and set corresponding bits.
For queries: Check if all hashed bits are set.

Limitations

Bits fill up over time, leading to more false positives.
Requires tuning (size, hash functions) for accuracy.

Working with Billion-Row Tables

Introduction

Strategies to manage extremely large tables.

Core Strategies

Brute Force: Parallel processing (MapReduce).
Indexing: Reduce search space via B-trees.
Partitioning: Split tables horizontally (e.g., by user ID range).
Sharding: Distribute partitions across machines.
Schema Redesign: Avoid large tables (e.g., store JSON arrays).

Example: Twitter Followers

Original Design: followers table with billions of rows.
Optimized Design: Store follower lists as JSON in user profiles.

Trade-offs

Sharding improves scalability but complicates transactions.
Partitioning requires careful key selection.

How UUIDs in B+Tree Indexes Affect Performance

Introduction

Explores the performance challenges of using random UUIDs (version 4) in database indexes, particularly B+Tree structures.

Core Concepts

UUIDv4: 128-bit random identifiers with no inherent order.
B+Tree Indexes: Ordered data structures where leaf nodes store data pages.

Key Characteristics

Page Splits:
- Random inserts force frequent splits in B+Tree leaf nodes.
- Example: Inserting 80 between 10 and 90 splits the page, increasing I/O.
Buffer Pool Thrashing:
- Random access patterns overload shared memory buffers, causing frequent cache evictions.
Shopify’s Solution:
- Switched to ULID (time-ordered UUIDs) for sequential inserts, reducing splits.

Advantages & Disadvantages

Random UUIDs	Ordered UUIDs (ULID)
✔️ Globally unique	✔️ Reduced page splits
❌ High write latency	❌ Less random uniqueness

Practical Example

-- Problematic UUIDv4 usage in MySQL
CREATE TABLE orders (
    id BINARY(16) PRIMARY KEY, -- Random UUID
    ...
);

Conclusion

Use ordered IDs (e.g., ULID) for high-write tables to minimize splits and buffer churn.

The Cost of Long-Running Transactions

Introduction

Impact of failed long transactions in PostgreSQL.

Core Concepts

Row Versioning: Updates create new row versions; indexes must update.
HOT (Heap Only Tuples): Bypasses index updates if page has space.

Key Issues

Dead Rows: Rollbacks leave invalid rows, requiring VACUUM.
I/O Overhead: Pages with few live rows waste disk reads.

Cleanup Approaches

Eager (Undo Logs)	Lazy (PostgreSQL VACUUM)
✔️ Immediate cleanup	✔️ Non-blocking
❌ Slows rollbacks	❌ Delayed resource reuse

Conclusion

Monitor long transactions and tune VACUUM frequency to balance performance.

Microsoft SQL Server Clustered Index Design

Introduction

Differences between clustered and nonclustered indexes in SQL Server.

Core Concepts

Clustered Index:
- Leaf nodes = data pages; determines physical row order.
- One per table.
Nonclustered Index:
- Leaf nodes = index pages with pointers (RID or clustered keys).