
System Design: File Storage Service


Amazon S3 stores over 100 trillion objects. Dropbox stores 1.5 exabytes. Google Drive handles billions of file operations per day. Object storage is the foundation of modern cloud infrastructure — everything from profile pictures to machine learning datasets to database backups lands in a system like this.

Let's design a file storage service that handles petabyte-scale data with high durability, low latency for reads, and cost-efficient storage tiers.

requirements

Functional:

  • Upload files (up to 5TB per file)
  • Download files by key
  • Delete files
  • List files with prefix filtering
  • Multipart upload for large files
  • Versioning (optional, per-bucket)
  • Pre-signed URLs for temporary access

Non-functional:

  • 99.999999999% durability (11 nines — the S3 standard)
  • 99.99% availability
  • Read latency under 100ms for hot data
  • Support for petabytes of total storage
  • Cost-efficient tiered storage (hot, warm, cold, archive)

high-level architecture

┌──────────┐     ┌───────────────┐     ┌──────────────────┐
│  Client  │────▶│  API Gateway  │────▶│  Metadata        │
│          │     │  (auth,       │     │  Service         │
│          │     │  routing)     │     │  (file index,    │
└──────────┘     └───────┬───────┘     │   permissions)   │
                         │             └────────┬─────────┘
                         │                      │
                  ┌──────▼──────┐       ┌───────▼──────┐
                  │  Upload /   │       │  Metadata    │
                  │  Download   │       │  DB          │
                  │  Service    │       │  (Postgres + │
                  └──────┬──────┘       │  DynamoDB)   │
                         │              └──────────────┘
                        │
          ┌─────────────┼─────────────┐
          │             │             │
     ┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
     │  Storage  │ │  Storage  │ │  Storage  │
     │  Node 1   │ │  Node 2   │ │  Node 3   │
     │  (disks)  │ │  (disks)  │ │  (disks)  │
     └───────────┘ └───────────┘ └───────────┘
                        │
                  ┌──────▼──────┐
                  │  CDN Edge   │
                  │  (hot data  │
                  │   cache)    │
                  └─────────────┘

The system separates metadata (file name, size, permissions, chunk locations) from data (the actual bytes). This is the key architectural decision. Metadata is small, queried frequently, and needs strong consistency. Data is large, written once, and needs high throughput.

chunking: the foundation

Large files are split into fixed-size chunks (typically 64MB, matching HDFS/GFS block size). Each chunk is stored independently and replicated.

File Upload Flow:

Original file: 256MB video
         │
         ▼
┌───────────────────┐
│  Chunker          │
│  Split into 64MB  │
│  chunks           │
└──────┬────────────┘
       │
       ├── Chunk A (64MB) ─▶ hash: sha256_aaa ─▶ Node 1, 3, 5
       ├── Chunk B (64MB) ─▶ hash: sha256_bbb ─▶ Node 2, 4, 6
       ├── Chunk C (64MB) ─▶ hash: sha256_ccc ─▶ Node 1, 4, 5
       └── Chunk D (64MB) ─▶ hash: sha256_ddd ─▶ Node 2, 3, 6

Metadata record:
{
  file_id: "f_001",
  name: "video.mp4",
  size: 268435456,
  chunks: [
    {id: "sha256_aaa", nodes: [1, 3, 5]},
    {id: "sha256_bbb", nodes: [2, 4, 6]},
    {id: "sha256_ccc", nodes: [1, 4, 5]},
    {id: "sha256_ddd", nodes: [2, 3, 6]}
  ]
}

Why 64MB chunks? Smaller chunks mean more metadata overhead and more network calls during upload/download. Larger chunks waste space for small files and make replication slower. 64MB is the industry-standard compromise. For small files (< 64MB), store the entire file as a single chunk.

Content-addressed storage: Each chunk is identified by its content hash (SHA-256). This enables deduplication — if two users upload the same file, it's stored once. The metadata entries point to the same chunks.
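
A minimal sketch of the chunk-and-hash step in Python (the function name and the in-memory store are illustrative, not part of any real API):

import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # 64MB, matching the design above

def chunk_file(path):
    """Split a file into fixed-size chunks, each keyed by its content hash."""
    chunks = []
    with open(path, "rb") as f:
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            chunk_id = "sha256_" + hashlib.sha256(data).hexdigest()
            chunks.append((chunk_id, data))
    return chunks

# Content addressing gives deduplication for free: uploading the same
# bytes twice produces the same chunk_id, so the store keeps one copy
# and two metadata entries point at it.
store = {}
for chunk_id, data in chunk_file("video.mp4"):
    store.setdefault(chunk_id, data)  # no-op if the chunk already exists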

replication and durability

11 nines of durability means that if you store 10 million objects, you can expect to lose a single object about once every 10,000 years. That requires serious redundancy.
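
To spell out that back-of-the-envelope arithmetic:

# 11 nines = annual loss probability of 1e-11 per object.
annual_loss_probability = 1e-11
objects_stored = 10_000_000

expected_losses_per_year = objects_stored * annual_loss_probability
print(expected_losses_per_year)      # 0.0001 objects per year
print(1 / expected_losses_per_year)  # one lost object per 10,000 years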

Option 1: Replication. Store 3 copies of each chunk on different nodes, in different racks, ideally in different data centers. Simple and fast for reads (read from nearest copy). Expensive — 3x storage overhead.

Option 2: Erasure coding. Split each chunk into k data fragments and generate m parity fragments. Any k of the k+m fragments can reconstruct the original. Reed-Solomon coding with k=10, m=4 gives 1.4x storage overhead (vs. 3x for replication) with similar durability.

Erasure Coding (Reed-Solomon k=10, m=4):

Original chunk (64MB)
  │
  ▼ Split + encode
  ├── Data Fragment  1 (6.4MB) ──▶ Node 1
  ├── Data Fragment  2 (6.4MB) ──▶ Node 2
  ├── Data Fragment  3 (6.4MB) ──▶ Node 3
  ├── ... (fragments 4-10)
  ├── Parity Fragment 1 (6.4MB) ──▶ Node 11
  ├── Parity Fragment 2 (6.4MB) ──▶ Node 12
  ├── Parity Fragment 3 (6.4MB) ──▶ Node 13
  └── Parity Fragment 4 (6.4MB) ──▶ Node 14

Can lose ANY 4 nodes and still reconstruct.
Storage overhead: 14/10 = 1.4x (vs. 3x replication)
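
Real Reed-Solomon encoding works over a Galois field; the simplest member of the family, single-parity XOR (m=1), shows the reconstruction idea in a few lines. A toy sketch with k=3, not production code:

# Toy erasure code: k data fragments + 1 XOR parity fragment (m=1).
# Reed-Solomon generalizes this to m parity fragments; the
# reconstruction idea is the same.

def encode(fragments: list[bytes]) -> bytes:
    """Parity fragment = byte-wise XOR of all fragments."""
    parity = bytearray(len(fragments[0]))
    for frag in fragments:
        for i, b in enumerate(frag):
            parity[i] ^= b
    return bytes(parity)

def reconstruct(survivors: list[bytes]) -> bytes:
    """Recover the single missing fragment (data or parity):
    it's the XOR of every surviving fragment."""
    return encode(survivors)

data = [b"AAAA", b"BBBB", b"CCCC"]  # k=3 data fragments
parity = encode(data)               # m=1 parity fragment

# Lose fragment 1; rebuild it from the other fragments + parity.
recovered = reconstruct([data[0], data[2], parity])
assert recovered == data[1]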

S3 uses erasure coding for durability and replication for availability. Hot data (frequently accessed) is replicated for fast reads. Cold data uses erasure coding to save storage costs.

the metadata layer

The metadata service is the brain. It maps human-readable paths to chunk locations and handles all the permission logic.

Metadata Schema:

buckets:
┌───────────┬──────────┬─────────────┐
│ bucket_id │ owner_id │ created_at  │
└───────────┴──────────┴─────────────┘

objects:
┌───────────┬───────────┬──────────┬──────┬──────────┐
│ object_id │ bucket_id │ key      │ size │ chunks   │
│           │           │ (path)   │      │ (JSON)   │
├───────────┼───────────┼──────────┼──────┼──────────┤
│ obj_001   │ b_photos  │ /2026/   │ 2.1M │ [chunk1] │
│           │           │ pic.jpg  │      │          │
└───────────┴───────────┴──────────┴──────┴──────────┘

Database choice matters. S3 originally used a custom key-value store. For our design:

  • DynamoDB (or Cassandra) for the object index — handles billions of keys with consistent hashing and horizontal scaling
  • Postgres for bucket metadata, permissions, and billing — needs strong consistency and transactions

Listing optimization: LIST /bucket/photos/2026/ needs to return all objects with that prefix efficiently. In DynamoDB, use the prefix as a sort key. In Cassandra, it's a range scan within a partition. This is why S3 prefixes matter for performance — they map directly to partition scans.
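
As a sketch, assuming a DynamoDB objects table with bucket_id as the partition key and the object key (path) as the sort key, a prefix LIST becomes a single range query:

import boto3
from boto3.dynamodb.conditions import Key

# Assumed schema: partition key "bucket_id", sort key "key".
table = boto3.resource("dynamodb").Table("objects")

response = table.query(
    KeyConditionExpression=(
        Key("bucket_id").eq("b_photos")
        & Key("key").begins_with("/2026/")
    ),
    Limit=1000,  # page size; follow LastEvaluatedKey to continue
)
for item in response["Items"]:
    print(item["key"], item["size"])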

multipart upload

Large files need multipart upload. A 5TB file uploaded as a single HTTP request will almost certainly fail (network timeouts, memory limits) and can't be resumed.

Multipart Upload Flow:

1. Client: POST /uploads → {upload_id: "up_123"}

2. Client uploads parts in parallel:
   PUT /uploads/up_123/part/1 ──▶ 64MB ──▶ stored, returns ETag
   PUT /uploads/up_123/part/2 ──▶ 64MB ──▶ stored, returns ETag
   PUT /uploads/up_123/part/3 ──▶ 64MB ──▶ stored, returns ETag
   (parts can be uploaded in any order)

3. Client: POST /uploads/up_123/complete
   body: [{part: 1, etag: "..."}, {part: 2, etag: "..."}, ...]
   → Server assembles metadata
   → Object becomes visible

Parallelism: The client uploads multiple parts simultaneously. A single TCP stream often can't fill a fast, high-latency link; if one stream sustains roughly 100 Mbps, a 5GB file takes ~500 seconds. Ten parallel streams can saturate a 1 Gbps connection and finish in ~50 seconds.

Resume on failure: If part 37 of 100 fails, retry only part 37. The other 99 parts are already stored. This is why S3's multipart upload is the default for any file over 100MB.
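
A client-side sketch against the hypothetical endpoints above (the storage.example.com base URL is an assumption, not a real SDK):

import concurrent.futures

import requests

BASE = "https://storage.example.com"  # hypothetical API host

def upload_part(upload_id, part_num, data, retries=3):
    """Upload a single part; on failure, retry only this part."""
    for attempt in range(retries):
        try:
            resp = requests.put(
                f"{BASE}/uploads/{upload_id}/part/{part_num}",
                data=data, timeout=60,
            )
            resp.raise_for_status()
            return {"part": part_num, "etag": resp.headers["ETag"]}
        except requests.RequestException:
            if attempt == retries - 1:
                raise

def multipart_upload(upload_id, parts):
    """Upload parts in parallel (order doesn't matter), then complete."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
        futures = [pool.submit(upload_part, upload_id, i + 1, data)
                   for i, data in enumerate(parts)]
        etags = sorted((f.result() for f in futures),
                       key=lambda e: e["part"])
    # Complete: the server assembles metadata, the object becomes visible.
    requests.post(f"{BASE}/uploads/{upload_id}/complete",
                  json=etags, timeout=60).raise_for_status()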

Cleanup: If an upload is abandoned (no complete call), the parts sit as orphans consuming storage. Run a background job that deletes incomplete uploads older than 7 days. S3 has a lifecycle policy for this.
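
In S3 itself, that cleanup is a lifecycle rule. A boto3 sketch, with a placeholder bucket name:

import boto3

s3 = boto3.client("s3")
# Abort multipart uploads not completed within 7 days,
# reclaiming the orphaned parts.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [{
            "ID": "abort-stale-multipart-uploads",
            "Status": "Enabled",
            "Filter": {},  # apply to the whole bucket
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }]
    },
)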

CDN integration

Files accessed frequently (profile pictures, product images, static assets) should be served from edge locations near the user.

Read Path with CDN:

User in Tokyo requests image:
┌─────────┐     ┌────────────┐
│ Client  │────▶│ CDN Edge   │ ── cache hit ──▶ return (5ms)
│ Tokyo   │     │ Tokyo      │
└─────────┘     └─────┬──────┘
                      │ cache miss
                      ▼
               ┌─────────────┐
               │  Origin     │
               │  (US-East)  │
               │  Storage    │
               └──────┬──────┘
                      │
               return + cache at edge
               (next request: 5ms from Tokyo)

Cache-Control headers determine CDN behavior:

  • public, max-age=31536000, immutable for versioned assets (hash in filename)
  • private, no-cache for user-specific content
  • public, max-age=3600, stale-while-revalidate=86400 for semi-dynamic content

Pre-signed URLs solve the authentication problem. Generate a time-limited URL that grants access without exposing API keys. The CDN forwards the signed URL to the origin, which validates the signature before serving.
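
A sketch of the signing scheme using a plain HMAC (S3's actual SigV4 signing is more involved; the key and TTL here are illustrative):

import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"server-side-signing-key"  # never leaves the server

def presign(path, ttl_seconds=3600):
    """Return a time-limited URL granting access to one object."""
    expires = int(time.time()) + ttl_seconds
    payload = f"{path}:{expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires, 'signature': sig})}"

def verify(path, expires, signature):
    """Origin-side check before serving the object."""
    if time.time() > expires:
        return False
    payload = f"{path}:{expires}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)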

storage tiers

Not all data needs the same performance. A photo from 5 years ago is accessed 1000x less often than yesterday's upload. Tiered storage optimizes cost.

Storage Tiers:
┌──────────┬─────────────┬──────────┬───────────────┐
│ Tier     │ Access time │ Cost/GB  │ Use case      │
├──────────┼─────────────┼──────────┼───────────────┤
│ Hot      │ < 100ms     │ $0.023   │ Recent uploads│
│ Warm     │ < 100ms     │ $0.0125  │ 30-90 day old │
│ Cold     │ < 12 hours  │ $0.004   │ Archival      │
│ Glacier  │ < 48 hours  │ $0.00099 │ Compliance    │
└──────────┴─────────────┴──────────┴───────────────┘

Lifecycle policies automate tier transitions. "Move to warm after 30 days, cold after 90 days, glacier after 365 days." The metadata service tracks the current tier and routes reads accordingly. For cold/glacier, the read triggers a restore operation that makes the data available on hot storage temporarily.
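
A sketch of that background job (the metadata-service calls and the migrate_chunks helper are hypothetical stand-ins):

import time

# Illustrative rules: days allowed in each tier before demotion.
TRANSITIONS = {
    "hot":  (30,  "warm"),     # move to warm after 30 days
    "warm": (60,  "cold"),     # i.e. 90 days after upload
    "cold": (275, "glacier"),  # i.e. 365 days after upload
}

def run_lifecycle_pass(metadata_db):
    """One sweep of the lifecycle job. metadata_db stands in for the
    metadata service; migrate_chunks is a hypothetical helper that copies
    chunks to the target tier and deletes the old copies."""
    now = time.time()
    for obj in metadata_db.scan_objects():
        rule = TRANSITIONS.get(obj["tier"])
        if rule is None:
            continue  # already in the coldest tier
        max_days, next_tier = rule
        days_in_tier = (now - obj["tier_entered_at"]) / 86400
        if days_in_tier >= max_days:
            migrate_chunks(obj, next_tier)
            metadata_db.update_tier(obj["object_id"], next_tier, now)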

key tradeoffs

Replication vs. erasure coding: Replication is simpler and faster for reads (just read one copy). Erasure coding uses 50% less storage but requires reading from multiple nodes and decoding. Use replication for hot data, erasure coding for cold data.

Consistency vs. performance on writes: Should a write be confirmed after the primary receives it, or after all replicas confirm? S3 achieves strong read-after-write consistency (since December 2020) by using a consensus protocol on the metadata. Data is eventually replicated but the metadata immediately reflects the new object.

Small file overhead: A 1KB file in a 64MB chunk system wastes space and metadata. Solution: pack small files into aggregate blocks (similar to tar archives). Store the offset and length in metadata. This is how HDFS handles small files and why S3 charges a per-request fee — small files cost them disproportionately in metadata overhead.
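
A sketch of the packing idea, with (offset, length) recorded per file (names are illustrative):

def pack_small_files(files):
    """Concatenate small files into one aggregate block; the index
    records each file's (offset, length) within the block."""
    block = bytearray()
    index = {}
    for name, data in files.items():
        index[name] = {"offset": len(block), "length": len(data)}
        block.extend(data)
    return bytes(block), index

def read_small_file(block, index, name):
    entry = index[name]
    return block[entry["offset"]:entry["offset"] + entry["length"]]

block, index = pack_small_files({"a.txt": b"hello", "b.txt": b"world"})
assert read_small_file(block, index, "b.txt") == b"world"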

Deduplication scope: Content-addressed storage deduplicates within the system automatically. But cross-user deduplication raises privacy concerns — if User A's upload is instant because the file already exists (from User B), you've leaked that User B has the same file. Dropbox had this exact security issue. Solution: deduplicate within a user's account only, or use convergent encryption where the encryption key is derived from the content.
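
A sketch of convergent encryption using the cryptography package (deriving both key and nonce from the content is the defining trick; it deliberately trades some confidentiality for dedup):

import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def convergent_encrypt(plaintext):
    """Key is derived from the content itself, so identical plaintexts
    encrypt to identical ciphertexts: dedup on ciphertext still works,
    and the provider holds no usable key without the original content."""
    key = hashlib.sha256(plaintext).digest()   # 32-byte AES key
    nonce = hashlib.sha256(key).digest()[:12]  # deterministic nonce
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
    return key, ciphertext

# Identical content -> identical ciphertext -> dedup works on ciphertext.
k1, c1 = convergent_encrypt(b"same file bytes")
k2, c2 = convergent_encrypt(b"same file bytes")
assert c1 == c2 and k1 == k2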

Start with replication for durability, a CDN for read performance, and lifecycle policies for cost management. Add erasure coding and deduplication when storage costs become a real problem.

