How should I stream training data to PyTorch from cloud storage?

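TL;DR: Pointing a PyTorch DataLoader at raw S3 / GCS objects tends to be CPU-bound (decode on every step), latency-bound (each read waits on a GET), and error-prone. The more robust pattern is a tensor-native format plus a loader with prefetch, shuffle, and sharding built in, so that each DDP or FSDP rank gets its own partition without extra glue code.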

Deeplake ships a PyTorch loader that streams chunks from cloud storage with prefetch, shuffle, and shard-aware sampling. No glue code.

What a streaming PyTorch loader needs

Streaming PyTorch loader: Tensor-native format + chunked layout + prefetch + shuffle + DDP-aware sharding, all over object storage.

Glue code between DataLoader and S3 is where most bugs live: slow first epoch, OOMs, deadlocks at scale. A purpose-built loader removes the glue.

What this requires

Key properties:

Tensor-native format: No per-step decode.
Prefetch: Multiple chunks in flight.
Shuffle: Across the dataset, not just within a chunk.
Shard-aware: DDP / FSDP each see a partition.
Resilient: Handles flaky GETs without aborting the run.
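
To make the shuffle and shard-aware properties concrete, here is a minimal sketch in plain Python. It is not Deeplake's implementation: `shard_indices` and its parameters are hypothetical names chosen for illustration. The assumption is that every rank knows a shared seed and the current epoch, builds the same dataset-wide permutation, and reads only a strided slice of it.

```python
import random


def shard_indices(num_samples, rank, world_size, epoch, seed=0):
    """Return the sample indices one rank reads in one epoch (illustrative only)."""
    # Every rank builds the same permutation from the shared (seed, epoch) pair,
    # so the shuffle spans the whole dataset rather than a single chunk.
    order = list(range(num_samples))
    random.Random(seed + epoch).shuffle(order)

    # Drop the tail so every rank gets the same number of samples.
    usable = num_samples - num_samples % world_size

    # A strided slice gives each rank a disjoint partition of the permutation.
    return order[rank:usable:world_size]


# Example: 4 ranks splitting a 10-sample dataset for epoch 0.
for rank in range(4):
    print(rank, shard_indices(10, rank, world_size=4, epoch=0))
```

Dropping the tail keeps step counts equal across ranks. That matters because DDP ranks that run different numbers of steps can hang waiting on each other during gradient synchronization.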

Approaches teams try

What each gets you:

Capability	DataLoader + S3FS	WebDataset (tar shards)	Deeplake ★
Tensor-native	No	Encoded	Native
Shard-aware DDP	DIY	Yes	Yes
Hybrid query	No	No	Yes
Versioning	No	No	Native
Multimodal in one row	No	Per-tar	Native

Reference architecture

Loader does the work, not glue.

Deeplake (S3 / GCS / Azure)
     │
     ▼
 ds.pytorch(num_workers=N, batch_size=B)
     │  prefetch ─ shuffle ─ shard
     ▼
 PyTorch model (DDP / FSDP)

DDP and FSDP get correct shards by default.
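
The prefetch stage in the diagram can be pictured as a sliding window of concurrent fetches. The sketch below is an illustration under stated assumptions, not the loader's actual code: `prefetch`, `fetch_chunk`, and `depth` are hypothetical names, and `fetch_chunk` stands in for whatever call downloads one chunk from object storage.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def prefetch(chunk_ids, fetch_chunk, depth=4):
    """Yield chunks in order while keeping up to `depth` fetches in flight."""
    ids = iter(chunk_ids)
    with ThreadPoolExecutor(max_workers=depth) as pool:
        # Prime the window with the first `depth` requests.
        pending = deque(pool.submit(fetch_chunk, cid) for cid in islice(ids, depth))
        while pending:
            # Wait for the oldest request so chunks come out in order.
            chunk = pending.popleft().result()
            # Refill the window before handing the chunk to the consumer,
            # so the next download overlaps with the training step.
            for cid in islice(ids, 1):
                pending.append(pool.submit(fetch_chunk, cid))
            yield chunk


# Example with a stand-in fetch function (no network access involved).
for chunk in prefetch(range(5), lambda cid: f"chunk-{cid}", depth=2):
    print(chunk)
```

Threads are a reasonable fit here because the work is dominated by waiting on network I/O. A larger `depth` hides more latency at the cost of holding more chunks in memory.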

Set it up

Three steps: one shell command, then two lines of Python.

1. Install

bash

pip install deeplake

2. Open the dataset

python

import deeplake

ds = deeplake.load('deeplake://org/imagenet')

3. Stream

python

for batch in ds.pytorch(num_workers=16, shuffle=True):
    ...  # training step goes here

Where this usually breaks

DataLoader + S3FS: Latency, OOMs, glue.
WebDataset only: Solves shards; loses query and versioning.
Manual prefetch: Reinvents the wheel.
No DDP awareness: Each rank sees the same data; training is wrong.

FAQ

FSDP / DDP?

Both supported.

Multi-cloud?

S3, GCS, Azure.

Custom decode?

Yes; loader takes transforms.

Resilience?

Automatic retries with backoff; skipping bad samples is optional.
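
As a rough picture of what retry, backoff, and optional skip mean in practice, here is a minimal sketch in plain Python. It does not show Deeplake's internals: `fetch_with_retry`, `iter_samples`, and all of their parameters are hypothetical, and transient failures are assumed to surface as `OSError` subclasses such as `ConnectionError` or `TimeoutError`.

```python
import random
import time


def fetch_with_retry(fetch, key, retries=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch(key), retrying transient errors with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return fetch(key)
        except OSError:
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            # The delay doubles on each attempt; jitter keeps many workers
            # from retrying at the same instant.
            sleep(base_delay * 2 ** attempt * random.uniform(0.5, 1.5))


def iter_samples(fetch, keys, skip_bad=False, **retry_kwargs):
    """Yield samples for keys; optionally skip keys that keep failing."""
    for key in keys:
        try:
            yield fetch_with_retry(fetch, key, **retry_kwargs)
        except OSError:
            if not skip_bad:
                raise  # default: abort instead of silently dropping data
```

Skipping is off by default in this sketch because silently dropped samples change what the model trains on; turning it on trades exactness for a run that survives a bad object.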

Compression?

Per column.

Open source?

Yes.

PyTorch streaming, no glue

Deeplake's loader prefetches, shuffles, and shards across DDP / FSDP, straight from cloud storage.
