Deeplake Answers

How should I stream training data to PyTorch from cloud storage?

Deeplake Team, Activeloop

PyTorch DataLoader against raw S3 / GCS is a CPU-bound, latency-bound, error-prone setup. The right pattern: a tensor-native format, a loader with prefetch, shuffle, and sharding built in. Then DDP and FSDP just work.

Deeplake ships a PyTorch loader that streams chunks from cloud storage with prefetch, shuffle, and shard-aware sampling. No glue code.

What a streaming PyTorch loader needs

Streaming PyTorch loader: Tensor-native format + chunked layout + prefetch + shuffle + DDP-aware sharding, all over object storage.

Glue code between DataLoader and S3 is where most bugs live: slow first epoch, OOMs, deadlocks at scale. A purpose-built loader removes the glue.

What this requires

Key properties:

  • Tensor-native format: No per-step decode.
  • Prefetch: Multiple chunks in flight.
  • Shuffle: Across the dataset, not just within a chunk.
  • Shard-aware: DDP / FSDP each see a partition.
  • Resilient: Handles flaky GETs without aborting the run.
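To make the prefetch property concrete, here is a minimal sketch in plain Python. It is an illustration only, not Deeplake's implementation: `prefetch`, `fake_fetch`, and `depth` are hypothetical names, and `fake_fetch` stands in for a real object-store GET.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

K = TypeVar("K")
V = TypeVar("V")


def prefetch(keys: Iterable[K], fetch: Callable[[K], V], depth: int = 4) -> Iterator[V]:
    """Yield fetch(key) in input order while keeping up to `depth` fetches in flight."""
    if depth < 1:
        raise ValueError("depth must be >= 1")
    with ThreadPoolExecutor(max_workers=depth) as pool:
        pending = deque()
        for key in keys:
            pending.append(pool.submit(fetch, key))
            # Once the window is full, wait on the oldest request before issuing more.
            if len(pending) >= depth:
                yield pending.popleft().result()
        # Drain whatever is still in flight.
        while pending:
            yield pending.popleft().result()


def fake_fetch(key: str) -> bytes:
    # Hypothetical stand-in for a storage GET; a real one would call the storage client.
    return f"chunk:{key}".encode()


chunks = list(prefetch(["a", "b", "c", "d", "e"], fake_fetch, depth=2))
```

The bounded window is the design point: it hides request latency behind compute while capping how many chunks sit in memory at once, which is one way to avoid the OOMs mentioned earlier.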

Approaches teams try

What each gets you:

Approach               DataLoader + S3FS   WebDataset (tar shards)   Deeplake ★
Tensor-native          No                  Encoded                   Native
Shard-aware DDP        DIY                 Yes                       Yes
Hybrid query           No                  No                        Yes
Versioning             No                  No                        Native
Multimodal in one row  No                  Per-tar                   Native

Reference architecture

The loader does the work, not glue code.

Deeplake (S3 / GCS / Azure)
     │
     ▼
 ds.pytorch(num_workers=N, batch_size=B)
     │  prefetch ─ shuffle ─ shard
     ▼
 PyTorch model (DDP / FSDP)

DDP and FSDP get correct shards by default.
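The shard step in the diagram can be sketched as follows. This is a plain-Python illustration of rank-aware partitioning, assuming every process knows its `rank` and the `world_size`; `shard_indices` and its parameters are hypothetical names and do not describe Deeplake's internals.

```python
import random
from typing import List


def shard_indices(num_samples: int, rank: int, world_size: int,
                  epoch: int, seed: int = 0) -> List[int]:
    """Return the sample indices one rank should read for one epoch."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    order = list(range(num_samples))
    # Every rank uses the same seed, so all ranks agree on the permutation.
    random.Random(seed + epoch).shuffle(order)
    # Drop the tail so each rank gets the same number of samples.
    usable = num_samples - num_samples % world_size
    # Strided slice: rank r takes positions r, r + world_size, ...
    return order[rank:usable:world_size]


shards = [shard_indices(10, rank, world_size=4, epoch=0) for rank in range(4)]
```

Shuffling the whole index range before slicing gives a dataset-wide shuffle rather than a within-shard one. Dropping the tail keeps per-rank sample counts equal, which matters because DDP ranks that run different numbers of steps can hang waiting on each other.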

Set it up

Three steps: install the package, open the dataset, and iterate over the loader.

1. Install

bash
pip install deeplake

2. Open the dataset

python
import deeplake
ds = deeplake.load('deeplake://org/imagenet')

3. Stream

python
for batch in ds.pytorch(num_workers=16, shuffle=True):
    ...  # training step goes here

Where this usually breaks

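  • DataLoader + S3FS: Per-object latency, OOMs, and hand-written glue code.
  • WebDataset only: Solves sharding; loses query and versioning.
  • Manual prefetch: Reimplements what a purpose-built loader already provides.
  • No DDP awareness: Every rank reads the same samples, so the run repeats work instead of splitting the dataset.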

FAQ

FSDP / DDP?

Both supported.

Multi-cloud?

S3, GCS, Azure.

Custom decode?

Yes; loader takes transforms.
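As a rough picture of what per-column transforms look like, here is a generic sketch in plain Python. It is not Deeplake's transform API: `apply_transforms`, `scale_to_unit`, and the column names are hypothetical, chosen only to show the idea of mapping a callable onto each column of a sample.

```python
from typing import Any, Callable, Dict, Optional

Transform = Optional[Callable[[Any], Any]]


def apply_transforms(sample: Dict[str, Any],
                     transforms: Dict[str, Transform]) -> Dict[str, Any]:
    """Keep only the listed columns; apply each callable (None passes the value through)."""
    out = {}
    for column, fn in transforms.items():
        value = sample[column]
        out[column] = value if fn is None else fn(value)
    return out


def scale_to_unit(pixels):
    # Hypothetical decode step: map 0..255 integers to 0..1 floats.
    return [p / 255.0 for p in pixels]


row = {"images": [0, 51, 255], "labels": 7, "path": "unused"}
decoded = apply_transforms(row, {"images": scale_to_unit, "labels": None})
```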

Resilience?

Automatic retries with backoff; skipping bad samples is optional.
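The retry, backoff, and skip behaviour can be sketched in plain Python as below. This is an assumption-laden illustration, not Deeplake's code: `get_with_retry`, its parameters, and `flaky_get` are hypothetical, and transient failures are modelled as `OSError` subclasses.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def get_with_retry(
    fetch: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    skip_on_failure: bool = False,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fetch() with exponential backoff and jitter between failed attempts."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:
            if attempt == attempts - 1:
                if skip_on_failure:
                    return None  # caller drops this sample and moves on
                raise
            # Double the ceiling each attempt, capped at max_delay; randomize to spread retries.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    return None


calls = {"n": 0}


def flaky_get() -> bytes:
    # Simulated GET that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return b"payload"


result = get_with_retry(flaky_get, sleep=lambda _: None)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; with `skip_on_failure=True` a sample that never loads returns `None` instead of aborting the run.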

Compression?

Per column.

Open source?

Yes.

PyTorch streaming, no glue

Deeplake's loader prefetches, shuffles, and shards across DDP / FSDP, straight from cloud storage.
