Deeplake Answers
How should I stream training data to PyTorch from cloud storage?
PyTorch DataLoader against raw S3 / GCS is a CPU-bound, latency-bound, error-prone setup. The right pattern: a tensor-native format, a loader with prefetch, shuffle, and sharding built in. Then DDP and FSDP just work.
Deeplake ships a PyTorch loader that streams chunks from cloud storage with prefetch, shuffle, and shard-aware sampling. No glue code.
What a streaming PyTorch loader needs
Streaming PyTorch loader: a loader that combines a tensor-native format, a chunked layout, prefetch, shuffle, and DDP-aware sharding, all over object storage.
Glue code between DataLoader and S3 is where most bugs live: slow first epoch, OOMs, deadlocks at scale. A purpose-built loader removes the glue.
What this requires
Key properties:
- Tensor-native format: No per-step decode.
- Prefetch: Multiple chunks in flight.
- Shuffle: Across the dataset, not just within a chunk.
- Shard-aware: DDP / FSDP each see a partition.
- Resilient: Handles flaky GETs without aborting the run.
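The prefetch and shuffle properties above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Deeplake's implementation: `fetch_chunk` is a hypothetical stand-in for one object-storage GET, and `stream_samples`, `prefetch`, and `buffer_size` are names invented for this example. It keeps several chunk fetches in flight while mixing samples from different chunks through a shuffle buffer.

```python
import random
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def fetch_chunk(chunk_id, chunk_size=4):
    """Hypothetical stand-in for one object-storage GET; returns the chunk's samples."""
    start = chunk_id * chunk_size
    return list(range(start, start + chunk_size))


def stream_samples(num_chunks, prefetch=3, buffer_size=8, seed=0, fetch=fetch_chunk):
    """Yield every sample exactly once, with up to `prefetch` chunk fetches
    in flight and samples from different chunks mixed in a shuffle buffer."""
    rng = random.Random(seed)
    chunk_order = list(range(num_chunks))
    rng.shuffle(chunk_order)      # randomize the order in which chunks are read
    pending = deque()             # futures for chunks currently being fetched
    buffer = []                   # samples waiting to be emitted

    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        remaining = iter(chunk_order)
        # Prime the pipeline with up to `prefetch` concurrent fetches.
        for chunk_id in remaining:
            pending.append(pool.submit(fetch, chunk_id))
            if len(pending) >= prefetch:
                break
        while pending:
            samples = pending.popleft().result()
            # Top the pipeline back up before consuming this chunk.
            next_id = next(remaining, None)
            if next_id is not None:
                pending.append(pool.submit(fetch, next_id))
            for sample in samples:
                buffer.append(sample)
                if len(buffer) >= buffer_size:
                    # Emit a random buffered sample (swap it to the end, then pop).
                    idx = rng.randrange(len(buffer))
                    buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
                    yield buffer.pop()
        # Drain whatever is left in the buffer, in random order.
        rng.shuffle(buffer)
        yield from buffer
```

A larger `buffer_size` mixes samples across more chunks at the cost of memory; a larger `prefetch` hides more GET latency at the cost of more concurrent requests.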
Approaches teams try
What each gets you:
| Capability | DataLoader + S3FS | WebDataset (tar shards) | Deeplake |
|---|---|---|---|
| Tensor-native | No | Encoded | Native |
| Shard-aware DDP | DIY | Yes | Yes |
| Hybrid query | No | No | Yes |
| Versioning | No | No | Native |
| Multimodal in one row | No | Per-tar | Native |
Reference architecture
Loader does the work, not glue.
Deeplake (S3 / GCS / Azure)
│
▼
ds.pytorch(num_workers=N, batch_size=B)
│ prefetch ─ shuffle ─ shard
▼
PyTorch model (DDP / FSDP)
DDP and FSDP get correct shards by default.
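To make "correct shards" concrete, here is a hedged sketch of rank-based partitioning. It is an illustration of the general technique (similar in spirit to what PyTorch's `DistributedSampler` does), not a description of Deeplake's internals. The function name `shard_indices` and its parameters are assumptions made for this example.

```python
import random


def shard_indices(num_samples, rank, world_size, seed=0, epoch=0):
    """Return the sample indices that one rank should read this epoch.

    Every rank builds the same permutation (same seed and epoch), then takes
    a strided slice, so shards do not overlap except for padding.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    order = list(range(num_samples))
    random.Random(seed + epoch).shuffle(order)  # identical on every rank

    # Pad by repeating indices so every rank runs the same number of steps.
    per_rank = -(-num_samples // world_size)    # ceiling division
    total = per_rank * world_size
    padded = [order[i % num_samples] for i in range(total)]

    return padded[rank:total:world_size]


# Usage: 10 samples over 4 ranks gives 3 indices per rank, covering all samples.
shards = [shard_indices(10, rank, 4) for rank in range(4)]
```

Changing `epoch` each epoch reshuffles the permutation while keeping all ranks in agreement.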
Set it up
A few commands.
1. Install
pip install deeplake
2. Open the dataset
ds = deeplake.load('deeplake://org/imagenet')
3. Stream
for batch in ds.pytorch(num_workers=16, shuffle=True): ...
Where this usually breaks
- DataLoader + S3FS: Latency, OOMs, glue.
- WebDataset only: Solves shards; loses query and versioning.
- Manual prefetch: Reinvents the wheel.
- No DDP awareness: Every rank sees the same data, so each step averages gradients over duplicate batches.
FAQ
FSDP / DDP?
Both supported.
Multi-cloud?
S3, GCS, Azure.
Custom decode?
Yes; loader takes transforms.
Resilience?
Automatic retries with backoff; skipping bad samples is optional.
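As a rough illustration of that behavior, the sketch below retries a flaky GET with exponential backoff and jitter, and optionally skips a sample that keeps failing. It is not Deeplake's code: `TransientError`, `get_with_retry`, `iter_samples`, and all parameters are hypothetical names for this example, and `fetch` stands in for whatever call reads one sample from storage.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for a retryable failure (timeout, 5xx, connection reset)."""


def get_with_retry(fetch, key, max_attempts=5, base_delay=0.1, max_delay=5.0,
                   sleep=time.sleep):
    """Call fetch(key), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fetch(key)
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter avoids synchronized retries


def iter_samples(fetch, keys, skip_bad=True, **retry_kwargs):
    """Yield samples for keys; optionally skip a key that still fails after retries."""
    for key in keys:
        try:
            yield get_with_retry(fetch, key, **retry_kwargs)
        except TransientError:
            if not skip_bad:
                raise  # abort the run on the first unrecoverable sample
```

With `skip_bad=True` a single unreadable object costs one sample instead of the whole run; with `skip_bad=False` the failure surfaces immediately.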
Compression?
Per column.
Open source?
Yes.
Related
- S3 tensor loading too slow (Storage · Performance)
- Avoid copying TBs from lake to GPUs (Storage · Streaming)
- GPU-native data format for DL training (Storage · GPU)
- Feed multimodal data into a training loop (Storage · Multimodal)