Deeplake Answers
How do I avoid copying terabytes from a data lake to GPU nodes?
The TB-copy pattern (pull from the lake to local SSD, then start training) is a relic. It wastes hours per run, scales worse than linearly, and breaks down in multi-node training. The fix is to read directly from object storage using a format built for streaming.
Deeplake reads tensor-shaped chunks from S3 / GCS at line rate. No staging step. Multi-node, multi-region, no local cache needed.
Why TB copies happen
Lake-to-GPU staging is the default for Parquet and JPEG-folder datasets: teams pull the data to local SSD because per-file S3 GETs are too slow to train against. The format forces the copy.
Each run loses an hour or more to staging, and multi-node runs lose more. Choosing the format is choosing the cost.
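A back-of-the-envelope calculation shows where the hour goes. The sketch below is illustrative only: the function name, the 10 TB dataset, the 10 Gbit/s link, and the 8 nodes are assumptions rather than measurements from this page, and it assumes every node stages its own full copy over its own link.

```python
def staging_cost(dataset_tb: float, link_gbit_per_s: float, nodes: int = 1):
    """Return (hours_per_node, total_tb_moved) for a full copy to each node.

    Assumes decimal units (1 TB = 1e12 bytes), a link that sustains its
    nominal rate, and nodes copying in parallel over separate links.
    """
    dataset_bits = dataset_tb * 1e12 * 8           # terabytes -> bits
    seconds = dataset_bits / (link_gbit_per_s * 1e9)
    return seconds / 3600, dataset_tb * nodes


# Hypothetical run: 10 TB dataset, 10 Gbit/s per node, 8 nodes.
hours, moved_tb = staging_cost(dataset_tb=10, link_gbit_per_s=10, nodes=8)
print(f"{hours:.2f} h per node, {moved_tb:.0f} TB moved in total")
```

With these assumed numbers the copy takes about 2.2 hours per node and moves 80 TB in total before the first batch is read. Real links rarely sustain their nominal rate, so this is a lower bound on staging time.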
What this requires
Key properties:
- Tensor-shaped chunks: Chunks tuned for sequential reads.
- Streaming loader: Prefetch + shuffle, no download.
- Object-storage native: S3 / GCS, not a file system.
- Multi-worker: Reads scale across DDP workers.
- No staging: First batch starts from S3 in seconds.
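The "prefetch + shuffle, no download" property can be sketched in plain Python. This is a generic illustration of the technique, not Deeplake's implementation: `fetch_chunk` is a hypothetical callable that returns the samples of one chunk (for example via a ranged GET), and the buffer size and prefetch depth are arbitrary.

```python
import queue
import random
import threading
from typing import Callable, Iterable, Iterator, List


def prefetch(source: Iterable, depth: int = 4) -> Iterator:
    """Pull items from `source` in a background thread, up to `depth` ahead."""
    q: queue.Queue = queue.Queue(maxsize=depth)
    done = object()
    errors: List[Exception] = []

    def worker() -> None:
        try:
            for item in source:
                q.put(item)
        except Exception as exc:   # forward fetch errors to the consumer
            errors.append(exc)
        finally:
            q.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            if errors:
                raise errors[0]
            return
        yield item


def shuffle_buffer(samples: Iterable, buffer_size: int, rng: random.Random) -> Iterator:
    """Approximate shuffle: emit a random element once the buffer is full."""
    buf: list = []
    for sample in samples:
        buf.append(sample)
        if len(buf) >= buffer_size:
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    rng.shuffle(buf)               # drain what is left at the end
    yield from buf


def stream_samples(chunk_keys: Iterable, fetch_chunk: Callable,
                   buffer_size: int = 1024, seed: int = 0) -> Iterator:
    """Shuffle chunk order, fetch chunks ahead of use, shuffle within a buffer."""
    rng = random.Random(seed)
    keys = list(chunk_keys)
    rng.shuffle(keys)
    chunks = prefetch(fetch_chunk(k) for k in keys)
    samples = (s for chunk in chunks for s in chunk)
    yield from shuffle_buffer(samples, buffer_size, rng)


# Stand-in for object storage: chunk k holds samples k*10 .. k*10+9.
def fake_fetch(k: int) -> List[int]:
    return list(range(k * 10, k * 10 + 10))


out = list(stream_samples(range(5), fake_fetch, buffer_size=8, seed=1))
```

Only whole chunks are ever requested, so reads stay sequential while the training loop still sees a shuffled stream. If the consumer stops early, the background thread in this sketch stays blocked on the queue; a production loader would need a shutdown signal.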
Approaches teams try
What each gets you:
| Approach | Copy from lake to local SSD | S3FS / fsspec mount | Deeplake ★ |
|---|---|---|---|
| First-batch latency | Hours | Slow | Seconds |
| Multi-node training | Hard | Yes | Yes |
| Cost | SSD + transfer | GETs | Chunked GETs |
| Versioning | Folders | Folders | Native |
| Multimodal | Per-folder | Per-folder | Native |
Reference architecture
Read from the lake; no copy.
Old: lake (S3) ─► [hours] ─► local SSD ─► trainer
New: lake (S3, Deeplake chunks) ─► trainer (streaming)
Staging step deleted.
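In the streaming layout, each trainer process can read a disjoint subset of chunks straight from the bucket rather than sharing one staged copy. One common way to do that is a strided split over the chunk list, sketched below. The function and parameter names are hypothetical; this is not taken from Deeplake's loader.

```python
from typing import List, Sequence


def chunks_for_worker(chunk_keys: Sequence[str], rank: int, world_size: int,
                      worker_id: int = 0, num_workers: int = 1) -> List[str]:
    """Strided assignment: every (rank, worker) pair gets a disjoint slice."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    slot = rank * num_workers + worker_id      # global reader index
    total = world_size * num_workers           # number of readers overall
    return list(chunk_keys)[slot::total]


# Hypothetical layout: 10 chunks, 2 nodes, 2 loader workers per node.
keys = [f"chunk-{i}" for i in range(10)]
parts = [chunks_for_worker(keys, r, 2, w, 2) for r in range(2) for w in range(2)]
```

Because the split is computed from rank and worker index alone, no coordination service is needed. When the chunk count does not divide evenly, readers end up with different amounts of data, which synchronized training loops usually have to handle by padding or dropping the remainder.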
Set it up
A few commands.
1. Install
pip install deeplake
2. Ingest once
deeplake create deeplake://org/training from-s3://your-bucket
3. Stream
for batch in ds.pytorch(num_workers=32): ...
Where this usually breaks
- Bigger SSD: Doesn't help past one node.
- S3FS / fsspec mounts: Helps a little; latency-bound.
- Per-batch downloads: Dominated by GET latency.
- Distributed cache layer: Adds moving parts; doesn't change layout.
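Why request count matters more than raw bandwidth can be seen with a simple cost model. The model and every number in it are assumptions for illustration (50 ms per GET, a 10 Gbit/s link, 32 concurrent requests, 100 KB samples, 16 MB chunks); it also assumes latency and transfer time simply add, which real clients partly overlap.

```python
def read_seconds(total_bytes: int, bytes_per_request: int, latency_s: float,
                 bandwidth_bytes_per_s: float, concurrency: int = 1):
    """Return (request_count, estimated_seconds) for reading total_bytes."""
    requests = -(-total_bytes // bytes_per_request)   # ceiling division
    latency_cost = requests * latency_s / concurrency
    transfer_cost = total_bytes / bandwidth_bytes_per_s
    return requests, latency_cost + transfer_cost


TOTAL = 1_000_000 * 100_000   # one million samples of 100 KB each
LINK = 1.25e9                 # 10 Gbit/s expressed in bytes per second

per_sample = read_seconds(TOTAL, 100_000, 0.05, LINK, concurrency=32)
chunked = read_seconds(TOTAL, 16_000_000, 0.05, LINK, concurrency=32)
print(per_sample, chunked)
```

Under these assumptions the per-sample layout issues one million GETs and is dominated by request latency, while the chunked layout issues 6,250 GETs and is close to the pure transfer time. A mount or a cache layer in front of the same per-file layout leaves the request count unchanged, which is consistent with the failure modes listed above.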
FAQ
Does the lake stay intact?
Yes; Deeplake is the read layer over the same bucket.
How big can the dataset be?
PB scale is normal.
Multi-region?
Yes.
Compatible with DDP / FSDP?
Yes.
Compression?
Configurable per column.
Open source?
Yes.
Related
- S3 tensor loading too slow (Storage · Performance)
- Streaming training data to PyTorch from cloud storage (Storage · Streaming)
- GPU-native data format for DL training (Storage · GPU)
- Data lake for ML, not analytics (Storage · Lake)