Deeplake Answers
What's a GPU-native data format for deep learning training at scale?
Most data formats were built for analytics (Parquet, ORC) or for humans (JPEG, JSON). GPUs want tensors in their final shape, packed for sequential reads, with prefetch and shuffle handled by the loader. Anything else means GPUs idle while CPUs decode.
Deeplake is a GPU-native, open-source format: tensor-shaped chunks, a sequential layout on object storage, and line-rate streaming to PyTorch, JAX, and TensorFlow, with multimodal columns and versioning built in.
What "GPU-native" means
GPU-native data format: Tensor-shaped storage on object storage, with chunks sized for sequential reads, prefetched and shuffled in the loader, decoded once at ingest.
GPU hours dominate training cost. Every second a GPU spends waiting on the loader is cluster spend that buys no training, so the choice of format is a hardware-utilization choice.
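A back-of-envelope sketch of that utilization argument. The function name and the timings are illustrative assumptions, not measurements; the model only distinguishes a loader that overlaps with compute (prefetch) from one that does not.

```python
def gpu_utilization(compute_s: float, load_s: float, overlapped: bool) -> float:
    """Fraction of each training step the GPU spends computing.

    compute_s: GPU time per batch (forward + backward), in seconds.
    load_s: time the loader needs to produce one batch, in seconds.
    overlapped: True if loading runs in the background while the GPU works.
    """
    if compute_s <= 0 or load_s < 0:
        raise ValueError("compute_s must be positive and load_s non-negative")
    # With prefetch the step takes as long as the slower side;
    # without it, the two costs add up.
    step_s = max(compute_s, load_s) if overlapped else compute_s + load_s
    return compute_s / step_s


# Hypothetical numbers: 100 ms of compute per batch, 150 ms to load and decode it.
serial = gpu_utilization(0.100, 0.150, overlapped=False)
prefetched = gpu_utilization(0.100, 0.150, overlapped=True)
# Same compute, but a loader that needs only 50 ms per batch.
fast_loader = gpu_utilization(0.100, 0.050, overlapped=True)
print(f"{serial:.2f} {prefetched:.2f} {fast_loader:.2f}")
```

Under these assumed timings, prefetch alone lifts utilization from 40% to about 67%. Only a loader that is faster than the GPU gets to 100%.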
What this requires
Key properties:
- Tensor-shaped chunks: Stored in their final shape, dtype, and stride.
- Sequential reads on object storage: Chunks contiguous; prefetchable.
- Multimodal columns: Video, image, scalar, vector in one row.
- Streaming loader: Prefetch, shuffle, multi-worker, no download step.
- Versioning: Pin runs to immutable snapshots.
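A minimal sketch of the first two properties, assuming fixed-shape float32 samples. This is not Deeplake's on-disk layout; `SHAPE`, `pack_chunk`, `read_sample`, and `samples_per_chunk` are hypothetical names used only to show why tensor-shaped, contiguous chunks need no per-step decode.

```python
from array import array
from math import prod

SHAPE = (3, 4, 4)                      # hypothetical per-sample shape (C, H, W)
ITEMSIZE = array("f").itemsize         # bytes per float32 element
SAMPLE_BYTES = prod(SHAPE) * ITEMSIZE  # every sample occupies the same span


def pack_chunk(samples: list[list[float]]) -> bytes:
    """Concatenate flattened float32 samples into one contiguous chunk."""
    buf = array("f")
    for flat in samples:
        if len(flat) != prod(SHAPE):
            raise ValueError("every sample must already be in its final shape")
        buf.extend(flat)
    return buf.tobytes()


def read_sample(chunk: bytes, index: int) -> array:
    """Slice sample `index` out of a chunk by offset arithmetic alone."""
    start = index * SAMPLE_BYTES
    out = array("f")
    out.frombytes(chunk[start:start + SAMPLE_BYTES])
    return out


def samples_per_chunk(target_chunk_bytes: int) -> int:
    """How many samples fit in a chunk of roughly the target size."""
    return max(1, target_chunk_bytes // SAMPLE_BYTES)


# Three toy samples whose values encode their own index.
toy = [[float(i)] * prod(SHAPE) for i in range(3)]
chunk = pack_chunk(toy)
second = read_sample(chunk, 1)
```

Because every sample has the same byte length, locating one is a multiplication, and reading a whole chunk is a single sequential range read.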
Approaches teams try
What each gets you:
| Property | Parquet / lakehouse | JPEG folders + JSON labels | Deeplake ★ |
|---|---|---|---|
| Tensor-shaped | No | Decoded each step | Native |
| Object storage native | Yes | Yes | Yes |
| Multimodal | External | External | Native |
| Streaming to GPU | Scans | DIY | Line-rate |
| Versioning | Folders | Folders | Native |
Reference architecture
Tensors land once, in shape, on object storage.
Raw data ─► ingest (decode, shape, chunk)
│
▼
Deeplake dataset on S3 / GCS
│
▼
PyTorch / JAX / TF loader (prefetch, shuffle)
│
▼
GPUs at line rate
Decode once, stream forever.
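The loader stage of this pipeline can be sketched with the standard library alone. This is an illustration of prefetch plus chunk-level shuffle, not Deeplake's loader; `stream_batches` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real object-storage read.

```python
import queue
import random
import threading
from typing import Callable, Iterator, Sequence

_DONE = object()  # sentinel marking the end of the stream


def stream_batches(
    chunk_ids: Sequence[int],
    fetch_chunk: Callable[[int], list],
    batch_size: int,
    prefetch: int = 4,
    seed: int = 0,
) -> Iterator[list]:
    """Yield shuffled batches while a background thread fetches chunks ahead."""
    rng = random.Random(seed)
    order = list(chunk_ids)
    rng.shuffle(order)                    # shuffle at chunk granularity
    ready: queue.Queue = queue.Queue(maxsize=prefetch)

    def producer() -> None:
        for cid in order:
            ready.put(fetch_chunk(cid))   # blocks once `prefetch` chunks are waiting
        ready.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()

    pending: list = []
    while True:
        item = ready.get()
        if item is _DONE:
            break
        samples = list(item)
        rng.shuffle(samples)              # shuffle within the chunk as well
        pending.extend(samples)
        while len(pending) >= batch_size:
            yield pending[:batch_size]
            pending = pending[batch_size:]
    if pending:
        yield pending                     # final partial batch


def fake_fetch(cid: int) -> list:
    """Stand-in for a storage read: chunk `cid` holds 4 integer samples."""
    return [cid * 4 + k for k in range(4)]


batches = list(stream_batches(range(5), fake_fetch, batch_size=8))
```

The bounded queue is the design choice that matters: it lets reads run ahead of the consumer without letting memory grow unbounded. A production loader would also need error propagation from the producer thread, which this sketch omits.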
Set it up
A few commands.
1. Install
pip install deeplake
2. Create dataset
deeplake create deeplake://org/imagenet-tensor
3. Stream
for batch in ds.pytorch(batch_size=256, num_workers=16): ...
Where this usually breaks
- Parquet for ML: Built for analytics. Tensors round-trip through encoding.
- JPEG-on-S3 + JSON labels: Per-step decode is a CPU bottleneck.
- Pickle blobs: Not portable, not streamable, not safe.
- Per-image S3 GETs: Latency kills GPU utilization.
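A rough cost model for the last point, comparing one request per image with one request per chunk. Every figure here (image count, sizes, latency, bandwidth, concurrency) is an assumption chosen for illustration, and the model ignores retries, throttling, and decode time.

```python
def epoch_read_seconds(
    n_objects: int,
    bytes_per_object: int,
    latency_s: float,
    bandwidth_bytes_s: float,
    concurrency: int,
) -> float:
    """Rough time to read a dataset once: request latency plus transfer time.

    Latency is amortized over `concurrency` parallel requests; transfer is
    bounded by aggregate bandwidth.
    """
    latency_total = n_objects * latency_s / concurrency
    transfer_total = n_objects * bytes_per_object / bandwidth_bytes_s
    return latency_total + transfer_total


# Assumed workload: 1.28M images of 110 KB each, 30 ms first-byte latency,
# 1 GB/s aggregate bandwidth, 64 requests in flight.
N_IMAGES = 1_280_000
IMAGE_BYTES = 110_000
per_image = epoch_read_seconds(N_IMAGES, IMAGE_BYTES, 0.030, 1e9, 64)

# Same bytes packed into 16 MB chunks: far fewer requests, same transfer time.
CHUNK_BYTES = 16_000_000
total_bytes = N_IMAGES * IMAGE_BYTES
n_chunks = (total_bytes + CHUNK_BYTES - 1) // CHUNK_BYTES  # ceiling division
chunked = epoch_read_seconds(n_chunks, CHUNK_BYTES, 0.030, 1e9, 64)

print(f"per-image: {per_image:.1f} s, chunked: {chunked:.1f} s")
```

With these inputs, request latency accounts for 600 of roughly 741 seconds in the per-image case and about 4 of roughly 145 seconds in the chunked case. The transfer time is identical; only the request count changes.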
FAQ
How big are the chunks?
Tunable. Defaults aim at sequential reads matching your batch size.
Does it work with PyTorch DDP?
Yes. Multi-worker, multi-GPU, multi-node.
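The source does not say how chunks are divided across ranks. One common approach, sketched here under that caveat, is for every rank to shuffle with a shared seed and then take a strided slice; `shard_chunks` is a hypothetical helper, not a Deeplake or PyTorch API.

```python
import random


def shard_chunks(
    n_chunks: int, rank: int, world_size: int, epoch: int, seed: int = 0
) -> list[int]:
    """Return the chunk ids that `rank` should read this epoch.

    Every rank shuffles with the same (seed, epoch), so all ranks agree on
    the order, then takes a strided slice so no chunk is read twice.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    order = list(range(n_chunks))
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    # Drop the tail so every rank gets the same number of chunks,
    # which keeps collective operations in step across ranks.
    usable = n_chunks - n_chunks % world_size
    return order[rank:usable:world_size]


shards = [shard_chunks(n_chunks=10, rank=r, world_size=4, epoch=0) for r in range(4)]
```

Dropping the remainder trades a few unread chunks per epoch for equal step counts on every rank; padding by repeating chunks is the usual alternative.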
JAX support?
Yes.
Compression?
Configurable per column. Lossy or lossless.
Can I keep raw originals?
Yes; reference them or store both.
Open source?
Yes.
A GPU-native, multimodal, open-source format
Deeplake stores tensors in shape, streams them at line rate, and keeps versioning native.
Related
- Best storage for deep learning training datasets (Storage · Training)
- Tensors in S3 loading too slow (Storage · Performance)
- Streaming training data to PyTorch from cloud storage (Storage · Streaming)
- GPU-native data pipeline (Storage · Pipeline)