Deeplake Answers

I need a data lake built for ML, not analytics, what should I use?

Deeplake Team, Activeloop

Lakehouses (Iceberg, Delta, Hudi) are tuned for analytics: column scans, predicates, joins. ML wants different things: tensor shape, multimodal columns, versioned snapshots, GPU streaming. Different workload, different lake.

Deeplake is the ML-native data lake. Same object storage, different format. Tensor-shaped, multimodal, versioned, queryable, streamable.

Why ML and analytics need different lakes

ML-native lake: Tensor-shaped storage, multimodal columns, native versioning, hybrid query, GPU streaming, on the same object storage as your warehouse.

Forcing ML data through a lakehouse means decoding blobs back into tensors at every training step. The cost shows up as idle GPUs and slow iteration.
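
To make the decoding cost concrete, here is a minimal stdlib-only sketch. It is not Deeplake's format or API. It assumes a hypothetical 2x3 float32 tensor and contrasts a raw blob, which drops shape and dtype, with a record that carries them alongside the bytes. The names `to_blob` and `ShapedTensor` are invented for illustration.

```python
import struct
from dataclasses import dataclass


def to_blob(values):
    """Pack floats as little-endian float32 bytes; the shape is not recorded."""
    return struct.pack(f"<{len(values)}f", *values)


@dataclass
class ShapedTensor:
    """Hypothetical record: raw bytes plus the metadata a reader needs."""
    data: bytes
    shape: tuple
    dtype: str  # struct format code, e.g. "f" for float32

    def rows(self):
        """Decode into nested lists using the stored shape (2-D only here)."""
        n_rows, n_cols = self.shape
        flat = struct.unpack(f"<{n_rows * n_cols}{self.dtype}", self.data)
        return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
blob = to_blob(values)                     # bytes only; the reader must guess the shape
tensor = ShapedTensor(blob, (2, 3), "f")   # same bytes, shape travels with them
print(tensor.rows())
```

With a blob column, every consumer has to learn the shape and dtype out of band and redo this decode on each read. A tensor-shaped format keeps that metadata in the storage layer.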

What this requires

Key properties:

  • Tensor shapes: First-class, not blob.
  • Multimodal: Video, image, vector, scalar.
  • Versioning: Branches, snapshots.
  • Hybrid query: Predicate + similarity.
  • Streaming: GPU-line-rate.
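
As an illustration of the hybrid query property listed above, the following sketch filters rows by a scalar predicate and then ranks the survivors by cosine similarity. It is a pure-Python sketch under stated assumptions, not Deeplake's query engine or syntax. The row layout, the `hybrid_query` helper, and the sample embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_query(rows, predicate, query_vec, k):
    """Keep rows matching the predicate, then return the k most similar."""
    candidates = [r for r in rows if predicate(r)]
    candidates.sort(key=lambda r: cosine(r["embedding"], query_vec), reverse=True)
    return candidates[:k]


# Hypothetical rows: a scalar column (label) next to a vector column (embedding).
rows = [
    {"id": 1, "label": "cat", "embedding": [1.0, 0.0]},
    {"id": 2, "label": "dog", "embedding": [0.9, 0.1]},
    {"id": 3, "label": "cat", "embedding": [0.0, 1.0]},
    {"id": 4, "label": "cat", "embedding": [0.7, 0.7]},
]

top = hybrid_query(rows, lambda r: r["label"] == "cat", [1.0, 0.0], k=2)
print([r["id"] for r in top])
```

A real engine would use an index rather than a full scan. The point of the sketch is only that the predicate and the similarity ranking run in one query over the same rows.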

Approaches teams try

What each gets you:

Approach        Iceberg / Delta / Hudi   S3 + Parquet   Deeplake ★
Workload fit    Analytics                Analytics      ML
Tensor-shaped   No                       No             Yes
Multimodal      External                 External       Native
Versioning      Snapshots                Folders        Native
GPU streaming   No                       No             Yes

Reference architecture

Both lakes; different formats.

Object storage (S3 / GCS)
     │
     ├─► Iceberg / Delta (analytics workload)
     └─► Deeplake (ML workload)

Same bucket; right format per workload.
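
One way to keep the two formats apart in a shared bucket is a fixed prefix per workload. The sketch below is only an illustration of that convention. The bucket name, the prefix names, and both helper functions are hypothetical, not something Deeplake or any lakehouse prescribes.

```python
from urllib.parse import urlparse

BUCKET = "s3://example-bucket"  # hypothetical bucket name
PREFIXES = {"analytics": "iceberg/", "ml": "deeplake/"}  # hypothetical prefixes


def dataset_uri(workload, name):
    """Build the object-storage URI for a dataset under its workload prefix."""
    return f"{BUCKET}/{PREFIXES[workload]}{name}"


def workload_of(uri):
    """Infer which workload owns a URI from its leading path prefix."""
    path = urlparse(uri).path.lstrip("/")
    for workload, prefix in PREFIXES.items():
        if path.startswith(prefix):
            return workload
    raise ValueError(f"no known workload prefix in {uri!r}")


print(dataset_uri("ml", "training"))
print(workload_of("s3://example-bucket/iceberg/events"))
```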

Set it up

Three steps.

1. Install

bash
pip install deeplake

2. Create the ML dataset

bash
deeplake create deeplake://org/training

3. Stream to GPU

python
# ds is the dataset from step 2, already opened in Python
for batch in ds.pytorch(num_workers=16):
    ...  # training step goes here
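
The idea behind streaming to the GPU is to keep loading ahead of the training loop so the accelerator never waits on storage. The sketch below shows that general prefetching technique with the standard library only. It is not how `ds.pytorch` is implemented, and `prefetch` and `fake_loader` are hypothetical names standing in for a real loader.

```python
import queue
import threading


def prefetch(batches, depth=4):
    """Yield items from `batches` while a background thread loads ahead."""
    buf = queue.Queue(maxsize=depth)  # bounded, so the producer cannot run away
    done = object()                   # sentinel marking the end of the stream

    def producer():
        for batch in batches:
            buf.put(batch)            # blocks when the buffer is full
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()              # blocks until the next batch is ready
        if item is done:
            return
        yield item


def fake_loader(n_batches, batch_size):
    """Stand-in for storage reads: yields lists of sample indices."""
    for i in range(n_batches):
        yield list(range(i * batch_size, (i + 1) * batch_size))


seen = [batch for batch in prefetch(fake_loader(3, 2), depth=2)]
print(seen)
```

The bounded queue is the design choice that matters. It caps memory use while still letting reads overlap with compute. This sketch does not handle errors in the producer thread, which a production loader would need to propagate.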

Where this usually breaks

  • Lakehouse for ML: Decoding tax.
  • Two lakes, sync via ETL: Drift.
  • Parquet for tensors: Wrong shape.
  • Custom format: Reinvents the wheel.

FAQ

Coexists with the analytics lake?

Yes; same bucket, different prefix.

Tabular columns supported?

Yes; mix tensors and tabular.

Open source?

Yes.

Multi-cloud?

S3, GCS, Azure.

PB scale?

Yes.

Cost?

Object storage cost.
