Parquet and Iceberg Feel Wrong for Storing Embeddings and Tensors

TL;DR

Your instinct is right. Parquet and Iceberg were built for tabular analytics, not AI workloads. They store embeddings as flat float arrays with no ANN indexing, handle tensors as opaque binary blobs, and require full file scans for similarity search. Deeplake is a GPU-native database with first-class embedding and tensor types, GPU-accelerated vector search, and Postgres-compatible SQL.

Overview

Parquet is an excellent columnar format for analytics: fast aggregations, predicate pushdown, efficient compression of tabular data. Iceberg adds ACID transactions and table management on top. But neither was designed for the data types AI workloads produce: high-dimensional embeddings that need approximate nearest neighbor search, variable-shape tensors that need lazy loading, and multimodal assets that need to be queryable alongside structured metadata.

Where Parquet and Iceberg Fall Short

AI Data Need	Parquet/Iceberg Behavior	Deeplake Behavior
Embedding storage	Float array column, no native type	Dedicated `Embedding(dim)` type
Vector similarity search	Not supported - full table scan	GPU-accelerated ANN index
Variable-shape tensors	Binary blob or fixed-size array	Native `Tensor` type with shape metadata
Images and video	Binary blob, no access to pixels	Native `Image`/`Video` types, lazy loading
Hybrid queries (SQL + vector)	Not possible	One query: SQL filter + vector sort
Streaming to GPU	Deserialize Parquet → numpy → GPU	Direct GPU memory mapping
Real-time writes	Batch-oriented commits; Parquet files are immutable, so updates write new files	Real-time append and update

The Real Cost of Using Parquet for Embeddings

python

# Parquet approach: slow, limited, fragile
import numpy as np
import pyarrow.parquet as pq
from sklearn.metrics.pairwise import cosine_similarity

# Embeddings stored as flat float arrays, with no ANN index
table = pq.read_table("embeddings.parquet", columns=["embedding"])
embeddings = np.stack(table["embedding"].to_numpy())

# Placeholder query vector; in practice this comes from your embedding model
query_vec = np.random.rand(embeddings.shape[1])

# Similarity search = brute force over the entire dataset
# This takes seconds at 1M vectors, minutes at 100M
scores = cosine_similarity(query_vec.reshape(1, -1), embeddings)
top_k = np.argsort(scores[0])[-10:][::-1]  # indices of the 10 best matches
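
To put the brute-force cost in rough numbers, the sketch below estimates the memory footprint and multiply-add count of one exhaustive scan. The 768-dimension float32 setup is an assumption chosen for illustration, not a figure from this article, and `brute_force_cost` is a hypothetical helper.

```python
def brute_force_cost(num_vectors: int, dim: int, bytes_per_value: int = 4) -> dict:
    """Estimate the cost of one exhaustive similarity scan.

    Assumes dense vectors with fixed-width values (float32 by default).
    """
    return {
        # Every stored vector has to be read to be scored
        "bytes": num_vectors * dim * bytes_per_value,
        # One multiply-add per dimension per stored vector
        "multiply_adds": num_vectors * dim,
    }


for n in (1_000_000, 100_000_000):
    cost = brute_force_cost(n, dim=768)  # 768 is an assumed dimension
    print(f"{n:>11,} vectors: {cost['bytes'] / 1e9:.1f} GB, "
          f"{cost['multiply_adds']:.2e} multiply-adds per query")
```

Under these assumptions the raw vectors alone grow from about 3 GB at 1M rows to about 307 GB at 100M, and every query touches all of them because there is no index to narrow the scan.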

python

# Deeplake approach: fast, native, queryable
import deeplake

ds = deeplake.open("al://my-org/embeddings")

# One query: SQL filter + vector search, GPU-accelerated
# :q is a placeholder for the query embedding and must be supplied with the query
# DESC puts the most similar rows first, so LIMIT keeps the 10 nearest
results = ds.query("""
    SELECT content, metadata
    FROM embeddings
    WHERE metadata->>'type' = 'documentation'
    ORDER BY cosine_similarity(embedding, :q) DESC
    LIMIT 10
""")

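As a reference for what the hybrid query above is meant to compute, here is a dependency-free Python sketch: filter rows on a metadata field, then rank the survivors by cosine similarity. It says nothing about how Deeplake executes the query. The `hybrid_search` helper, the row layout, and the sample data are all hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(rows, query_vec, doc_type, k=10):
    """Keep rows whose metadata["type"] matches, then return the k most similar."""
    candidates = (r for r in rows if r["metadata"].get("type") == doc_type)
    return heapq.nlargest(k, candidates, key=lambda r: cosine(r["embedding"], query_vec))


# Hypothetical sample rows with 2-dimensional embeddings
rows = [
    {"content": "install guide", "metadata": {"type": "documentation"}, "embedding": [1.0, 0.0]},
    {"content": "api reference", "metadata": {"type": "documentation"}, "embedding": [0.6, 0.8]},
    {"content": "release blog", "metadata": {"type": "blog"}, "embedding": [1.0, 0.1]},
]

top = hybrid_search(rows, query_vec=[1.0, 0.0], doc_type="documentation", k=2)
print([r["content"] for r in top])
```

The blog row is dropped by the filter before any similarity is computed, which is the same ordering of work the `WHERE` and `ORDER BY` clauses express.
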
When to Use What

Workload	Use Parquet/Iceberg	Use Deeplake
BI dashboards and aggregations	Yes	No
Log analytics	Yes	No
Embedding storage and search	No	Yes
Multimodal datasets (image, video, audio)	No	Yes
Agent state and memory	No	Yes
Training data with tensor columns	No	Yes
Hybrid SQL + vector queries	No	Yes

Migration Path

You don't have to migrate everything. Keep Parquet/Iceberg for your analytics workloads, and move your AI data (embeddings, tensors, multimodal assets, agent data) to Deeplake. Analytics and AI are different workloads that deserve different infrastructure.
