Deeplake Answers

Parquet and Iceberg Feel Wrong for Storing Embeddings and Tensors

Deeplake Team, Activeloop

TL;DR

Your instinct is right. Parquet and Iceberg were built for tabular analytics, not AI workloads. They store embeddings as flat float arrays with no ANN indexing, handle tensors as opaque binary blobs, and require full file scans for similarity search. Deeplake is a GPU-native database with first-class embedding and tensor types, GPU-accelerated vector search, and Postgres-compatible SQL.

Overview

Parquet is an excellent columnar format for analytics: fast aggregations, predicate pushdown, efficient compression of tabular data. Iceberg adds ACID transactions and table management on top. But neither was designed for the data types AI workloads produce: high-dimensional embeddings that need approximate nearest neighbor search, variable-shape tensors that need lazy loading, and multimodal assets that need to be queryable alongside structured metadata.
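To make the tensor point concrete, here is a minimal sketch of what "shape metadata" buys you when a tensor would otherwise be an opaque blob. The byte layout below (a small shape header followed by float32 values) and the helper names `pack_tensor`, `read_shape`, and `unpack_tensor` are hypothetical, chosen only for illustration; this is not Parquet's, Iceberg's, or Deeplake's actual on-disk format.

```python
import struct

def pack_tensor(values, shape):
    """Serialize a flat list of floats plus its shape into one bytes blob."""
    count = 1
    for dim in shape:
        count *= dim
    if count != len(values):
        raise ValueError("shape does not match number of values")
    # Header: number of dimensions, then each dimension (little-endian uint32)
    header = struct.pack("<I", len(shape)) + struct.pack(f"<{len(shape)}I", *shape)
    # Payload: the values as little-endian float32
    body = struct.pack(f"<{len(values)}f", *values)
    return header + body

def read_shape(blob):
    """Read only the shape header, without decoding the payload."""
    (ndim,) = struct.unpack_from("<I", blob, 0)
    return struct.unpack_from(f"<{ndim}I", blob, 4)

def unpack_tensor(blob):
    """Decode the full tensor: flat values plus the shape needed to interpret them."""
    shape = read_shape(blob)
    count = 1
    for dim in shape:
        count *= dim
    offset = 4 + 4 * len(shape)
    values = list(struct.unpack_from(f"<{count}f", blob, offset))
    return values, shape

# Two tensors of different shapes can share one column of blobs
blob_a = pack_tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], (2, 3))
blob_b = pack_tensor([7.0, 8.0], (2, 1))
shape_a = read_shape(blob_a)   # cheap: touches only the header
values_a, _ = unpack_tensor(blob_a)
```

Without the header, a reader sees only bytes and has to learn the shape from somewhere else. With it, a reader can inspect shapes cheaply and defer decoding the payload, which is the idea behind lazy loading.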

Where Parquet and Iceberg Fall Short

| AI Data Need | Parquet/Iceberg Behavior | Deeplake Behavior |
| --- | --- | --- |
| Embedding storage | Float array column, no native type | Dedicated Embedding(dim) type |
| Vector similarity search | No built-in index; search is a full table scan | GPU-accelerated ANN index |
| Variable-shape tensors | Binary blob or fixed-size array | Native Tensor type with shape metadata |
| Images and video | Binary blob, no access to pixels | Native Image/Video types, lazy loading |
| Hybrid queries (SQL + vector) | Not built in; needs an external engine for the vector part | One query: SQL filter + vector sort |
| Streaming to GPU | Deserialize Parquet → numpy → GPU | Direct GPU memory mapping |
| Real-time writes | Batch-oriented; Parquet files are immutable, so changes write new files | Real-time append and update |
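The difference between a full scan and an ANN index in the table above can be illustrated with a toy example. The sketch below uses random-hyperplane locality-sensitive hashing with a single hash table, written in plain Python. It is an assumption-laden illustration of the general idea (score only a small candidate bucket instead of every row), not the index Deeplake uses, and the class name `HyperplaneIndex` is hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    norm = (dot(a, a) ** 0.5) * (dot(b, b) ** 0.5)
    return dot(a, b) / norm if norm else 0.0

class HyperplaneIndex:
    """Toy ANN index: random-hyperplane LSH with a single hash table."""

    def __init__(self, dim, n_planes=8, seed=0):
        rng = random.Random(seed)
        # Each random hyperplane contributes one bit to the bucket key
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
        self.buckets = {}
        self.vectors = []

    def _key(self, vec):
        # Vectors pointing in similar directions tend to fall on the same
        # side of most hyperplanes, so they tend to share a key
        return tuple(dot(plane, vec) >= 0 for plane in self.planes)

    def add(self, vec):
        self.vectors.append(vec)
        self.buckets.setdefault(self._key(vec), []).append(len(self.vectors) - 1)

    def search(self, query, k=10):
        # Score only the rows in the query's bucket, not the whole dataset
        candidates = self.buckets.get(self._key(query), [])
        ranked = sorted(candidates, key=lambda i: cosine(query, self.vectors[i]), reverse=True)
        return ranked[:k], len(candidates)

# Build an index over 2,000 random 16-dimensional vectors
rng = random.Random(1)
index = HyperplaneIndex(dim=16)
for _ in range(2000):
    index.add([rng.gauss(0, 1) for _ in range(16)])

# Query with a stored vector: it should come back as its own best match
top, scanned = index.search(index.vectors[42], k=5)
```

The trade-off is the usual one for approximate search: a true neighbor that hashes to a different bucket is missed, which production indexes mitigate with multiple tables, probing, or graph-based structures. A flat float column in a Parquet file gives you none of this, so every query falls back to scoring all rows.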

The Real Cost of Using Parquet for Embeddings

```python
# Parquet approach: slow, limited, fragile
import numpy as np
import pyarrow.parquet as pq
from sklearn.metrics.pairwise import cosine_similarity

# Embeddings stored as plain float arrays, with no ANN index
table = pq.read_table("embeddings.parquet")
embeddings = np.stack(table["embedding"].to_numpy())

# Placeholder query: replace with the embedding of your actual query
query_vec = embeddings[0]

# Similarity search = brute force over the entire dataset
# This can take seconds at 1M vectors, minutes at 100M
scores = cosine_similarity(query_vec.reshape(1, -1), embeddings)
top_k = np.argsort(scores[0])[-10:][::-1]  # indices of the 10 most similar rows
```
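The "seconds at 1M vectors, minutes at 100M" estimate for the brute-force path can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes 768-dimensional float32 embeddings and an effective read-and-score throughput of 1 GB/s. Both numbers are illustrative assumptions, not measurements, and the function names are hypothetical.

```python
def scan_bytes(n_vectors, dim, bytes_per_value=4):
    """Bytes that a full scan must read: every value of every vector."""
    return n_vectors * dim * bytes_per_value

def scan_seconds(n_vectors, dim, bandwidth_gb_s, bytes_per_value=4):
    """Lower bound on scan time at a given effective throughput (GB/s)."""
    return scan_bytes(n_vectors, dim, bytes_per_value) / (bandwidth_gb_s * 1e9)

DIM = 768             # assumed embedding dimension
BANDWIDTH_GB_S = 1.0  # assumed effective throughput

small = scan_seconds(1_000_000, DIM, BANDWIDTH_GB_S)    # about 3 seconds
large = scan_seconds(100_000_000, DIM, BANDWIDTH_GB_S)  # about 307 seconds, roughly 5 minutes
```

Because the cost is linear in the number of vectors, every query pays it again in full. An index changes the shape of that curve rather than the constant in front of it.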
```python
# Deeplake approach: fast, native, queryable
import deeplake

ds = deeplake.open("al://my-org/embeddings")

# One query: SQL filter + vector search, GPU-accelerated
# :q stands for the query embedding; supply it before running the query
# DESC puts the most similar rows first
results = ds.query("""
    SELECT content, metadata
    FROM embeddings
    WHERE metadata->>'type' = 'documentation'
    ORDER BY cosine_similarity(embedding, :q) DESC
    LIMIT 10
""")
```

When to Use What

| Workload | Use Parquet/Iceberg | Use Deeplake |
| --- | --- | --- |
| BI dashboards and aggregations | Yes | No |
| Log analytics | Yes | No |
| Embedding storage and search | No | Yes |
| Multimodal datasets (image, video, audio) | No | Yes |
| Agent state and memory | No | Yes |
| Training data with tensor columns | No | Yes |
| Hybrid SQL + vector queries | No | Yes |

Migration Path

You don't have to migrate everything. Keep Parquet/Iceberg for your analytics workloads, and move your AI data (embeddings, tensors, multimodal assets, agent data) to Deeplake. They're different workloads that deserve different infrastructure.
