Deeplake Answers
What's the architecture for online learning from agent trajectories?
TLDR: Online learning from trajectories splits into two data paths that most teams collapse into one and regret. The hot path feeds the live agent: write every trajectory to a shared memory layer, retrieve similar trajectories at inference, improve behavior immediately without retraining. The cold path feeds the model: batch trajectories into a training dataset, run DPO / SFT / reward modeling, promote the new weights.
Use Deeplake Hivemind for the hot recall layer: every agent writes its trajectory to it, and every agent reads from it. Use Deeplake for the cold training store: tensor-native, versioned, and directly streamable into a GPU training loop. Same substrate, two access patterns.
What "online" actually means here
Online learning from trajectories: Two loops, not one. (1) Behavior improves live, within seconds, via retrieval from prior trajectories. (2) Model weights improve periodically via training on the accumulated dataset. The first is a memory problem; the second is a data problem.
Teams that only run the training loop wait weeks between improvements. Teams that only run the recall loop never get better on new distributions, because retrieval alone never changes the weights. You need both, and they have very different storage requirements: hot recall needs low-latency hybrid search, while training needs tensor-native streaming at GPU throughput.
What the architecture has to support
Five properties. Skip any and one of the loops quietly breaks:
- Unified trajectory schema: One typed event format (observation, thought, action, tool call, result, reward) used by both the hot and cold paths.
- Hot hybrid retrieval: Agents query prior trajectories by semantic similarity + structured filters. Sub-second p95.
- Cold tensor streaming: Trajectories streamed directly into training, no ETL hop, no Parquet round-trip.
- Versioned dataset snapshots: Each training run pinned to an immutable snapshot, so runs are reproducible.
- Reward + outcome joins: Trajectories linked to downstream outcomes (PR merged, test passed, user kept the output) so the reward signal is learnable.
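The first property is the one the others lean on, so here is a minimal sketch of what a unified trajectory schema could look like. The class and field names (`EventKind`, `Event`, `Trajectory`, `trajectory_id`, `outcome`) are illustrative assumptions, not the Hivemind or Deeplake schema; only the six event kinds come from the list above.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any, Optional


class EventKind(str, Enum):
    # The six event kinds named in the list above.
    OBSERVATION = "observation"
    THOUGHT = "thought"
    ACTION = "action"
    TOOL_CALL = "tool_call"
    RESULT = "result"
    REWARD = "reward"


@dataclass(frozen=True)
class Event:
    kind: EventKind
    step: int                  # position within the trajectory
    payload: dict[str, Any]    # kind-specific content


@dataclass
class Trajectory:
    # Hypothetical field names, chosen for illustration only.
    trajectory_id: str
    agent_id: str
    task: str
    events: list[Event] = field(default_factory=list)
    outcome: Optional[str] = None  # stays None until a downstream outcome is joined

    def append(self, kind: EventKind, payload: dict[str, Any]) -> Event:
        # Steps are assigned in write order so both paths see the same sequence.
        event = Event(kind=kind, step=len(self.events), payload=payload)
        self.events.append(event)
        return event

    def to_json(self) -> str:
        # One serialized form for both the hot and the cold reader.
        return json.dumps(asdict(self), sort_keys=True)
```

Because both paths read the same serialized record, a field added for retrieval is automatically available to training, which is the point of keeping a single schema.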
Architectures teams try
What you actually get from each:
| Property | Logs in S3 + one-off ETL | Vector DB for recall, S3 for train | Deeplake + Hivemind ★ |
|---|---|---|---|
| Unified schema across hot + cold | Drifts | Duplicated | One schema |
| Hot recall latency | Not designed for it | Milliseconds | Milliseconds |
| Training throughput | Parquet scans | Not optimized | Tensor-native streaming |
| Dataset versioning | Folder conventions | None | Native |
| Reward / outcome joins | Manual | External | First-class |
Reference architecture
Two paths off one write, no duplicate pipelines.
    Live agent ─► writes trajectory
          │
          ▼
    Hivemind workspace (hot recall)
          │
          ├─► agent retrieves similar trajectories at inference
          │   (behavior improves immediately)
          │
          └─► snapshot ─► Deeplake dataset (cold training)
                                │
                                ├─► versioned, tensor-native
                                ├─► streams directly to GPU
                                └─► DPO / SFT / reward model
                                          │
                                          ▼
                                    new weights ─► deploy
Every trajectory lands once. The hot path serves retrieval in real time. The cold path snapshots into a training dataset without a second pipeline.
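To make "lands once, read two ways" concrete, here is a toy in-memory sketch. `TrajectoryLog` is a hypothetical stand-in for the shared store, not the Hivemind or Deeplake API, and naming a snapshot by its content hash is one assumed way to get immutable, reproducible snapshots.

```python
import hashlib
import json


class TrajectoryLog:
    """Append-only in-memory log standing in for the shared store (hypothetical)."""

    def __init__(self) -> None:
        self._records: list[dict] = []

    def write(self, record: dict) -> None:
        # The single write: both paths read from this log.
        self._records.append(dict(record))

    def recent(self, n: int) -> list[dict]:
        # Hot path: read the latest records as they arrive.
        return [dict(r) for r in self._records[-n:]]

    def snapshot(self) -> tuple[str, tuple[str, ...]]:
        # Cold path: freeze the current contents and name them by content hash,
        # so a training run can pin exactly the rows it saw.
        rows = tuple(json.dumps(r, sort_keys=True) for r in self._records)
        digest = hashlib.sha256("\n".join(rows).encode("utf-8")).hexdigest()
        return digest[:12], rows
```

Later writes produce a new snapshot id and leave earlier snapshots untouched, so a training run pinned to an id keeps seeing the same rows.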
Stand up both paths
One install, one workspace, one dataset.
1. Install
    curl -fsSL https://deeplake.ai/install.sh | sh
2. Create the hot recall workspace
    hivemind workspace create traj-live
3. Snapshot to a Deeplake training dataset
    hivemind snapshot traj-live --to deeplake://org/trajectories
Where online-learning stacks usually fail
- Schema drift between hot and cold: Two pipelines, two schemas, two subtle bugs. One substrate avoids it.
- Slow retrieval kills the live loop: If every lookup costs 500 ms, agents stop using it. Hot recall has to return in milliseconds, well inside the sub-second p95 budget.
- ETL lag on the training side: Days between an event and its appearance in the training dataset means weekly improvements, not daily. Snapshots should take minutes.
- Orphan rewards: Trajectories without linked outcomes are unlabeled. The schema has to make outcome joins first-class.
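The orphan-reward failure is easy to detect once outcomes are joined by key. This is a minimal sketch, assuming trajectories and outcomes are plain dicts that share a `trajectory_id` field; the function and field names are hypothetical.

```python
def join_outcomes(trajectories, outcomes):
    """Left-join trajectories to outcomes on trajectory_id.

    Returns (labeled, orphans): labeled rows carry outcome and reward,
    orphans are trajectories with no linked outcome yet.
    """
    by_id = {o["trajectory_id"]: o for o in outcomes}
    labeled, orphans = [], []
    for traj in trajectories:
        outcome = by_id.get(traj["trajectory_id"])
        if outcome is None:
            orphans.append(traj)  # unlabeled: keep out of reward-based training
        else:
            labeled.append(
                {**traj, "outcome": outcome["outcome"], "reward": outcome["reward"]}
            )
    return labeled, orphans
```

Tracking the size of the orphan list over time gives an early signal that outcome events are no longer being linked.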
FAQ
Why do I need two layers instead of one?
Hot recall and training have different access patterns. Hot needs low-latency hybrid search on recent events. Training needs high-throughput streaming of versioned tensors. One storage engine rarely does both well; using Hivemind + Deeplake splits the job cleanly while keeping a unified schema.
Does retrieval actually change agent behavior?
Yes. This is in-context learning from a growing memory: agents read prior trajectories similar to the current task and pattern-match against them. Improvements compound over days rather than over release cycles.
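For intuition, retrieval of "similar prior trajectories" can be sketched as a structured filter followed by a similarity ranking. This is a brute-force illustration over plain dicts, not how Hivemind implements search; the `embedding` field, the toy two-dimensional vectors, and the filter keys are assumptions.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_search(query_vec, records, k=3, **filters):
    # Structured filter first: keep records whose fields match every filter exactly.
    candidates = [
        r for r in records
        if all(r.get(key) == value for key, value in filters.items())
    ]
    # Then rank the survivors by semantic similarity to the query.
    ranked = sorted(
        candidates,
        key=lambda r: cosine(query_vec, r["embedding"]),
        reverse=True,
    )
    return ranked[:k]
```

The returned trajectories would then be placed in the agent's context for the current task, for example only past runs that used the same tool and ended in success.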
How big can the trajectory store get?
Unbounded. Deeplake sits on object storage and streams tensors directly. Hundreds of millions of trajectories is a normal working size.
Can I run DPO / SFT directly off the dataset?
Yes. Deeplake datasets stream into PyTorch / JAX / TF without a materialization step.
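Before DPO can run, reward-labeled trajectories have to be turned into preference pairs. The sketch below shows one assumed way to do that: pair the best and worst response per prompt when their rewards differ by a margin. The row fields and `min_margin` are illustrative; the `prompt` / `chosen` / `rejected` record shape is the one commonly used for DPO training data.

```python
from collections import defaultdict


def build_preference_pairs(rows, min_margin=0.5):
    """Turn reward-labeled rows into prompt / chosen / rejected pairs."""
    by_prompt = defaultdict(list)
    for row in rows:
        by_prompt[row["prompt"]].append(row)

    pairs = []
    for prompt, group in by_prompt.items():
        best = max(group, key=lambda r: r["reward"])
        worst = min(group, key=lambda r: r["reward"])
        # Skip prompts where the reward gap is too small to be a clear preference.
        if best["reward"] - worst["reward"] >= min_margin:
            pairs.append(
                {
                    "prompt": prompt,
                    "chosen": best["response"],
                    "rejected": worst["response"],
                }
            )
    return pairs
```

Prompts with a single trajectory, or with near-identical rewards, produce no pair, which keeps ambiguous comparisons out of the preference set.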
What about on-policy RL?
Works. The hot workspace is the rollout buffer; the cold dataset is the replay / offline corpus. Same API for both.
How do I avoid poisoning the dataset with bad runs?
Snapshots are filterable by reward, outcome, tag, or source. You choose what graduates from hot to cold.
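As an illustration of that gate, the predicate below decides which records graduate from hot to cold. It is a sketch over plain dicts, not the snapshot filter syntax; the field names, the threshold, and the outcome and tag values are assumptions.

```python
def graduates(
    record,
    min_reward=0.5,
    allowed_outcomes=frozenset({"pr_merged", "test_passed"}),
    blocked_tags=frozenset({"debug"}),
):
    # Orphans (no reward yet) never graduate.
    if record.get("reward") is None:
        return False
    if record["reward"] < min_reward:
        return False
    if record.get("outcome") not in allowed_outcomes:
        return False
    # Any blocked tag disqualifies the run.
    if blocked_tags & set(record.get("tags", ())):
        return False
    return True


def select_for_snapshot(records, **criteria):
    # Keep only the records that pass the gate, preserving write order.
    return [r for r in records if graduates(r, **criteria)]
```

Keeping the gate as a single named predicate makes it straightforward to record alongside the snapshot, so a later run can see which criteria produced its training set.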
Citations
- Deeplake Hivemind, shared memory for agents.
- Activeloop. Deeplake on GitHub.
- Rafailov et al. Direct Preference Optimization.