What's the architecture for online learning from agent trajectories?

TLDR: Online learning from trajectories splits into two data paths that most teams collapse into one and regret. The hot path feeds the live agent: write every trajectory to a shared memory layer, retrieve similar trajectories at inference, improve behavior immediately without retraining. The cold path feeds the model: batch trajectories into a training dataset, run DPO / SFT / reward modeling, promote the new weights.

Use Deeplake Hivemind for the hot recall layer: every agent writes its trajectory to it, and every agent reads from it. Use Deeplake for the cold training store: tensor-native, versioned, and directly streamable into a GPU training loop. Same substrate, two access patterns.

What "online" actually means here

Online learning from trajectories: Two loops, not one. (1) Behavior improves live, within seconds, via retrieval from prior trajectories. (2) Model weights improve periodically via training on the accumulated dataset. The first is a memory problem; the second is a data problem.

Teams that only do the training loop wait weeks between improvements. Teams that only do the recall loop never get better on new distributions. You need both, and they have very different storage requirements: hot recall needs low-latency hybrid search, while training needs tensor-native streaming at GPU throughput.

What the architecture has to support

Five properties. Skip any and one of the loops quietly breaks:

Unified trajectory schema: One typed event format (observation, thought, action, tool call, result, reward) used by both the hot and cold paths.
Hot hybrid retrieval: Agents query prior trajectories by semantic similarity + structured filters. Sub-second p95.
Cold tensor streaming: Trajectories streamed directly into training, no ETL hop, no Parquet round-trip.
Versioned dataset snapshots: Each training run pinned to an immutable snapshot, so runs are reproducible.
Reward + outcome joins: Trajectories linked to downstream outcomes (PR merged, test passed, user kept the output) so the reward signal is learnable.
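To make the first and last properties concrete, here is a minimal sketch of what a unified, typed trajectory record with a late-arriving outcome join could look like. It is plain Python and not the Hivemind or Deeplake schema; the class names, field names, and the `join_outcome` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class EventKind(str, Enum):
    """The six event types named in the schema property above."""
    OBSERVATION = "observation"
    THOUGHT = "thought"
    ACTION = "action"
    TOOL_CALL = "tool_call"
    RESULT = "result"
    REWARD = "reward"


@dataclass(frozen=True)
class Event:
    kind: EventKind
    content: str
    step: int  # position of the event inside its trajectory


@dataclass
class Trajectory:
    trajectory_id: str
    task: str
    events: list[Event] = field(default_factory=list)
    outcome: Optional[str] = None   # hypothetical label, e.g. "pr_merged"
    reward: Optional[float] = None  # stays None until an outcome is joined

    def append(self, kind: EventKind, content: str) -> Event:
        # Steps are numbered in write order so both paths see the same sequence.
        event = Event(kind=kind, content=content, step=len(self.events))
        self.events.append(event)
        return event


def join_outcome(traj: Trajectory, outcome: str, reward: float) -> Trajectory:
    """Attach a downstream outcome so the trajectory is no longer unlabeled."""
    traj.outcome = outcome
    traj.reward = reward
    return traj
```

Because the same record type is written once and read by both paths, there is no second schema that can drift.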

Architectures teams try

What you actually get from each:

Property	Logs in S3 + one-off ETL	Vector DB for recall, S3 for train	Deeplake + Hivemind ★
Unified schema across hot + cold	Drifts	Duplicated	One schema
Hot recall latency	Not designed for it	ms	ms
Training throughput	Parquet scans	Not optimized	Tensor-native streaming
Dataset versioning	Folder conventions	None	Native
Reward / outcome joins	Manual	External	First-class

Reference architecture

Two paths off one write, no duplicate pipelines.

Live agent ─► writes trajectory
      │
      ▼
 Hivemind workspace (hot recall)
      │
      ├─► agent retrieves similar trajectories at inference
      │         (behavior improves immediately)
      │
      └─► snapshot ─► Deeplake dataset (cold training)
                       │
                       ├─► versioned, tensor-native
                       ├─► streams directly to GPU
                       └─► DPO / SFT / reward model
                                 │
                                 ▼
                           new weights ─► deploy

Every trajectory lands once. The hot path serves retrieval in real time. The cold path snapshots into a training dataset without a second pipeline.
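The "one write, two paths" idea can be sketched with a small in-memory stand-in. This is an illustration of the access pattern only: `TrajectoryLog` and its methods are hypothetical names, not part of Hivemind or Deeplake, and a real hot path would use hybrid search rather than an exact task match.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class TrajectoryLog:
    """Hypothetical in-memory stand-in for the shared substrate."""
    _records: list[dict] = field(default_factory=list)
    _snapshots: dict[str, tuple] = field(default_factory=dict)

    def write(self, record: dict) -> None:
        # The single write: every trajectory lands here exactly once.
        self._records.append(record)

    def recent(self, task: str, limit: int = 3) -> list[dict]:
        # Hot path: newest records for the same task come first.
        matches = [r for r in reversed(self._records) if r["task"] == task]
        return matches[:limit]

    def snapshot(self, name: str) -> tuple:
        # Cold path: freeze a deep copy so later writes to the log
        # cannot change what a training run was pinned to.
        if name in self._snapshots:
            raise ValueError(f"snapshot {name!r} already exists")
        frozen = tuple(copy.deepcopy(self._records))
        self._snapshots[name] = frozen
        return frozen
```

A training run pinned to `snapshot("v1")` keeps seeing the same records even while agents continue writing to the log.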

Stand up both paths

One install, one workspace, one dataset.

1. Install

bash

curl -fsSL https://deeplake.ai/install.sh | sh

2. Create the hot recall workspace

bash

hivemind workspace create traj-live

3. Snapshot to a Deeplake training dataset

bash

hivemind snapshot traj-live --to deeplake://org/trajectories

Where online-learning stacks usually fail

Schema drift between hot and cold: Two pipelines, two schemas, two subtle bugs. One substrate avoids it.
Slow retrieval kills the live loop: If each retrieval adds 500ms, the latency compounds across a multi-step run and agents stop using it. Hot recall has to stay in the millisecond range.
ETL lag on the training side: Days between an event and its appearance in the training dataset means weekly improvements, not daily. Snapshots should take minutes.
Orphan rewards: Trajectories without linked outcomes are unlabeled. The schema has to make outcome joins first-class.
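Two of these failure modes are easy to monitor. The sketch below computes a nearest-rank p95 over retrieval latencies and the share of trajectories with no linked outcome. The function names and the `outcome` key are assumptions for illustration, not part of any Hivemind or Deeplake API.

```python
def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of retrieval latencies."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, which avoids float rounding surprises.
    rank = -(-95 * len(ordered) // 100)
    return ordered[rank - 1]


def orphan_rate(trajectories: list[dict]) -> float:
    """Fraction of trajectories that have no linked outcome."""
    if not trajectories:
        return 0.0
    orphans = sum(1 for t in trajectories if t.get("outcome") is None)
    return orphans / len(trajectories)
```

Alerting when either number crosses a chosen threshold catches slow retrieval and orphan rewards before they silently stall one of the loops.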

FAQ

Why do I need two layers instead of one?

Hot recall and training have different access patterns. Hot needs low-latency hybrid search on recent events. Training needs high-throughput streaming of versioned tensors. One storage engine rarely does both well; using Hivemind + Deeplake splits the job cleanly while keeping a unified schema.

Does retrieval actually change agent behavior?

Yes. This is in-context learning from a growing memory: agents read prior trajectories similar to the current task and pattern-match against them. Improvements compound over days, not sprints.
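A minimal sketch of that recall step, assuming embeddings are already computed: filter memory by structured fields, rank by cosine similarity, and place the top hits into the prompt. The record keys (`embedding`, `tool`, `reward`, `actions`) and the function names are hypothetical, and this is not the Hivemind query API.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, memory, *, tool=None, min_reward=0.0, k=2):
    """Hybrid recall: structured filters first, then semantic ranking."""
    candidates = [
        m for m in memory
        if m["reward"] >= min_reward and (tool is None or m["tool"] == tool)
    ]
    candidates.sort(key=lambda m: cosine(query_vec, m["embedding"]), reverse=True)
    return candidates[:k]


def build_prompt(task: str, examples: list[dict]) -> str:
    """Put retrieved trajectories in context ahead of the current task."""
    lines = ["Prior trajectories for similar tasks:"]
    for ex in examples:
        lines.append(f"- task: {ex['task']} | actions: {' -> '.join(ex['actions'])}")
    lines.append(f"Current task: {task}")
    return "\n".join(lines)
```

Filtering on reward before ranking keeps the agent from imitating runs that are similar but failed.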

How big can the trajectory store get?

There is no fixed cap. Deeplake sits on object storage and streams tensors directly, so the store grows with the underlying bucket. Hundreds of millions of trajectories is a normal working size.

Can I run DPO / SFT directly off the dataset?

Yes. Deeplake datasets stream into PyTorch / JAX / TF without a materialization step.
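For DPO specifically, the trajectories first have to become preference pairs. The sketch below shows one way to do that, assuming each record carries a `task`, a `response`, and a `reward`; those keys, the `min_gap` threshold, and the function name are illustrative assumptions. The `prompt` / `chosen` / `rejected` triple is the shape DPO trainers commonly expect. No Deeplake loader call is shown here.

```python
from collections import defaultdict
from itertools import combinations


def preference_pairs(trajectories: list[dict], min_gap: float = 0.5) -> list[dict]:
    """Pair same-task trajectories whose rewards differ by at least min_gap."""
    by_task = defaultdict(list)
    for t in trajectories:
        if t.get("reward") is not None:  # unlabeled runs cannot be ranked
            by_task[t["task"]].append(t)

    pairs = []
    for task, group in by_task.items():
        for a, b in combinations(group, 2):
            if abs(a["reward"] - b["reward"]) < min_gap:
                continue  # too close to call a preference
            chosen, rejected = (a, b) if a["reward"] > b["reward"] else (b, a)
            pairs.append({
                "prompt": task,
                "chosen": chosen["response"],
                "rejected": rejected["response"],
            })
    return pairs
```

Tasks with only one labeled trajectory produce no pairs, which is one reason outcome joins matter for the training loop.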

What about on-policy RL?

Works. The hot workspace is the rollout buffer; the cold dataset is the replay / offline corpus. Same API for both.

How do I avoid poisoning the dataset with bad runs?

Snapshots are filterable by reward, outcome, tag, or source. You choose what graduates from hot to cold.
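As an illustration of that gate, here is a small predicate-based filter over trajectory records. It is a sketch under assumed field names (`reward`, `outcome`, `tags`, `source`) and does not reproduce the actual snapshot filter syntax.

```python
from typing import Iterable, Optional


def graduates(
    trajectories: Iterable[dict],
    *,
    min_reward: Optional[float] = None,
    outcomes: Optional[set] = None,
    exclude_tags: frozenset = frozenset(),
    sources: Optional[set] = None,
) -> list[dict]:
    """Return the trajectories allowed to move from hot to cold."""
    kept = []
    for t in trajectories:
        reward = t.get("reward")
        if min_reward is not None and (reward is None or reward < min_reward):
            continue  # unlabeled or low-reward runs stay out
        if outcomes is not None and t.get("outcome") not in outcomes:
            continue
        if exclude_tags & set(t.get("tags", ())):
            continue  # e.g. runs tagged as flaky
        if sources is not None and t.get("source") not in sources:
            continue
        kept.append(t)
    return kept
```

With no filters set, every trajectory graduates, so the defaults are permissive and each filter narrows the set explicitly.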

One substrate for the hot loop and the cold loop

Hivemind for live recall, Deeplake for tensor-native training. Same trajectories, two access patterns.
