Based on the paper “Recursive Multi-Agent Systems”.
A whole multi-agent system is treated as one recursive computation loop: agents are
chained A₁ → A₂ → … → Aₙ → back to A₁ and re-run for several recursion rounds.
Intermediate rounds stay compact (“latent”); only the last agent in the last round decodes the full text answer.
Compared with ordinary text-passing multi-agent systems, the payoff is higher accuracy, fewer tokens, and faster runs.
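The loop described above can be sketched in a few lines. This is a toy illustration, not the demo's or the paper's code: `run_recursive_mas`, the `Latent` alias, and the lambda "agents" are hypothetical names, and a latent is modelled as a plain list of floats. The point is only the control flow, where every hand-off stays compact and text is decoded exactly once, after the last agent of the last round.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a "latent" is a list of floats, and an agent is
# any function that maps an incoming latent to an outgoing latent.
Latent = List[float]
Agent = Callable[[Latent], Latent]


def run_recursive_mas(
    agents: List[Agent],
    decode: Callable[[Latent], str],
    initial: Latent,
    rounds: int,
) -> Tuple[str, int]:
    """Run A1 -> ... -> An for `rounds` rounds and decode text once at the end."""
    state = initial
    latent_steps = 0
    for _ in range(rounds):
        for agent in agents:
            state = agent(state)  # compact hand-off, no text decoding here
            latent_steps += 1
        # The last agent's output wraps around to the first agent.
    return decode(state), latent_steps


# Toy usage: two "agents" and three recursion rounds.
agents = [
    lambda s: [x + 1.0 for x in s],
    lambda s: [x * 2.0 for x in s],
]
answer, steps = run_recursive_mas(
    agents, lambda s: f"answer={s[0]:.1f}", [0.0], rounds=3
)
print(answer, steps)  # answer=14.0 6
```

Six latent hand-offs happen here, but only one call to `decode`, which is where a text-passing system would instead pay for a full generation at every step.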
⚠️ Honest note. Stock browser LLMs only expose input_ids → logits. This demo
instead runs a custom-compiled RecursiveMAS-0.5B whose graph adds get_last_hidden,
so every intermediate agent computes a real last-layer hidden state (the paper's latent thought)
on-device; each step prints its [1×seq×896] tensor and the latent-space routing cosine as evidence.
The recursive loop, round structure, and efficiency mechanism all match the paper; only the final agent
decodes text. Remaining gap: the next agent is still conditioned on the short text thought —
full vector-only conditioning (and trained RecursiveLink weights) is the next step. Run it against the
text-passing baseline and watch the tokens/time gap appear live.
The paper instantiates RecursiveMAS under four agent-collaboration patterns; pick one to run locally, or open Multi-Agent to collaborate with a friend's agent across two browsers over WebRTC.
This demo runs only the custom RecursiveMAS-0.5B backbone — the
self-compiled model whose graph exposes get_last_hidden, so agents exchange a
real last-layer hidden state (latent), not just compressed text.
The number of times the whole agent loop repeats. The paper shows that accuracy and efficiency gains grow with depth (its “scaling law”).
Needs a WebGPU browser (Chrome/Edge 113+). A discrete GPU is recommended for models of 3B parameters or more.
Pick a pattern and press Run — the agent loop animates here as it executes.
The RecursiveLink
A lightweight two-layer residual module is the only thing that's trained — the agents themselves stay frozen.
The inner link feeds an agent's last-layer latent thought back as its own next-step input, deepening reasoning without decoding to text.
The outer link projects one model's latent state into the next (heterogeneous) model's embedding space; it is the bridge that lets different model families collaborate.
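A minimal sketch of what a two-layer residual module can look like, assuming the common form `h + W2 · act(W1 · h)`. The class name `ResidualLink`, the GELU activation, the hidden width, and the zero initialisation of the second layer are all assumptions made for illustration; the text above does not specify them. For simplicity the sketch keeps input and output dimensions equal, which is the inner-link case; a link between models of different widths would also need a projection on the residual path.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def gelu(x: float) -> float:
    """tanh approximation of GELU (assumed activation)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


class ResidualLink:
    """Hypothetical two-layer residual module: h + W2 · gelu(W1 · h)."""

    def __init__(self, dim: int, hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(hidden)]
        # Assumed init: a zero second layer makes the untrained link an identity map.
        self.w2 = [[0.0] * hidden for _ in range(dim)]

    def __call__(self, h: Vector) -> Vector:
        z = [gelu(a) for a in matvec(self.w1, h)]
        delta = matvec(self.w2, z)
        return [x + d for x, d in zip(h, delta)]


link = ResidualLink(dim=4, hidden=8)
h = [0.1, -0.2, 0.3, 0.4]
out = link(h)  # equals h until w2 is trained away from zero
```

Because only `w1` and `w2` would be trained, the parameter count is tiny next to the frozen agents, which is the sense in which the module is "lightweight".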
The two-stage training (in the paper)
- Inner loop — warm-starts each agent's inner link with a cosine-similarity objective so its latent thoughts align with the input-embedding space.
- Outer loop — unrolls the whole system over recursion rounds and back-propagates one shared cross-entropy signal through every outer link, co-optimizing the system as a single entity.
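The two objectives can be written schematically as follows. The notation is assumed for illustration and is not taken from the paper: $h_i$ is agent $i$'s last-layer latent thought, $g_i$ its inner link, $e_i$ a target vector in the input-embedding space, $R$ the number of unrolled recursion rounds, and $p_\theta$ the final agent's token distribution, where $\theta$ collects the outer-link parameters.

```latex
\mathcal{L}_{\text{inner}}
  = 1 - \cos\bigl(g_i(h_i),\, e_i\bigr)
  = 1 - \frac{g_i(h_i)^{\top} e_i}{\lVert g_i(h_i)\rVert \,\lVert e_i\rVert},
\qquad
\mathcal{L}_{\text{outer}}
  = -\sum_{t} \log p_{\theta}\bigl(y_t \mid y_{<t},\, x;\, R\bigr)
```

The first term is a per-agent alignment loss that is zero when the linked latent points in the same direction as its target; the second is a single cross-entropy on the final answer tokens $y_t$, whose gradient reaches every outer link through the unrolled rounds.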
Why latent, not text?
- Efficiency — intermediate agents skip expensive vocabulary-space decoding, replacing it with a cheap latent transform (1.2–2.4× faster, 34.6–75.6% fewer tokens).
- Stable gradients — latent connections avoid the gradient-vanishing that text-mediated recursion suffers from when tokens are confident.
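One standard mechanism consistent with the stable-gradients point is softmax saturation: the softmax Jacobian is `diag(p) - p pᵀ`, and every entry shrinks toward zero as `p` approaches a one-hot vector. Whether this is exactly the paper's argument is an assumption here; the sketch below only shows the numerical effect, with hypothetical helper names.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian_max(logits: List[float]) -> float:
    """Largest |d p_i / d z_j| in the softmax Jacobian J = diag(p) - p p^T."""
    p = softmax(logits)
    n = len(p)
    return max(
        abs(p[i] * ((1.0 if i == j else 0.0) - p[j]))
        for i in range(n)
        for j in range(n)
    )


uncertain = softmax_jacobian_max([0.0, 0.0, 0.0])   # uniform prediction
confident = softmax_jacobian_max([20.0, 0.0, 0.0])  # near one-hot prediction
print(uncertain, confident)
```

With uniform logits the largest entry is 2/9; with one logit 20 units ahead it falls below 1e-8, so almost no gradient passes back through a confident token distribution. A latent hand-off does not go through this softmax at all.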
Reported results
Across 9 benchmarks (math, science, medicine, search, code) and 4 patterns: average +8.3% accuracy, 1.2–2.4× speedup, 34.6–75.6% fewer tokens vs. text-based recursive MAS — with gains widening as recursion depth increases.
🧲 The embedding bus, and what a browser can't do
A faithful latent transfer needs to read a model's last-layer hidden states and feed a vector back in as inputs_embeds. A capability check confirmed that off-the-shelf browser runtimes (WebLLM and transformers.js) expose only input_ids → logits — neither hidden-state output nor embedding input. So this demo offers two pure-browser stand-ins: the default passes a compact text latent thought; the optional embedding memory bus passes real 384-dim vectors and routes between agents by cosine similarity (the closest analogue of the outer link). In both, the receiving LLM ultimately ingests text — that last step is the browser's hard limit.
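Routing by cosine similarity, as the embedding memory bus does, can be sketched as below. The function names, the per-agent "key" vectors, and the 3-dimensional toy data are hypothetical; the demo's bus uses 384-dimensional vectors and its actual routing code is not shown in the text.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def route(thought: List[float], agent_keys: Dict[str, List[float]]) -> Tuple[str, float]:
    """Pick the agent whose key vector is closest in direction to `thought`."""
    best = max(agent_keys, key=lambda name: cosine(thought, agent_keys[name]))
    return best, cosine(thought, agent_keys[best])


# Toy usage with 3-dim vectors standing in for 384-dim embeddings.
keys = {"solver": [1.0, 0.0, 0.0], "critic": [0.0, 1.0, 0.0]}
name, score = route([0.9, 0.1, 0.0], keys)
print(name, round(score, 3))  # solver 0.994
```

The routing decision itself happens in vector space; only the final hand-off to the receiving LLM has to fall back to text, which is the limit described above.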