Recursive multi-agent collaboration with local LLMs — no servers, no API keys, pure WebGPU.

Based on the paper “Recursive Multi-Agent Systems”. A whole multi-agent system is treated as one recursive computation loop: agents are chained A₁ → A₂ → … → Aₙ → back to A₁, and the chain is re-run for several recursion rounds. Intermediate steps stay compact (“latent”); only the last agent in the last round decodes the full text answer. Compared with ordinary text-passing multi-agent systems, the paper reports higher accuracy, fewer tokens, and faster runs.
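The loop described above can be sketched in a few lines. This is a minimal illustration of the schedule only, not the paper's or the demo's implementation: the `Agent` signature, `run_recursive_loop`, and the stub agents are hypothetical names, and the "latent" state is represented as a plain string so the control flow is easy to follow.

```python
from typing import Callable, List, Tuple

# Hypothetical agent type: receives the incoming state plus a flag that says
# whether to decode full text, and returns the outgoing state.
Agent = Callable[[str, bool], str]


def run_recursive_loop(
    agents: List[Agent], task: str, rounds: int
) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Run agents A1..An as a ring for `rounds` passes.

    Every step except the very last one runs in compact ("latent") mode;
    only the last agent of the last round decodes the full answer.
    """
    if not agents or rounds < 1:
        raise ValueError("need at least one agent and at least one round")
    state = task
    trace: List[Tuple[int, int, str]] = []
    for r in range(1, rounds + 1):
        for i, agent in enumerate(agents, start=1):
            is_final = r == rounds and i == len(agents)
            state = agent(state, is_final)
            trace.append((r, i, "decode" if is_final else "latent"))
    return state, trace


def make_stub_agent(name: str) -> Agent:
    # Stand-in for a real model call: it only tags the state so the flow is visible.
    def agent(state: str, decode: bool) -> str:
        mode = "text" if decode else "latent"
        return f"{state}>{name}:{mode}"

    return agent


agents = [make_stub_agent("A1"), make_stub_agent("A2"), make_stub_agent("A3")]
answer, trace = run_recursive_loop(agents, "task", rounds=2)
print(trace[-1])  # (2, 3, 'decode')
```

With three agents and two rounds there are six steps in total: five latent steps followed by a single decode step, which is where the token savings over a text-passing loop would come from.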

⚠️ Honest note. Stock browser LLMs only expose input_ids → logits. This demo instead runs a custom-compiled RecursiveMAS-0.5B whose graph adds get_last_hidden, so every intermediate agent computes a real last-layer hidden state (the paper's latent thought) on-device. As proof, each step prints its [1×seq×896] tensor and the latent-space routing cosine. The recursive loop, round structure, and efficiency mechanism all match the paper, and only the final agent decodes text. Remaining gap: the next agent is still conditioned on the short text thought; full vector-only conditioning (with trained RecursiveLink weights) is the next step. Run it against the text-passing baseline and watch the tokens/time gap appear live.

1 · CHOOSE A COLLABORATION PATTERN

The paper instantiates RecursiveMAS under four agent-collaboration patterns — pick one to run locally. Or open Multi-Agent to collaborate with a friend's agent across two browsers over WebRTC.


2 · AGENT BACKBONE MODEL

This demo runs only the custom RecursiveMAS-0.5B backbone — the self-compiled model whose graph exposes get_last_hidden, so agents exchange a real last-layer hidden state (latent), not just compressed text.

3 · RECURSION ROUNDS

How many times the whole agent loop repeats: 1, 2, 3, or 4 (the page starts at 2). The paper shows accuracy & efficiency gains grow with depth (its “scaling law”).

4 · TASK

Model loads on first run and is cached in your browser afterwards.

Needs a WebGPU browser (Chrome/Edge 113+). Discrete GPU recommended for 3B+.

THE RECURSIVE LOOP

Pick a pattern and press Run — the agent loop animates here as it executes.

Legend: inner link (refines own latent thoughts) · outer link (transfers latent state to next agent) · final decode → text

HOW RECURSIVEMAS WORKS

The RecursiveLink

A lightweight two-layer residual module is the only thing that's trained — the agents themselves stay frozen.

Inner link (within one agent)
R_in(h) = h + W₂·GELU(W₁·h)

Feeds an agent's last-layer latent thought back as its own next-step input — deepening reasoning without decoding to text.

Outer link (across agents)
R_out(h) = W₃·h + W₂·GELU(W₁·h)

Projects one model's latent state into the next (heterogeneous) model's embedding space — the bridge that lets different model families collaborate.
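The two link formulas can be written out directly. The sketch below uses only the Python standard library and random toy weights, so it shows the shapes and the residual structure rather than trained behaviour. Several things here are assumptions: the helper names (`gelu`, `matvec`, `inner_link`, `outer_link`, `random_matrix`), the exact (erf-based) form of GELU, the toy dimensions, and the choice to give each link its own W₁ and W₂ even though the two formulas reuse the same symbols.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major: one row per output dimension


def gelu(x: float) -> float:
    # Exact GELU, x * Phi(x), with Phi the standard normal CDF (assumed variant).
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(w: Matrix, h: Vector) -> Vector:
    return [sum(wij * hj for wij, hj in zip(row, h)) for row in w]


def inner_link(h: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """h + W2 * GELU(W1 * h); input and output widths are the same."""
    branch = matvec(w2, [gelu(z) for z in matvec(w1, h)])
    return [hi + bi for hi, bi in zip(h, branch)]


def outer_link(h: Vector, w1: Matrix, w2: Matrix, w3: Matrix) -> Vector:
    """W3 * h + W2 * GELU(W1 * h); W3 and W2 map into the next model's width."""
    branch = matvec(w2, [gelu(z) for z in matvec(w1, h)])
    skip = matvec(w3, h)
    return [si + bi for si, bi in zip(skip, branch)]


def random_matrix(rows: int, cols: int, rng: random.Random, scale: float = 0.1) -> Matrix:
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_src, d_dst, d_hidden = 8, 6, 4  # toy sizes; the demo's hidden width is 896
h = [rng.uniform(-1.0, 1.0) for _ in range(d_src)]

refined = inner_link(
    h, random_matrix(d_hidden, d_src, rng), random_matrix(d_src, d_hidden, rng)
)
bridged = outer_link(
    h,
    random_matrix(d_hidden, d_src, rng),
    random_matrix(d_dst, d_hidden, rng),
    random_matrix(d_dst, d_src, rng),
)
print(len(refined), len(bridged))  # 8 6
```

The difference between the two links is the skip path: the inner link adds the input back unchanged, so it cannot change width, while the outer link replaces the identity with W₃·h so the output can land in a differently sized embedding space.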

The two-stage training (in the paper)

Why latent, not text?

Reported results

Across 9 benchmarks (math, science, medicine, search, code) and 4 patterns: average +8.3% accuracy, 1.2–2.4× speedup, 34.6–75.6% fewer tokens vs. text-based recursive MAS — with gains widening as recursion depth increases.

🧲 The embedding bus, and what a browser can't do

A faithful latent transfer needs to read a model's last-layer hidden states and feed a vector back in as inputs_embeds. A capability check confirmed that off-the-shelf browser runtimes (WebLLM and transformers.js) expose only input_ids → logits — neither hidden-state output nor embedding input. So this demo offers two pure-browser stand-ins: the default passes a compact text latent thought; the optional embedding memory bus passes real 384-dim vectors and routes between agents by cosine similarity (the closest analogue of the outer link). In both, the receiving LLM ultimately ingests text — that last step is the browser's hard limit.
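Routing by cosine similarity, as the embedding memory bus is described above, might look roughly like this. It is a sketch under stated assumptions: the agent names, their profile vectors, and the `route` helper are hypothetical, the vectors are 3-dimensional instead of 384-dimensional to keep the example readable, and how the demo actually builds each agent's vector is not stated in the text.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def route(thought: List[float], profiles: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the agent whose profile vector is closest to the thought vector."""
    if not profiles:
        raise ValueError("no agents to route to")
    scores = {name: cosine_similarity(thought, vec) for name, vec in profiles.items()}
    best = max(scores, key=lambda name: scores[name])
    return best, scores[best]


# Hypothetical agent profiles (toy 3-dim vectors; the bus in the text uses 384 dims).
profiles = {
    "planner": [1.0, 0.0, 0.0],
    "critic": [0.0, 1.0, 0.0],
    "solver": [0.6, 0.0, 0.8],
}
target, score = route([0.5, 0.1, 0.9], profiles)
print(target)  # solver
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for matching embedding vectors of varying magnitude.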