RecursiveMAS Playground — Recursive Multi-Agent Systems in the browser

Based on the paper “Recursive Multi-Agent Systems”. A whole multi-agent system is treated as one recursive computation loop: agents are chained A₁ → A₂ → … → Aₙ → back to A₁ and re-run for several recursion rounds. Intermediate rounds stay compact (“latent”); only the last agent in the last round decodes the full text answer. The payoff vs. ordinary text-passing multi-agent systems: higher accuracy, fewer tokens, faster.
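As a rough illustration of the loop described above, the sketch below chains toy agents, repeats the chain for a fixed number of rounds, and decodes text only once at the very end. The names `run_recursive_mas` and `Latent` and the arithmetic "agents" are hypothetical stand-ins, not the demo's or the paper's code; real agents would be LLM forward passes.

```python
from typing import Callable, List, Tuple

Latent = List[float]  # stand-in for a hidden-state vector (assumption)


def run_recursive_mas(
    agents: List[Callable[[Latent], Latent]],
    decode: Callable[[Latent], str],
    initial: Latent,
    rounds: int,
) -> Tuple[str, List[Tuple[int, int]]]:
    """Run A1 -> ... -> An for `rounds` rounds; decode text only once at the end."""
    if rounds < 1 or not agents:
        raise ValueError("need at least one agent and one round")
    state = list(initial)
    steps: List[Tuple[int, int]] = []
    for round_no in range(1, rounds + 1):
        for agent_no, think in enumerate(agents, start=1):
            state = think(state)  # latent in, latent out: no text is produced here
            steps.append((round_no, agent_no))
    # Only after the last agent of the last round is the state turned into text.
    return decode(state), steps


# Toy agents (hypothetical): simple arithmetic in place of LLM forward passes.
planner = lambda h: [x + 1.0 for x in h]
critic = lambda h: [x * 0.5 for x in h]
solver = lambda h: [x - 0.25 for x in h]

answer, steps = run_recursive_mas(
    [planner, critic, solver],
    decode=lambda h: f"answer={sum(h):.3f}",
    initial=[0.0, 2.0],
    rounds=2,
)
print(answer, len(steps))  # answer=1.250 6
```

With three agents and two rounds the loop takes six latent steps but calls `decode` exactly once, which is where the token savings over text-passing systems come from.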

⚠️ Honest note. Stock browser LLMs only expose input_ids → logits. This demo instead runs a custom-compiled RecursiveMAS-0.5B whose graph adds get_last_hidden, so every intermediate agent computes a real last-layer hidden state (the paper's latent thought) on-device — each step prints its [1×seq×896] tensor + the latent-space routing cosine as proof. The recursive loop, round structure, and efficiency mechanism all match the paper; only the final agent decodes text. Remaining gap: the next agent is still conditioned on the short text thought — full vector-only conditioning (and trained RecursiveLink weights) is the next step. Run it against the text-passing baseline and watch the tokens/time gap appear live.

1 · CHOOSE A COLLABORATION PATTERN

The paper instantiates RecursiveMAS under four agent-collaboration patterns — pick one to run locally. Or open Multi-Agent to collaborate with a friend's agent across two browsers over WebRTC.

🔗 Sequential: Planner → Critic → Solver

🧩 Mixture: Math · Code · Science → Summarizer

🎓 Distillation: Expert → Learner

🛠️ Deliberation: Reflector ↔ Tool-Caller

🛰️ Multi-Agent: Your agent ↔ a friend's, over WebRTC

2 · AGENT BACKBONE MODEL

This demo runs only the custom RecursiveMAS-0.5B backbone — the self-compiled model whose graph exposes get_last_hidden, so agents exchange a real last-layer hidden state (latent), not just compressed text.

3 · RECURSION ROUNDS

How many times the whole agent loop repeats: 1, 2, 3, or 4 rounds (2 is selected). The paper reports that accuracy and efficiency gains grow with depth (its “scaling law”).

4 · TASK

🧲 Embedding memory bus: agents route state through a real on-device embedding model (snowflake-arctic-embed-s) and a shared vector store, retrieving relevant prior latents by cosine similarity instead of passing prose down the chain. The embeddings genuinely flow between agents, but the retrieved text is what enters the LLM. This option adds a second small model to the load.

🧬 Latent-only transfer: the paper's true mechanism, in which intermediate agents never decode text. Each agent runs a forward pass through get_last_hidden over the latent injected by the previous agent (the outer RecursiveLink) and passes a hidden vector onward; only the final agent decodes to text. Requires the custom latent model and applies only to the linear patterns (Sequential, Distillation). Experimental: uses low-level WebLLM internals.

Model loads on first run and is cached in your browser afterwards.

Needs a WebGPU browser (Chrome/Edge 113+). Discrete GPU recommended for 3B+.

THE RECURSIVE LOOP

Pick a pattern and press Run — the agent loop animates here as it executes.

Legend: inner link (refines own latent thoughts) · outer link (transfers latent state to next agent) · final decode → text

🧲 LATENT MEMORY BUS — shared vector store

Each agent embeds its latent thought (real snowflake-arctic-embed-s vectors) and writes it here; the next agent retrieves the most similar entries by cosine similarity. This is the “outer link as retrieval” analogue — state flows between agents as embeddings, not prose.
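The write-then-retrieve flow described above can be sketched as a tiny in-memory vector store. This is a minimal sketch, not the demo's implementation: `MemoryBus`, `write`, and `retrieve` are hypothetical names, and the 3-dimensional vectors are hand-written placeholders for the 384-dim embeddings a real embedding model would produce.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class MemoryBus:
    """Shared store of (agent, thought, embedding) entries (hypothetical design)."""
    entries: List[Tuple[str, str, List[float]]] = field(default_factory=list)

    def write(self, agent: str, thought: str, vector: List[float]) -> None:
        self.entries.append((agent, thought, list(vector)))

    def retrieve(self, query: List[float], k: int = 2) -> List[Tuple[float, str, str]]:
        """Return the k entries most similar to `query`, best first."""
        scored = [(cosine(query, vec), agent, thought)
                  for agent, thought, vec in self.entries]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]


bus = MemoryBus()
bus.write("planner", "split the problem into two cases", [1.0, 0.0, 0.0])
bus.write("critic", "case 2 ignores negative inputs", [0.0, 1.0, 0.0])
bus.write("solver", "solve case 1 first", [0.9, 0.1, 0.0])

hits = bus.retrieve([1.0, 0.0, 0.0], k=2)
print([agent for _, agent, _ in hits])  # ['planner', 'solver']
```

Note that what `retrieve` hands back is still the stored text; the vectors only decide which entries the next agent sees, which matches the "embeddings flow, text enters the LLM" description.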

RESULTS — RecursiveMAS vs. Text-MAS

Reproduces the paper's headline comparison (its Table 2) on your own run.
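For reference, the two efficiency figures in such a comparison can be derived from raw run totals as below. This is a sketch under assumptions: `compare_runs` is a hypothetical helper, and the token and timing numbers are made up for illustration, not measured results.

```python
from typing import Tuple


def compare_runs(
    recursive_tokens: int,
    text_tokens: int,
    recursive_seconds: float,
    text_seconds: float,
) -> Tuple[float, float]:
    """Return (percent fewer tokens, speedup factor) of the recursive run vs. the text run."""
    if min(recursive_tokens, text_tokens, recursive_seconds, text_seconds) <= 0:
        raise ValueError("all totals must be positive")
    token_reduction_pct = 100.0 * (1.0 - recursive_tokens / text_tokens)
    speedup = text_seconds / recursive_seconds
    return token_reduction_pct, speedup


# Made-up totals for illustration only.
reduction, speedup = compare_runs(250, 1000, 2.0, 4.0)
print(f"{reduction:.1f}% fewer tokens, {speedup:.1f}x faster")  # 75.0% fewer tokens, 2.0x faster
```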

HOW RECURSIVEMAS WORKS

The RecursiveLink

A lightweight two-layer residual module is the only thing that's trained — the agents themselves stay frozen.

Inner link (within one agent)

R_in(h) = h + W₂·GELU(W₁·h)

Feeds an agent's last-layer latent thought back as its own next-step input — deepening reasoning without decoding to text.

Outer link (across agents)

R_out(h) = W₃·h + W₂·GELU(W₁·h)

Projects one model's latent state into the next (heterogeneous) model's embedding space — the bridge that lets different model families collaborate.
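The two link formulas above can be sketched in plain Python to make the shapes explicit. This is a minimal sketch under assumptions: weights are plain nested lists, bias terms are omitted as in the formulas, the exact (erf-based) GELU is used although the paper may use a different variant, and all function names are hypothetical. In the outer link, W₃ is what allows the output dimension to differ from the input dimension.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major: one inner list per output dimension


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(W: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def inner_link(h: Vector, W1: Matrix, W2: Matrix) -> Vector:
    """R_in(h) = h + W2 * GELU(W1 * h); output has the same size as h."""
    update = matvec(W2, [gelu(z) for z in matvec(W1, h)])
    return [a + b for a, b in zip(h, update)]


def outer_link(h: Vector, W1: Matrix, W2: Matrix, W3: Matrix) -> Vector:
    """R_out(h) = W3 * h + W2 * GELU(W1 * h); W3 and W2 set the output size."""
    skip = matvec(W3, h)
    update = matvec(W2, [gelu(z) for z in matvec(W1, h)])
    return [a + b for a, b in zip(skip, update)]


# With W2 = 0 the inner link reduces to the identity (pure residual path).
same = inner_link([1.0, -2.0], W1=[[1.0, 0.0], [0.0, 1.0]], W2=[[0.0, 0.0], [0.0, 0.0]])

# The outer link maps a 2-dim state into a 3-dim space via W3 (toy weights).
projected = outer_link(
    [1.0, 2.0],
    W1=[[0.0, 0.0]],
    W2=[[1.0], [1.0], [1.0]],
    W3=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
)
print(same, projected)  # [1.0, -2.0] [1.0, 2.0, 3.0]
```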

The two-stage training (in the paper)

Inner loop — warm-starts each agent's inner link with a cosine-similarity objective so its latent thoughts align with the input-embedding space.
Outer loop — unrolls the whole system over recursion rounds and back-propagates one shared cross-entropy signal through every outer link, co-optimizing the system as a single entity.

Why latent, not text?

Efficiency — intermediate agents skip expensive vocabulary-space decoding, replacing it with a cheap latent transform (1.2–2.4× faster, 34.6–75.6% fewer tokens).
Stable gradients — latent connections avoid the gradient-vanishing that text-mediated recursion suffers from when tokens are confident.
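One way to build intuition for the stable-gradients point is to look at the softmax Jacobian, J[i][j] = p[i]·(δ[i][j] − p[j]), which shrinks toward zero as the token distribution becomes one-hot. The sketch below assumes that text-mediated recursion back-propagates through a softmax over the vocabulary; that framing is an assumption made here for illustration, not a description of the paper's analysis, and the helper names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(p: List[float]) -> List[List[float]]:
    """J[i][j] = d p[i] / d logit[j] = p[i] * (delta_ij - p[j])."""
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]


def peak_gradient(scale: float) -> float:
    """Largest absolute Jacobian entry when the base logits are multiplied by `scale`."""
    p = softmax([scale * z for z in (2.0, 1.0, 0.0)])
    return max(abs(x) for row in softmax_jacobian(p) for x in row)


# Larger scale means a more confident (peaked) token distribution.
for scale in (1.0, 5.0, 20.0):
    print(scale, peak_gradient(scale))
```

As the scale grows, the printed value falls by orders of magnitude: a confident token prediction passes almost no gradient back through the softmax, whereas a latent connection bypasses that bottleneck.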

Reported results

Across 9 benchmarks (math, science, medicine, search, code) and 4 patterns: average +8.3% accuracy, 1.2–2.4× speedup, 34.6–75.6% fewer tokens vs. text-based recursive MAS — with gains widening as recursion depth increases.

🧲 The embedding bus, and what a browser can't do

A faithful latent transfer needs to read a model's last-layer hidden states and feed a vector back in as inputs_embeds. A capability check confirmed that off-the-shelf browser runtimes (WebLLM and transformers.js) expose only input_ids → logits, with neither hidden-state output nor embedding input. So this demo offers two pure-browser stand-ins: the default passes a compact text latent thought; the optional embedding memory bus passes real 384-dim vectors and routes between agents by cosine similarity (the closest analogue of the outer link). In both, the receiving LLM ultimately ingests text. That last step is the hard limit of stock browser runtimes, and it is the reason the demo runs a custom-compiled RecursiveMAS-0.5B whose graph adds get_last_hidden.