RL environment factory for frontier labs

The environments your agents are missing.

You’ve solved reasoning and code. The next unlock is agents that reliably operate enterprise software — CRMs, ERPs, legal systems, procurement tools. The blocker isn’t your RL algorithms. It’s environment supply.


Generates a runnable gym with executable rewards — right here on the page.

20,000+ environments (Agent-World paper)

19,822 tools (Agent-World paper)

23 benchmarks (paper suite)

20 categories


20,000+ environments generated (Agent-World paper)

19,822 executable tools (paper benchmark)

+20.1 benchmark lift from scale (paper)

0 LLM judges in reward path

Get started with a mirror bundle

Install. Mirror. Train.

One command packages production traces, tools, skills, and code context into a mirror bundle. Research-backed generation then turns real evidence into a gated world for training; imported mirrors stay replay/evaluation artifacts until regenerated through the deep path.

Install Blobfish

Python 3.12+ and git. No root needed. Installs the CLI, skills, and MCP server to ~/.blobfish.

curl -fsSL https://blobfish.ai/install.sh | bash

Mirror your production app

Extract your agent’s tool definitions, database schema, production traces, and skill instructions into a snapshot. The mirror builder compiles it into an executable world and replays every trace to prove parity.

blobfish mirror init ./appsnapshot

# Fill it: tools.json, schema.json, traces/, skills/

blobfish mirror build ./appsnapshot --out ./worlds

✓ Schema compiled: 8 tables, 340 rows
✓ Tools materialized: 14 production tools (1:1 match)
✓ Tasks synthesized: 20 from production traces
✓ Trace parity: 20/20 passed (fidelity ≥ 0.92)
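One way to picture the parity check is as a replay loop: run each recorded tool call against the compiled world and compare the result with what production returned. The sketch below is a minimal illustration under assumptions, not the mirror builder's actual implementation. The trace layout, the `replay_trace` and `parity_report` helpers, and the definition of fidelity as the fraction of matching steps are all hypothetical; only the 0.92 threshold comes from the build output above.

```python
from typing import Callable

# Hypothetical trace format: a list of recorded steps, each holding the tool
# name, its arguments, and the output that production returned.
Trace = list[dict]


def replay_trace(trace: Trace, tools: dict[str, Callable[..., object]]) -> float:
    """Replay recorded calls and return the fraction whose output matches."""
    if not trace:
        return 0.0
    matches = 0
    for step in trace:
        tool = tools.get(step["tool"])
        if tool is None:
            continue  # a missing tool counts as a mismatch
        try:
            actual = tool(**step["args"])
        except Exception:
            continue  # a crashing tool also counts as a mismatch
        if actual == step["expected"]:
            matches += 1
    return matches / len(trace)


def parity_report(traces: list[Trace], tools, min_fidelity: float = 0.92):
    """Count how many traces reach the assumed fidelity threshold."""
    scores = [replay_trace(trace, tools) for trace in traces]
    passed = sum(score >= min_fidelity for score in scores)
    return passed, len(scores)


# Toy mirrored tool and two recorded traces (illustrative data only).
INVENTORY = {"sku-1": 3, "sku-2": 0}


def check_stock(sku: str) -> int:
    return INVENTORY.get(sku, 0)


tools = {"check_stock": check_stock}
traces = [
    [{"tool": "check_stock", "args": {"sku": "sku-1"}, "expected": 3}],
    [{"tool": "check_stock", "args": {"sku": "sku-2"}, "expected": 5}],  # drifted
]
passed, total = parity_report(traces, tools)
print(f"Trace parity: {passed}/{total} passed")
```

The second trace fails because the mirrored world returns a different stock level than the recording. That kind of drift is what a parity check exists to surface.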

Serve over MCP and train

The world runs as a standard MCP server. Register it in Claude Code or any agent, run tasks, get executable rewards, distill what works into skills.

blobfish serve ./worlds/mirror_myapp_a1b2c3

blobfish eval ./worlds/mirror_myapp_a1b2c3 --policy oracle
  oracle pass rate: 100% (ceiling)

blobfish eval ./worlds/mirror_myapp_a1b2c3 --policy random
  random pass rate: 0% (floor — tasks discriminate)

blobfish report ./worlds/mirror_myapp_a1b2c3 \
  --compare baseline,with-new-skills

Gate your CI

Add a regression gate to your PR workflow. If a skill or tool change breaks production trace parity, the build fails before it reaches customers.

blobfish mirror build ./appsnapshot --out ./worlds
blobfish eval ./worlds/mirror_* --policy agent \
  --run ci-$(git rev-parse --short HEAD)
blobfish mirror diff ./worlds/mirror_* --run ci-*

# Exit code = pass/fail
# Gate: ≥80% parity, equivalence ≥0.75
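The gate logic can be sketched as a small function that maps a run summary to a process exit code. This is an illustration under assumptions: the `RunSummary` fields are hypothetical, and "equivalence" is treated as a single score between 0 and 1 because the page does not define it further. Only the two thresholds are taken from the comment above.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    # Hypothetical summary of one CI run; field names are illustrative.
    traces_total: int
    traces_passed: int
    equivalence: float  # assumed to be a score in [0, 1]


def gate_exit_code(
    run: RunSummary,
    min_parity: float = 0.80,
    min_equivalence: float = 0.75,
) -> int:
    """Return 0 when both thresholds hold, 1 otherwise."""
    if run.traces_total == 0:
        return 1  # nothing was replayed, so fail closed
    parity = run.traces_passed / run.traces_total
    ok = parity >= min_parity and run.equivalence >= min_equivalence
    return 0 if ok else 1


good = RunSummary(traces_total=20, traces_passed=17, equivalence=0.81)
bad = RunSummary(traces_total=20, traces_passed=15, equivalence=0.90)
print(gate_exit_code(good), gate_exit_code(bad))
```

Failing closed on an empty run is a design choice in this sketch: a build that replayed no traces should not pass the gate by default.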


The blocker is not your RL algorithm

Your next agent unlock is environment supply.

Frontier labs can build RL environments internally. But there are three constraints you keep hitting — and each one limits how fast your agents learn to operate real enterprise software.

Realism gap

Internal mocks are CRUD over SQLite. Real enterprise software has seasonal pricing, loyalty programs, partial refunds, multi-currency, legacy data formats, and conflicting policies. Agents trained on clean mocks fail in messy production.

Scale gap

Your research team can hand-build 20–50 environments. Agent-World showed that scaling from 0 to 2,000 environments produces a +20.1 point lift on agentic benchmarks. You need a factory, not a services team.

Reward gap

LLM-as-judge rewards are noisy, gameable, and not grounded. Your RL loop needs binary, executable signals: did the right state change happen, in the right system, under the right policy?

What Blobfish delivers

Controlled. Scalable. Realistic.

Three properties that define useful RL environments — and that manual construction cannot deliver at the scale frontier training demands.

9 edge cases injected

Controlled

Seed failure modes your model can’t handle yet

Make every simulation error-heavy and edge-case-dense. Inject partial refunds that break state machines, policy conflicts that require judgment, legacy date formats that crash parsers. A discrimination gate admits a task only when its reference solution passes the executable verifier while a do-nothing trajectory, a blind write-without-reading, and a wrong-row write all fail it — so every admitted task grades correctness, not motion. Every environment is a surgical stress test targeting your model’s weaknesses.
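The discrimination gate described above can be sketched with an in-memory SQLite database. Everything here is a toy stand-in under stated assumptions: the `orders` table, the refund task, and the `admits` helper are hypothetical and are not the factory's real schema or API. The point is the shape of the check: the reference must pass while the do-nothing, blind-write, and wrong-row baselines must all fail.

```python
import sqlite3


def fresh_db() -> sqlite3.Connection:
    """Build the toy starting state; every run gets its own copy."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, refunded REAL)")
    db.executemany("INSERT INTO orders VALUES (?, ?, 0)", [(1, 80.0), (2, 40.0)])
    return db


# Toy task: refund half of order 2, leaving order 1 untouched.
def verifier(db) -> bool:
    rows = dict(db.execute("SELECT id, refunded FROM orders"))
    return rows == {1: 0.0, 2: 20.0}


def weak_verifier(db) -> bool:
    # Only checks that some refund happened, so it grades motion, not correctness.
    return db.execute("SELECT COUNT(*) FROM orders WHERE refunded > 0").fetchone()[0] > 0


def reference(db):
    (total,) = db.execute("SELECT total FROM orders WHERE id = 2").fetchone()
    db.execute("UPDATE orders SET refunded = ? WHERE id = 2", (total / 2,))


def do_nothing(db):
    pass


def blind_write(db):
    # Writes a guessed amount without reading the order first.
    db.execute("UPDATE orders SET refunded = 10 WHERE id = 2")


def wrong_row(db):
    (total,) = db.execute("SELECT total FROM orders WHERE id = 2").fetchone()
    db.execute("UPDATE orders SET refunded = ? WHERE id = 1", (total / 2,))


def admits(check, reference_policy, baselines) -> bool:
    """Admit a task only if the reference passes and every baseline fails."""
    def run(policy) -> bool:
        db = fresh_db()
        policy(db)
        return check(db)
    return run(reference_policy) and not any(run(b) for b in baselines)


baselines = [do_nothing, blind_write, wrong_row]
print(admits(verifier, reference, baselines))       # strict verifier
print(admits(weak_verifier, reference, baselines))  # weak verifier
```

With the strict verifier the task is admitted. With the weak one it is rejected, because the blind-write and wrong-row baselines also pass it.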

20K environments

Scalable

20,000 environments. Zero manual effort.

No handcrafted mocks. No domain experts in the loop. No manual metrics. The factory compiles MCP specs, tool documentation, and industry data into complete environment families — databases, executable tools with unit tests, dependency graphs, grounded tasks, and VCode verifiers — autonomously and continuously.

Realistic

State that evolves with every action

Environments aren’t static fixtures. Database state layers with seasonal pricing, loyalty tiers, conflicting policies, and stale records. Seed data exactly as you anticipate production chaos. Each model action mutates real SQLite state — the complexity compounds under agent pressure, just like production.

How it works for you

Upload an anchor. Get a gym. Train or distill.

Three commands. From your enterprise documentation to a Docker-based synthetic gym with executable rewards — ready for RL training or SFT data generation.

Upload your anchor

An anchor is any enterprise artifact: an MCP server spec, a PRD, API documentation, database schemas, or tool definitions. The factory ingests it and mines realistic data structures.

# Upload via the sandbox UI or the API:
POST /api/v1/sandbox/jobs
{
  "prompt": "Procurement approvals for vendor onboarding",
  "anchor_files": [
    {"filename": "procurement_api.json", "content": "<OpenAPI spec>"},
    {"filename": "workflows.md",        "content": "<PRD document>"}
  ]
}

✓ Anchor ingested: prompt, API spec, PRD
✓ Domain model compiled: enterprise_procurement
✓ Tables & tools derived from anchor schemas
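From Python, the same request body can be assembled with the standard library. This is a sketch under assumptions: `BASE_URL` is a placeholder because the page does not state the API host, authentication is not shown because it is not documented here, and `build_job_request` is a hypothetical helper. The endpoint path and the payload fields are the ones shown above.

```python
import json
import urllib.request

# Assumption: placeholder host. The real API host is not given on this page.
BASE_URL = "https://blobfish.example"


def build_job_request(prompt: str, anchor_files: dict[str, str]) -> urllib.request.Request:
    """Build (but do not send) a POST request for a sandbox job."""
    payload = {
        "prompt": prompt,
        "anchor_files": [
            {"filename": name, "content": content}
            for name, content in anchor_files.items()
        ],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/sandbox/jobs",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_job_request(
    "Procurement approvals for vendor onboarding",
    {
        "procurement_api.json": "<OpenAPI spec>",
        "workflows.md": "<PRD document>",
    },
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```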

Generate executable gym environments

The factory compiles your anchor into a complete, executable world — SQLite state, Python tools, dependency graphs, grounded tasks, and VCode verifiers. Every world downloads with a generated Dockerfile, so it runs anywhere you run containers.

$ blobfish generate \
    --vertical enterprise_procurement \
    --rows 200 \
    --graph-tasks 12 \
    --prog-tasks 8

world: enterprise_procurement_a1b2c3
  dir: ./blobfish_worlds/enterprise_procurement_a1b2c3
  12 tables · 2,400 rows · 14 tools · 20 tasks (train 12 / heldout 8)
  personas: ap_clerk, category_manager, controller
  serve: blobfish serve ./blobfish_worlds/enterprise_procurement_a1b2c3

Run RL training against live gyms

Each rollout gets an independent gym instance. The agent calls tools against real SQLite state. VCode computes binary rewards from database diffs. GRPO updates the policy.

$ blobfish train \
    --world ./blobfish_worlds/enterprise_procurement_a1b2c3 \
    --data sft_enterprise_procurement.jsonl \
    --base Qwen/Qwen3-8B \
    --method grpo

Rollouts run against the world's real SQLite state.
VCode scores each episode; GRPO updates the policy.
Held-out tasks are evaluated with the same harness
for both arms — see measured results below.
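To make the GRPO step concrete: binary rewards from a group of rollouts on the same task are turned into group-relative advantages by subtracting the group mean and dividing by the group standard deviation. The sketch below shows only that normalization, under assumptions. The function name is hypothetical, the population standard deviation is one common choice, and nothing here reflects how `blobfish train` is implemented internally.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of rollouts for the same task."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; some implementations use sample std
    return [(reward - mean) / (std + eps) for reward in rewards]


# Four rollouts of one task, scored 1 (verifier passed) or 0 (failed).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
print(mixed)
print(uniform)
```

A group where every rollout gets the same reward yields all-zero advantages and therefore no gradient signal. This is one reason tasks that discriminate between good and bad trajectories matter for training.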

Generate distillation data

Turn recorded traces into supervised data for fine-tuning smaller models without running RL. The distill-data command replays each trace against a scratch database and exports only the turns that survive verification as SFT JSONL — rejects go to a report, never into the training file.

$ blobfish distill-data \
    --traces runs/procurement.jsonl \
    --world ./blobfish_worlds/enterprise_procurement_a1b2c3 \
    --validate replay \
    --out sft_enterprise_procurement.jsonl \
    --report-out distill_report.json

Replaying traces against a scratch DB...
  ├── 10,000 traces read
  ├──  8,247 verified  → exported
  ├──  1,753 rejected  → schema / replay mismatch
  └── Exported:
      sft_enterprise_procurement.jsonl  (8,247 pairs)
      distill_report.json               (rejection reasons)
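The filter-by-replay idea can be sketched in a few lines. This is an illustration under assumptions, not the `distill-data` implementation: a plain dict stands in for the scratch database, and the trace fields, the `approve_vendor` tool, and the output record shape are all hypothetical.

```python
import copy
import json

# A dict stands in for the scratch database in this sketch.
SEED_STATE = {"vendors": {"v1": {"status": "pending"}}}


def approve_vendor(state: dict, vendor_id: str) -> str:
    vendor = state["vendors"][vendor_id]  # raises KeyError for unknown vendors
    vendor["status"] = "approved"
    return vendor["status"]


TOOLS = {"approve_vendor": approve_vendor}


def distill(traces: list[dict]):
    """Replay each trace on a fresh state; keep only the ones that verify."""
    sft_lines, rejections = [], []
    for trace in traces:
        scratch = copy.deepcopy(SEED_STATE)
        try:
            result = TOOLS[trace["tool"]](scratch, **trace["args"])
        except (KeyError, TypeError) as exc:
            rejections.append({"id": trace["id"], "reason": f"replay error: {exc!r}"})
            continue
        if result != trace["recorded_result"]:
            rejections.append({"id": trace["id"], "reason": "replay mismatch"})
            continue
        record = {
            "prompt": trace["prompt"],
            "completion": {"tool": trace["tool"], "args": trace["args"]},
        }
        sft_lines.append(json.dumps(record))
    return sft_lines, rejections


traces = [
    {"id": "t1", "prompt": "Approve vendor v1", "tool": "approve_vendor",
     "args": {"vendor_id": "v1"}, "recorded_result": "approved"},
    {"id": "t2", "prompt": "Approve vendor v9", "tool": "approve_vendor",
     "args": {"vendor_id": "v9"}, "recorded_result": "approved"},
    {"id": "t3", "prompt": "Approve vendor v1", "tool": "approve_vendor",
     "args": {"vendor_id": "v1"}, "recorded_result": "rejected"},
]
sft_lines, rejections = distill(traces)
print(len(sft_lines), "exported;", len(rejections), "rejected")
```

Only the first trace reaches the training lines. The other two land in the rejection list with a reason, mirroring the split between the SFT file and the report.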

Watch a world being built

One environment. Eight stages. Fully autonomous.

From raw MCP spec to rollout-ready gym in a single pipeline run. No human in the loop. Then the factory repeats this 1,978 times across 20 industry verticals.

blobfish factory — enterprise_procurement

0.0s  INGEST     user prompt + PRD + API docs + GitHub repo + optional data  [streaming]
1.2s  DATABASE   12 tables · 2,400 rows: budgets, vendors, stale approvals, legacy IDs  [compiled]
2.4s  TOOLS      14 functions: search_vendor ✓ create_po ✓ route_approval ✓ match_invoice ✓  [unit-tested]
3.6s  GRAPH      14 nodes, 23 edges: strong:8 weak:10 independent:5  [connected]
4.8s  TASKS      47 tasks: graph-walk:31 programmatic:16, ground truth recorded  [grounded]
6.0s  VCODE      47 verifiers: 3-12 assertions each, DB state checks, no LLM judge  [attached]
7.2s  QUALITY    Discrimination gate: reference passes; do-nothing, blind-write & wrong-row baselines rejected  [filtered]
8.4s  GYM READY  enterprise_procurement sealed: deterministic reset · content-addressed · immutable  [✓ deployed]

× 1,978 environments generated across 20 verticals, zero manual effort

World creation at scale

One real system becomes thousands of executable worlds.

The anchor locks down the meaning of a production system: its data, policies, tools, and failure modes. The factory branches from that source into controlled variants, tasks, and verifiers without hand-authoring every mock.

[Diagram: world creation pipeline]

Steps: 01 Anchor locked (roles + schemas + policies) · 02 State layered (stale rows + edge records) · 03 Variants branched (policy x persona x seed) · 04 Rewards sealed (VCode + reset snapshots)

Components: MCP (tool contract) · Schema (entities) · Policy (guardrails) · Graph (tool path) · State DB (mutations) · VCode (reward)

Anchor: job_anchor.lock (prompt + PRD + MCP + docs + repo + policy failures)

Branch axes: policy (strict approval limits) · persona (busy operations lead) · seed (stale vendor record)

State evolves after every action: tool_call route_approval() · state_diff approval_log +1 · next_obs exception pending

Compiled worlds: 20,000 · VCode: 8/8 pass

lock: enterprise_job_v7
branch: policies x personas x seeds
emit: tools + tasks + VCode + reset state

Anchor

Lock the source

Capture the real system once: schema, roles, policies, tools, and known failure modes.

source -> invariant spec

Branch

Sweep the variants

Vary seeds, personas, permissions, stale records, seasonal rules, and edge cases.

policies x personas x seeds
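The branch step is a cross product over the three axes. A minimal sketch, with illustrative axis values: the persona names and the stale-vendor seed appear elsewhere on this page, while `delegated_approval` and `clean` are hypothetical additions used only to make the product non-trivial.

```python
from itertools import product

# Illustrative axis values; two of them are hypothetical (see the note above).
policies = ["strict_approval_limits", "delegated_approval"]
personas = ["ap_clerk", "category_manager", "controller"]
seeds = ["stale_vendor_record", "legacy_ids", "clean"]

variants = [
    {"policy": policy, "persona": persona, "seed": seed}
    for policy, persona, seed in product(policies, personas, seeds)
]
print(len(variants), "variants from one anchor")
print(variants[0])
```

Two policies, three personas, and three seeds give 18 variants. Adding a value to any axis multiplies the count, which is how one anchor can fan out into many worlds.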

Compile

Emit runnable worlds

Generate reset state, executable tools, grounded tasks, answer keys, and rollout manifests.

state + tools + tasks

Verify

Attach deterministic rewards

Score database diffs and tool side effects with VCode assertions instead of LLM judgment.

task -> verifier

1 source anchor

40+ world families

2,000+ training variants

20,000 executable worlds

World mock intake

User context becomes a Dockerized mock world.

Start from the artifacts your team already has: PRDs, MCP servers, API docs, GitHub repos, and optional company data. The factory compiles them into mock apps, assets, tasks, and deterministic verifiers that can run as isolated training worlds.

User inputs

PRD: flows, roles, acceptance criteria (required)
MCP: tool contracts and side effects (required)
API docs: schemas, endpoints, auth states (required)
GitHub repo: business logic and edge paths (optional)
Company data: seed records and assets (optional)

World compiler

normalize, synthesize, package (anchor.lock, world.ir, docker build)

Generated artifacts

Docker world: containerized reset state
Mock app + assets: UI, fixtures, files, media
Task packs: grounded goals and answer keys
Verifier: VCode reward and DB diffs

build: docker run bf/world:<job_id>
mount: app assets + fixtures + seeded data
emit: tasks + reset snapshots + VCode verifier

Upload the source material

PRD, MCP, API docs, repo logic, and optional seed data are normalized into a domain anchor.

Compile the mock world

The factory generates Dockerized services, mock UI assets, state layers, tool contracts, and failure-heavy variants.

Ship RL-ready artifacts

Every world includes tasks, answer keys, reset snapshots, and VCode rewards so rollouts have deterministic feedback.

Autonomous environment factory

From enterprise evidence to rollout-ready RL worlds.

Blobfish compiles domain artifacts into environment packs: databases, tools, tasks, reward code, and diagnostics — then self-evolves against your model’s weaknesses.

Discover

Mine real enterprise structure

Deep research agents crawl real-world data structures: pricing, catalogs, policy docs, workflow examples, hiring posts, personas, and tool documentation. Then layer complexity until the world mirrors production chaos.

entities · policies · messy data

Synthesize

Compile tools and task graphs

Coding agents write real Python tools with unit tests. A weighted dependency graph connects them (strong: 3, weak: 2, independent: 1). Graph walks and programmatic synthesis produce tasks with proven ground truth.

tools · tasks · variants
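A graph walk over the tool dependency graph can be sketched as a weighted random walk. This is an illustration under assumptions: the edges between the four tools are hypothetical, and using the strong/weak/independent weights (3/2/1) as sampling weights is one plausible reading of the description, not a confirmed detail of the factory.

```python
import random

WEIGHTS = {"strong": 3, "weak": 2, "independent": 1}

# Hypothetical edges between four tools named in the factory log.
EDGES = {
    "search_vendor": [("create_po", "strong"), ("match_invoice", "independent")],
    "create_po": [("route_approval", "strong"), ("match_invoice", "weak")],
    "route_approval": [("match_invoice", "weak")],
    "match_invoice": [],
}


def graph_walk(start: str, max_steps: int, rng: random.Random) -> list[str]:
    """Walk the graph, preferring stronger dependencies, until a sink or the cap."""
    path = [start]
    while len(path) < max_steps:
        options = EDGES[path[-1]]
        if not options:
            break  # reached a tool with no outgoing dependencies
        targets = [target for target, _ in options]
        weights = [WEIGHTS[kind] for _, kind in options]
        path.append(rng.choices(targets, weights=weights, k=1)[0])
    return path


walk = graph_walk("search_vendor", max_steps=4, rng=random.Random(0))
print(" -> ".join(walk))
```

Each walk is a candidate tool sequence for a task. Seeding the random generator keeps the walk reproducible, which matters if the resulting task needs recorded ground truth.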

Verify

Attach executable rewards

VCode scripts check what the agent did, not what it said. Direct database assertions verify rows, ledgers, permissions, tool side effects, and final state. 3–12 binary assertions per task. No LLM judgment.

binary rewards · lineage

Evolve

Target model weaknesses

Arena diagnosis reads all failure traces and outputs ranked weaknesses: brittle parameter passing, missed policy dependencies, wrong IDs, or stale state assumptions. The factory produces targeted environments for the next curriculum wave.

curriculum updates

Executable rewards

No LLM-as-judge in the reward path.

VCode verifies what the agent did, not whether the answer sounds plausible. Every task ships with direct state assertions and recorded ground truth from a live sandbox run.

observation = env.reset(variant_ref, task_ref)
result = env.call_tool(name, arguments)
snapshot = env.snapshot()
reward = vcode.verify(
    initial_state,
    final_state,
    trajectory,
)

Input: variant_ref + task_ref + seed

Action: agent calls tools against live state

Snapshot: initial state, final state, trajectory

Reward: VCode assertions over database diffs
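As a rough picture of what a verifier over database diffs might look like, the sketch below scores one procurement task from an initial state, a final state, and a trajectory. It is a toy under assumptions: states are plain dicts rather than SQLite snapshots, and the table names, IDs, and the `verify` signature are hypothetical stand-ins for the `vcode.verify` call shown above.

```python
import copy


def verify(initial_state: dict, final_state: dict, trajectory: list[dict]) -> int:
    """Return 1 only if every assertion about the state change holds."""
    new_logs = set(final_state["approval_log"]) - set(initial_state["approval_log"])
    assertions = [
        len(new_logs) == 1,  # exactly one approval entry was added
        final_state["purchase_orders"]["po-7"]["status"] == "pending_approval",
        final_state["vendors"] == initial_state["vendors"],  # no collateral edits
        any(call["tool"] == "route_approval" for call in trajectory),
    ]
    return int(all(assertions))


initial_state = {
    "vendors": {"v-1": {"name": "Acme", "status": "active"}},
    "purchase_orders": {"po-7": {"vendor": "v-1", "status": "draft"}},
    "approval_log": {},
}

# A correct run: the PO is routed and one log row appears.
final_state = copy.deepcopy(initial_state)
final_state["purchase_orders"]["po-7"]["status"] = "pending_approval"
final_state["approval_log"]["log-1"] = {"po": "po-7", "approver": "controller"}
trajectory = [{"tool": "route_approval", "arguments": {"po": "po-7"}}]

# A run that reaches the same PO status but also edits the vendor record.
tampered = copy.deepcopy(final_state)
tampered["vendors"]["v-1"]["status"] = "blocked"

print(verify(initial_state, final_state, trajectory))
print(verify(initial_state, tampered, trajectory))
```

The tampered run scores 0 even though the purchase order ends in the right status, because the reward depends on the whole state diff rather than on a single field.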

What an environment looks like

Every environment is a complete, executable world.

Not a prompt template. Not a mock API. A real SQLite database with messy data, executable Python tools, a dependency graph, grounded tasks, and VCode verification scripts.

enterprise_world/
├── state.db          # SQLite: 12 tables, policy-heavy state
├── tools/
│   ├── search_records.py     # 94 lines, 8 params
│   ├── create_request.py     # 127 lines, validates policy
│   ├── route_approval.py     # 86 lines, delegated approver logic
│   └── ...                   # 14 tools total, all unit-tested
├── graph.json        # weighted dependency DAG
├── tasks/
│   ├── task_001.json         # "Find vendor, create PO, route approval"
│   ├── task_002.json         # "Resolve policy exception and match invoice"
│   └── ...                   # 47 grounded tasks
├── vcode/
│   ├── verify_001.py         # 8 assertions on final DB state
│   └── ...                   # binary reward per task
└── manifest.json     # content-addressed, parent-linked

Database: 12 tables, 2,400 rows, policy exceptions, stale records, delegated roles

Tools: 14 Python functions with unit tests, >50% test accuracy required

Graph: Weighted DAG: strong (3), weak (2), independent (1) edges

Tasks: 47 grounded tasks from graph walks + programmatic synthesis

VCode: 3–12 binary assertions per task, checks DB state not LLM output

Quality: Discrimination gate: reference passes; do-nothing, blind-write & wrong-row baselines all fail

Measured on our own worlds · Qwen3-8B

Measured lifts on our own environments.

Recorded Qwen3-8B runs over 432 composition worlds built by this factory — base vs. tuned on the same harness. Every number carries its sample size and its significance, including the one that didn’t move.

+15.28

τ²-bench retail · Qwen3-8B

38.14% → 53.42% · paired n=78 · 95% CI [+6.8, +23.8] · p=0.0004

+30.0

τ²-bench telecom · Qwen3-8B

17.0% → 47.0% · n=100 · never-trained domain · p<1e-8

+19.9

BFCL V4 multi-turn · Qwen3-8B

10.6% → 30.5% · all 800 · McNemar p<0.0001

+4.2

τ²-bench airline · Qwen3-8B

19.3% → 23.5% · n=50 · sign-test p=0.327 · NOT significant

Scope, stated plainly: these are Qwen3-8B results. The airline row is a genuine non-significant boundary (p=0.327), not evidence of lift. Our MCP-Mark diagnostic slice (n=204) showed no improvement — 3.9% to 2.9%, floor-level noise. Qwen3-14B has not been run; no 14B checkpoint or evaluation exists in this repository. And while these worlds were produced by this factory, the specific world you generate in the sandbox has not itself been isolated as the causal training corpus for these numbers.

The blueprint: Agent-World (Dong et al., 2026)

This factory implements the recipe published in Agent-World (arXiv:2604.18292, Renmin University of China & ByteDance Seed). Everything below is the paper’s reported evidence for that recipe: the authors’ models, corpus, and measurements, not Blobfish results. It is why we build this way; our own measurements are the four cards above.

61.8%

Agent-World-14B on τ²-Bench

Approaches frontier API models on real tool-use benchmarks

55.8%

Agent-World-14B on BFCL V4

Beats DeepSeek-V3.2-685B (54.1%) with a 14B model

+8.6 points

Two self-evolution rounds on MCP-Mark

Arena diagnosis → targeted generation → measurable lift

Benchmark comparison (paper, Table 2)

Model                Size         MCP-Mark  BFCL V4  τ²-Bench
GPT-5.2 High         Proprietary  53.1%     -        -
Claude Sonnet-4.5    Proprietary  -         -        -
DeepSeek-V3.2-685B   685B         44.6%     54.1%    56.2%
Qwen3-14B            14B          25.3%     45.9%    42.3%
Agent-World-14B      14B          38.1%     55.8%    50.5%
Agent-World-8B       8B           28.4%     49.1%    61.8%

Scaling: environments → performance

Paper result. Average score across agentic benchmarks as the paper’s environment count grows from 0 to 2,000. +20.1 point lift.

Environments  Average score
0             18.4%
250           23.2%
500           27.8%
1,000         32.1%
1,500         35.6%
2,000         38.5%
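The +20.1 figure is the difference between the last and first points of the scaling series above. A short check of that arithmetic, using only the paper's reported values as quoted on this page:

```python
# Average score (%) by environment count, as quoted from the paper above.
scores = {0: 18.4, 250: 23.2, 500: 27.8, 1000: 32.1, 1500: 35.6, 2000: 38.5}

counts = sorted(scores)
total_lift = round(scores[counts[-1]] - scores[counts[0]], 1)

# Gain over each interval between consecutive measurement points.
steps = [
    (low, high, round(scores[high] - scores[low], 1))
    for low, high in zip(counts, counts[1:])
]

print(f"total lift: +{total_lift} points")
for low, high, gain in steps:
    print(f"{low:>5} -> {high:<5} +{gain}")
```

Note that the intervals are not equally wide (250 environments for the first two, 500 for the rest), so the per-interval gains should be compared with that in mind.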

Self-evolution (Agent-World-14B)

Paper result. Each round: arena evaluates, diagnosis targets weaknesses, factory generates targeted environments, model trains again.

Round      MCP-Mark       BFCL V4        τ²-Bench
Base       29.5%          52.4%          45.3%
+1 round   36.3% (+6.8)   54.9% (+2.5)   48.6% (+3.3)
+2 rounds  38.1% (+1.8)   55.8% (+0.9)   50.5% (+1.9)

20 categories · 50 subcategories · 1,978 environments (Agent-World paper taxonomy)

Coverage across enterprise verticals.

Environments span 20 primary industry categories, each with subcategories and domain-specific complexity layers. Three input sources: MCP servers, tool documentation, and industry PRDs.

01 Travel & Booking

02 Finance & Banking

03 Healthcare Ops

04 E-commerce

05 Legal & Compliance

06 HR & Payroll

07 Search & Retrieval

08 Document & Design

09 Social Media

10 Communication

11 System & Cloud

12 DevOps & CI/CD

13 Education

14 Real Estate

15 Manufacturing

16 Logistics

17 Insurance

18 Government

19 Energy & Utilities

20 Food & Hospitality

Citation

BibTeX

@article{dong2026agent,
  title     = {Agent-World: Scaling Real-World Environment
               Synthesis for Evolving General Agent
               Intelligence},
  author    = {Dong, Guanting and Lu, Junting and
               Huang, Junjie and Zhong, Wanjun and
               others},
  journal   = {arXiv preprint arXiv:2604.18292},
  year      = {2026}
}

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence. Dong et al., 2026. Renmin University of China & ByteDance Seed.

Frontier lab pilot

3 verticals. 150 environments. 4 weeks.

We package the environments, tasks, tools, and verifiers for your existing RL loop. You measure lift on your internal agent evals and decide whether to scale to continuous factory integration.

Scope the pilot

3 verticals, 50 environments each

SQLite-backed state with deterministic reset

Generated Python tools with unit tests

Grounded task sets with hidden answer keys

VCode verification scripts (3–12 assertions)

Gym-compatible rollout interface
