RL environment factory for frontier labs

The environments your agents are missing.

You’ve solved reasoning and code. The next unlock is agents that reliably operate enterprise software — CRMs, ERPs, legal systems, procurement tools. The blocker isn’t your RL algorithms. It’s environment supply.

2,000+ Environments
19K+ Tools
23 Benchmarks
20 Categories
20,000+ runnable environments from one factory
19,822 executable tools
+20.1 benchmark lift from scale
0 LLM judges in reward path

The blocker is not your RL algorithm

Your next agent unlock is environment supply.

Frontier labs can build RL environments internally. But there are three constraints you keep hitting — and each one limits how fast your agents learn to operate real enterprise software.

01

Realism gap

Internal mocks are CRUD over SQLite. Real enterprise software has seasonal pricing, loyalty programs, partial refunds, multi-currency, legacy data formats, and conflicting policies. Agents trained on clean mocks fail in messy production.

02

Scale gap

Your research team can hand-build 20–50 environments. Agent-World showed that scaling from 0 to 2,000 environments produces a +20.1 point lift on agentic benchmarks. You need a factory, not a services team.

03

Reward gap

LLM-as-judge rewards are noisy, gameable, and not grounded. Your RL loop needs binary, executable signals: did the right state change happen, in the right system, under the right policy?

What Blobfish delivers

Controlled. Scalable. Chaotic.

Three properties that define useful RL environments — and that manual construction cannot deliver at the scale frontier training demands.

9 edge cases injected
Controlled

Seed failure modes your model can’t handle yet

Make every simulation error-heavy and edge-case-dense. Inject partial refunds that break state machines, policy conflicts that require judgment, legacy date formats that crash parsers. A quality gate runs 5 baseline attempts — only tasks scoring 2–4/5 enter your training set. Every environment is a surgical stress test targeting your model’s weaknesses.
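
As a rough illustration of the quality gate described above, here is a minimal sketch that keeps only tasks a baseline agent solves 2 to 4 times out of 5. The function name `quality_gate`, the `run_baseline` callback, and the bucket labels are assumptions made for this example, not the factory's actual interface.

```python
from typing import Callable, Dict, List


def quality_gate(
    tasks: List[str],
    run_baseline: Callable[[str], bool],
    attempts: int = 5,
    min_pass: int = 2,
    max_pass: int = 4,
) -> Dict[str, List[str]]:
    """Bucket tasks by how many of `attempts` baseline runs succeed."""
    buckets: Dict[str, List[str]] = {"accepted": [], "too_easy": [], "too_hard": []}
    for task in tasks:
        passes = sum(1 for _ in range(attempts) if run_baseline(task))
        if passes > max_pass:
            buckets["too_easy"].append(task)
        elif passes < min_pass:
            buckets["too_hard"].append(task)
        else:
            buckets["accepted"].append(task)
    return buckets


# Scripted baseline outcomes stand in for real agent rollouts (hypothetical data).
scripted = {
    "easy": iter([True] * 5),
    "medium": iter([True, False, True, False, True]),
    "hard": iter([False] * 5),
}
result = quality_gate(list(scripted), lambda task: next(scripted[task]))
```

In this toy run only the task solved 3 times out of 5 lands in the accepted bucket; the always-solved and never-solved tasks are filtered out.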

20K
environments
Scalable

20,000 environments. Zero manual effort.

No handcrafted mocks. No domain experts in the loop. No manual metrics. The factory compiles MCP specs, tool documentation, and industry data into complete environment families — databases, executable tools with unit tests, dependency graphs, grounded tasks, and VCode verifiers — autonomously and continuously.

Realistic

State that evolves with every action

Environments aren’t static fixtures. Database state is layered with seasonal pricing, loyalty tiers, conflicting policies, and stale records, so you can seed data to match the production chaos you expect. Each model action mutates real SQLite state, and the complexity compounds under agent pressure, just like production.
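
To make "each action mutates real SQLite state" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` schema and the `issue_partial_refund` tool are hypothetical and exist only for this illustration.

```python
import sqlite3


def make_world() -> sqlite3.Connection:
    """Build a tiny in-memory world; the schema is illustrative only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER, "
        "refunded_cents INTEGER DEFAULT 0, status TEXT DEFAULT 'paid')"
    )
    conn.execute("INSERT INTO orders (id, total_cents) VALUES (1, 10000)")
    conn.commit()
    return conn


def issue_partial_refund(conn: sqlite3.Connection, order_id: int, amount_cents: int) -> str:
    """Hypothetical tool: refund part of an order and update its status."""
    row = conn.execute(
        "SELECT total_cents, refunded_cents FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    if row is None:
        return "error: unknown order"
    total, refunded = row
    if amount_cents <= 0:
        return "error: invalid refund amount"
    if refunded + amount_cents > total:
        return "error: refund exceeds remaining balance"
    new_refunded = refunded + amount_cents
    status = "refunded" if new_refunded == total else "partially_refunded"
    conn.execute(
        "UPDATE orders SET refunded_cents = ?, status = ? WHERE id = ?",
        (new_refunded, status, order_id),
    )
    conn.commit()
    return "ok"


conn = make_world()
first = issue_partial_refund(conn, 1, 4000)   # succeeds and changes state
second = issue_partial_refund(conn, 1, 7000)  # rejected because of the first call
final = conn.execute("SELECT refunded_cents, status FROM orders WHERE id = 1").fetchone()
```

The second call fails only because the first one already changed the row, which is the kind of compounding state a static fixture cannot reproduce.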

How it works for you

Upload an anchor. Get a gym. Train or distill.

Three commands. From your enterprise documentation to a Docker-based synthetic gym with executable rewards — ready for RL training or SFT data generation.

01

Upload your anchor

An anchor is any enterprise artifact: an MCP server spec, a PRD, API documentation, database schemas, or tool definitions. The factory ingests it and mines realistic data structures.

$ blobfish upload \
    --prompt "Procurement approvals for vendor onboarding" \
    --prd procurement_workflows.pdf \
    --docs erp_api_reference/ \
    --repo github.com/acme/procurement-ops

✓ Anchor ingested: prompt, PRD, docs, repo
✓ Domain model compiled: enterprise_procurement
✓ Complexity layers: budget caps, delegated approvers, stale vendors
02

Generate Docker gym environments

The factory compiles your anchor into complete, executable environments — each a Docker image with SQLite state, Python tools, dependency graphs, grounded tasks, and VCode verifiers.

$ blobfish generate \
    --domain enterprise_procurement \
    --variants 50 \
    --complexity high

Building environments...
  ├── Generating 14 tools with unit tests    ✓
  ├── Building weighted dependency graph     ✓
  ├── Synthesizing 47 tasks (graph + prog)   ✓
  ├── Attaching VCode verifiers (3-12 each)  ✓
  ├── Running 5-run quality gate             ✓
  └── Packaging 50 Docker gym images         ✓

✓ 50 gym images pushed to registry
03a

Run RL training against live gyms

Each rollout gets an independent gym instance. The agent calls tools against real SQLite state. VCode computes binary rewards from database diffs. GRPO updates the policy.

$ blobfish train \
    --gym enterprise_procurement:latest \
    --algo grpo \
    --model qwen3-14b \
    --rollout-workers 8

Epoch 1/10: 400 rollouts across 50 variants
  reward_mean: 0.34 → 0.61
  tasks_solved: 142/400
  checkpoint: gs://models/procurement-grpo-e1.pt
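
For readers less familiar with GRPO, the sketch below shows the group-relative advantage step commonly used with binary rewards: each rollout's reward is normalized against the other rollouts for the same task. How `blobfish train` implements this internally is not specified here; the function name, the use of population standard deviation, and the `eps` value are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some codebases use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task with binary rewards (hypothetical values).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout fails carries no learning signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Groups where every rollout succeeds or every rollout fails produce zero advantages, which is one reason a gate that keeps tasks in the 2–4/5 band matters for this style of training.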
03b

Generate distillation data

Record expert trajectories for fine-tuning smaller models without running RL. Export as SFT pairs or DPO triples with rejected trajectories.

$ blobfish distill \
    --gym enterprise_procurement:latest \
    --teacher claude-sonnet-4-5 \
    --episodes 10000 \
    --format sft,dpo

Recording expert trajectories...
  ├── 10,000 episodes completed
  ├── 8,247 successful (verified by VCode)
  ├── 1,753 failed (used for DPO rejected)
  └── Exported:
      sft_enterprise_procurement.jsonl    (8,247 pairs)
      dpo_enterprise_procurement.jsonl    (1,753 triples)
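
One plausible way to turn recorded episodes into SFT pairs and DPO triples is sketched below. The `Episode` fields, the JSONL keys, and the rule of pairing each failed episode with a verified episode for the same task are assumptions for illustration, not the documented export format.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Episode:
    task_id: str
    prompt: str
    trajectory: str
    verified: bool  # True when the verifier accepted the final state


def build_datasets(episodes: List[Episode]) -> Tuple[List[Dict], List[Dict]]:
    """Return (sft_rows, dpo_rows) built from verified and failed episodes."""
    sft = [{"prompt": e.prompt, "completion": e.trajectory} for e in episodes if e.verified]
    # Keep the first verified episode per task as the "chosen" side.
    chosen_by_task: Dict[str, Episode] = {}
    for e in episodes:
        if e.verified:
            chosen_by_task.setdefault(e.task_id, e)
    dpo = [
        {
            "prompt": e.prompt,
            "chosen": chosen_by_task[e.task_id].trajectory,
            "rejected": e.trajectory,
        }
        for e in episodes
        if not e.verified and e.task_id in chosen_by_task
    ]
    return sft, dpo


def to_jsonl(rows: List[Dict]) -> str:
    """Serialize rows as one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


episodes = [
    Episode("t1", "Create a PO for vendor 7", "search_vendor -> create_po", True),
    Episode("t1", "Create a PO for vendor 7", "create_po(wrong_vendor)", False),
    Episode("t2", "Route approval", "route_approval(stale_id)", False),
    Episode("t3", "Match invoice", "match_invoice", True),
]
sft_rows, dpo_rows = build_datasets(episodes)
sft_jsonl = to_jsonl(sft_rows)
```

In this sketch a failed episode with no verified counterpart (task `t2`) is dropped rather than paired arbitrarily.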
Watch a world being built

One environment. Eight stages. Fully autonomous.

From raw MCP spec to rollout-ready gym in a single pipeline run. No human in the loop. Then the factory repeats this 1,978 times across 20 industry verticals.

blobfish factory — enterprise_procurement
0.0s  INGEST     user prompt + PRD + API docs + GitHub repo + optional data  [streaming]
1.2s  DATABASE   12 tables · 2,400 rows: budgets, vendors, stale approvals, legacy IDs  [compiled]
2.4s  TOOLS      14 functions: search_vendor ✓ create_po ✓ route_approval ✓ match_invoice ✓  [unit-tested]
3.6s  GRAPH      14 nodes, 23 edges (strong: 8, weak: 10, independent: 5)  [connected]
4.8s  TASKS      47 tasks (graph-walk: 31, programmatic: 16), ground truth recorded  [grounded]
6.0s  VCODE      47 verifiers, 3-12 assertions each: DB state checks, no LLM judge  [attached]
7.2s  QUALITY    5-run baseline gate: 42 accepted (2-4/5) · 3 too easy · 2 too hard  [filtered]
8.4s  GYM READY  enterprise_procurement sealed: deterministic reset · content-addressed · immutable  [✓ deployed]

× 1,978 environments generated across 20 verticals, zero manual effort

World creation at scale

One real system becomes thousands of executable worlds.

The anchor locks down the meaning of a production system: its data, policies, tools, and failure modes. The factory branches from that source into controlled variants, tasks, and verifiers without hand-authoring every mock.

01 Anchor locked: roles + schemas + policies
02 State layered: stale rows + edge records
03 Variants branched: policy x persona x seed
04 Rewards sealed: VCode + reset snapshots

agent calls tools
MCP: tool contract
Schema: entities
Policy: guardrails
Graph: tool path
State DB: mutations
VCode: reward

job_anchor.lock: prompt + PRD + MCP + docs + repo + policy failures

branch axes
policy: strict approval limits
persona: busy operations lead
seed: stale vendor record

state evolves after every action
tool_call: route_approval()
state_diff: approval_log +1
next_obs: exception pending

compiled worlds: 20,000
VCode: 8/8 pass

lock: enterprise_job_v7
branch: policies x personas x seeds
emit: tools + tasks + VCode + reset state

01
Anchor

Lock the source

Capture the real system once: schema, roles, policies, tools, and known failure modes.

source -> invariant spec
02
Branch

Sweep the variants

Vary seeds, personas, permissions, stale records, seasonal rules, and edge cases.

policies x personas x seeds
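
The "policies x personas x seeds" sweep is a Cartesian product over the branch axes. The sketch below enumerates it with `itertools.product`; apart from "strict approval limits" and "busy operations lead", which appear on this page, the axis values and the `variant_id` format are made up for the example.

```python
from itertools import product

# Branch axes; values beyond the two named on this page are hypothetical.
policies = ["strict_approval_limits", "lenient_approval_limits"]
personas = ["busy_operations_lead", "new_hire"]
seeds = [0, 1, 2]

# One variant per (policy, persona, seed) combination.
variants = [
    {"variant_id": f"v{idx:03d}", "policy": policy, "persona": persona, "seed": seed}
    for idx, (policy, persona, seed) in enumerate(product(policies, personas, seeds))
]
```

Two policies, two personas, and three seeds already yield 12 variants from a single anchor, and each added axis value multiplies the count.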
03
Compile

Emit runnable worlds

Generate reset state, executable tools, grounded tasks, answer keys, and rollout manifests.

state + tools + tasks
04
Verify

Attach deterministic rewards

Score database diffs and tool side effects with VCode assertions instead of LLM judgment.

task -> verifier
1 source anchor
40+ world families
2,000+ training variants
20,000 executable worlds

World mock intake

User context becomes a Dockerized mock world.

Start from the artifacts your team already has: PRDs, MCP servers, API docs, GitHub repos, and optional company data. The factory compiles them into mock apps, assets, tasks, and deterministic verifiers that can run as isolated training worlds.

User inputs
PRD: flows, roles, acceptance criteria (required)
MCP: tool contracts and side effects (required)
API docs: schemas, endpoints, auth states (required)
GitHub repo: business logic and edge paths (optional)
Company data: seed records and assets (optional)

World compiler: normalize - synthesize - package
anchor.lock - world.ir - docker build

Generated artifacts
Docker world: containerized reset state
Mock app + assets: UI, fixtures, files, media
Task packs: grounded goals and answer keys
Verifier: VCode reward and DB diffs

build: docker run bf/world:<job_id>
mount: app assets + fixtures + seeded data
emit: tasks + reset snapshots + VCode verifier

01

Upload the source material

PRD, MCP, API docs, repo logic, and optional seed data are normalized into a domain anchor.

02

Compile the mock world

The factory generates Dockerized services, mock UI assets, state layers, tool contracts, and failure-heavy variants.

03

Ship RL-ready artifacts

Every world includes tasks, answer keys, reset snapshots, and VCode rewards so rollouts have deterministic feedback.

Autonomous environment factory

From enterprise evidence to rollout-ready RL worlds.

Blobfish compiles domain artifacts into environment packs: databases, tools, tasks, reward code, and diagnostics — then self-evolves against your model’s weaknesses.

Discover

Mine real enterprise structure

Deep research agents crawl real-world data structures: pricing, catalogs, policy docs, workflow examples, hiring posts, personas, and tool documentation. Then layer complexity until the world mirrors production chaos.

entities · policies · messy data
Synthesize

Compile tools and task graphs

Coding agents write real Python tools with unit tests. A weighted dependency graph connects them (strong: 3, weak: 2, independent: 1). Graph walks and programmatic synthesis produce tasks with proven ground truth.
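
A graph walk over a weighted dependency graph can be sketched as a weighted random walk, where stronger dependencies are followed more often. The edge weights (strong 3, weak 2, independent 1) and the tool names come from this page; the specific edges, the `graph_walk` function, and the sampling rule are assumptions for illustration.

```python
import random
from typing import Dict, List, Tuple

EDGE_WEIGHTS = {"strong": 3, "weak": 2, "independent": 1}

# Hypothetical tool graph: tool -> list of (next_tool, edge_type).
GRAPH: Dict[str, List[Tuple[str, str]]] = {
    "search_vendor": [("create_po", "strong"), ("match_invoice", "independent")],
    "create_po": [("route_approval", "strong"), ("match_invoice", "weak")],
    "route_approval": [("match_invoice", "weak")],
    "match_invoice": [],
}


def graph_walk(
    graph: Dict[str, List[Tuple[str, str]]],
    start: str,
    max_steps: int,
    rng: random.Random,
) -> List[str]:
    """Sample a tool sequence, favoring edges with higher dependency weight."""
    path = [start]
    current = start
    for _ in range(max_steps - 1):
        edges = graph[current]
        if not edges:
            break  # reached a tool with no outgoing dependencies
        targets = [target for target, _ in edges]
        weights = [EDGE_WEIGHTS[kind] for _, kind in edges]
        current = rng.choices(targets, weights=weights, k=1)[0]
        path.append(current)
    return path


# A seeded generator makes the sampled task skeleton reproducible.
path = graph_walk(GRAPH, "search_vendor", max_steps=4, rng=random.Random(0))
```

Each sampled path is a candidate tool sequence; executing it in a sandbox and recording the resulting state is what would turn it into a task with ground truth.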

tools · tasks · variants
Verify

Attach executable rewards

VCode scripts check what the agent did, not what it said. Direct database assertions verify rows, ledgers, permissions, tool side effects, and final state. 3–12 binary assertions per task. No LLM judgment.
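
A verifier in this style can be pictured as a list of boolean checks over the final database. The sketch below assumes the reward is 1 only when every assertion passes; the table names, the specific checks, and the `verify` signature are hypothetical, not the actual VCode format.

```python
import sqlite3
from typing import Callable, List

Assertion = Callable[[sqlite3.Connection], bool]


def build_state(approved: bool) -> sqlite3.Connection:
    """Create a toy final state; tables and values are illustrative."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE purchase_orders (id INTEGER PRIMARY KEY, vendor_id INTEGER, amount_cents INTEGER);
        CREATE TABLE approval_log (po_id INTEGER, approver TEXT, status TEXT);
        INSERT INTO purchase_orders VALUES (1, 7, 250000);
        """
    )
    if approved:
        conn.execute("INSERT INTO approval_log VALUES (1, 'delegate_42', 'approved')")
    conn.commit()
    return conn


def po_created(conn: sqlite3.Connection) -> bool:
    count = conn.execute("SELECT COUNT(*) FROM purchase_orders WHERE vendor_id = 7").fetchone()[0]
    return count == 1


def amount_within_cap(conn: sqlite3.Connection) -> bool:
    row = conn.execute("SELECT amount_cents FROM purchase_orders WHERE id = 1").fetchone()
    return row is not None and row[0] <= 500000


def approval_recorded(conn: sqlite3.Connection) -> bool:
    row = conn.execute("SELECT status FROM approval_log WHERE po_id = 1").fetchone()
    return row is not None and row[0] == "approved"


def verify(conn: sqlite3.Connection, assertions: List[Assertion]) -> int:
    """Binary reward: 1 only if every assertion holds on the final state."""
    return int(all(check(conn) for check in assertions))


CHECKS = [po_created, amount_within_cap, approval_recorded]
reward_ok = verify(build_state(approved=True), CHECKS)
reward_missing = verify(build_state(approved=False), CHECKS)
```

Because every check reads database rows rather than the agent's text, a confident but wrong final message earns nothing.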

binary rewards · lineage
Evolve

Target model weaknesses

Arena diagnosis reads all failure traces and outputs ranked weaknesses: brittle parameter passing, missed policy dependencies, wrong IDs, or stale state assumptions. The factory produces targeted environments for the next curriculum wave.
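
Once failure traces carry a category label, ranking weaknesses reduces to counting. The sketch below is a deliberately simple stand-in for arena diagnosis: it assumes each trace already has a `failure_tag` field (a hypothetical name) and does not show how traces get labeled in the first place.

```python
from collections import Counter
from typing import Dict, List, Tuple


def rank_weaknesses(failure_traces: List[Dict[str, str]]) -> List[Tuple[str, float]]:
    """Return (tag, share_of_failures) pairs, most frequent first."""
    counts = Counter(trace["failure_tag"] for trace in failure_traces)
    total = sum(counts.values())
    return [(tag, n / total) for tag, n in counts.most_common()]


# Hypothetical labeled traces using the weakness categories named above.
traces = [
    {"task": "task_001", "failure_tag": "wrong_ids"},
    {"task": "task_004", "failure_tag": "missed_policy_dependency"},
    {"task": "task_009", "failure_tag": "wrong_ids"},
    {"task": "task_012", "failure_tag": "stale_state_assumption"},
    {"task": "task_020", "failure_tag": "wrong_ids"},
    {"task": "task_031", "failure_tag": "missed_policy_dependency"},
]
ranked = rank_weaknesses(traces)
```

The top-ranked tags would then decide which edge cases the next wave of environments emphasizes.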

curriculum updates
Executable rewards

No LLM-as-judge in the reward path.

VCode verifies what the agent did, not whether the answer sounds plausible. Every task ships with direct state assertions and recorded ground truth from a live sandbox run.

observation = env.reset(variant_ref, task_ref)
result = env.call_tool(name, arguments)
snapshot = env.snapshot()
reward = vcode.verify(
    initial_state,
    final_state,
    trajectory,
)

Input: variant_ref + task_ref + seed
Action: agent calls tools against live state
Snapshot: initial state, final state, trajectory
Reward: VCode assertions over database diffs

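
A database diff between the initial and final snapshot can be computed with plain SQLite queries. This is a minimal sketch under assumptions: `snapshot` and `state_diff` are hypothetical helpers, not the `env.snapshot()` call shown above, and rows are compared as whole tuples, so exact duplicate rows in a table would collapse.

```python
import sqlite3
from typing import Dict, List, Set, Tuple

Snapshot = Dict[str, Set[Tuple]]


def snapshot(conn: sqlite3.Connection) -> Snapshot:
    """Capture every table as a set of row tuples."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    return {name: set(conn.execute(f'SELECT * FROM "{name}"')) for name in names}


def state_diff(before: Snapshot, after: Snapshot) -> Dict[str, Dict[str, List[Tuple]]]:
    """List rows added and removed per table; unchanged tables are omitted."""
    diff: Dict[str, Dict[str, List[Tuple]]] = {}
    for table in sorted(set(before) | set(after)):
        added = after.get(table, set()) - before.get(table, set())
        removed = before.get(table, set()) - after.get(table, set())
        if added or removed:
            diff[table] = {"added": sorted(added), "removed": sorted(removed)}
    return diff


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE approval_log (id INTEGER PRIMARY KEY, po_id TEXT, status TEXT)")
initial_state = snapshot(conn)
# Stand-in for an agent tool call such as route_approval().
conn.execute("INSERT INTO approval_log (po_id, status) VALUES ('po-1', 'pending')")
final_state = snapshot(conn)
diff = state_diff(initial_state, final_state)
```

A verifier can then assert on the diff itself, for example that exactly one approval row was added and nothing else changed.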
What an environment looks like

Every environment is a complete, executable world.

Not a prompt template. Not a mock API. A real SQLite database with messy data, executable Python tools, a dependency graph, grounded tasks, and VCode verification scripts.

enterprise_world/
├── state.db          # SQLite: 12 tables, policy-heavy state
├── tools/
│   ├── search_records.py     # 94 lines, 8 params
│   ├── create_request.py     # 127 lines, validates policy
│   ├── route_approval.py     # 86 lines, delegated approver logic
│   └── ...                   # 14 tools total, all unit-tested
├── graph.json        # weighted dependency DAG
├── tasks/
│   ├── task_001.json         # "Find vendor, create PO, route approval"
│   ├── task_002.json         # "Resolve policy exception and match invoice"
│   └── ...                   # 47 grounded tasks
├── vcode/
│   ├── verify_001.py         # 8 assertions on final DB state
│   └── ...                   # binary reward per task
└── manifest.json     # content-addressed, parent-linked

Database: 12 tables, 2,400 rows, policy exceptions, stale records, delegated roles
Tools: 14 Python functions with unit tests, >50% test accuracy required
Graph: Weighted DAG with strong (3), weak (2), and independent (1) edges
Tasks: 47 grounded tasks from graph walks + programmatic synthesis
VCode: 3–12 binary assertions per task, checks DB state not LLM output
Quality: Baseline agent scores 2–4/5, solvable but challenging

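
"Content-addressed, parent-linked" can be read as: the manifest ID is a hash of the environment's file contents plus a pointer to the environment it was derived from. The sketch below shows one way to do that with SHA-256; the manifest fields and the `build_manifest` helper are assumptions, not the actual `manifest.json` schema.

```python
import hashlib
import json
from typing import Dict, Optional


def build_manifest(files: Dict[str, bytes], parent: Optional[str] = None) -> Dict:
    """Hash each file, then hash the canonical manifest body to get its ID."""
    file_hashes = {
        path: hashlib.sha256(data).hexdigest() for path, data in sorted(files.items())
    }
    body = {"files": file_hashes, "parent": parent}
    # Canonical JSON (sorted keys, fixed separators) keeps the ID stable.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {"id": hashlib.sha256(canonical).hexdigest(), **body}


# Hypothetical file contents; real environments would hash the actual bytes.
base = build_manifest({"state.db": b"seed-0", "graph.json": b"{}"})
same = build_manifest({"graph.json": b"{}", "state.db": b"seed-0"})
child = build_manifest({"state.db": b"seed-1", "graph.json": b"{}"}, parent=base["id"])
```

Identical contents always produce the same ID regardless of file order, and any change to state, tools, or parent yields a new one, which is what makes a sealed environment verifiable.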
Agent-World results · 23 benchmarks

Proven at scale. Published results.

Agent-World-14B, trained on environments produced by this factory, approaches frontier API models on agentic benchmarks — with a 14B parameter model.

61.8%
Agent-World-14B on τ²-Bench
Approaches frontier API models on real tool-use benchmarks
55.8%
Agent-World-14B on BFCL V4
Beats DeepSeek-V3.2-685B (54.1%) with a 14B model
+8.6 pts
Two self-evolution rounds on MCP-Mark
Arena diagnosis → targeted generation → measurable lift

Benchmark comparison

Model | Size | MCP-Mark | BFCL V4 | τ²-Bench
GPT-5.2 High | Proprietary | 53.1%
Claude Sonnet-4.5 | Proprietary
DeepSeek-V3.2-685B | 685B | 44.6% | 54.1% | 56.2%
Qwen3-14B | 14B | 25.3% | 45.9% | 42.3%
Agent-World-14B | 14B | 38.1% | 55.8% | 50.5%
Agent-World-8B | 8B | 28.4% | 49.1% | 61.8%

Scaling: environments → performance

Average score across agentic benchmarks as environment count grows from 0 to 2,000. +20.1 point lift.

Environments | Average score
0 | 18.4%
250 | 23.2%
500 | 27.8%
1K | 32.1%
1.5K | 35.6%
2K | 38.5%
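
The arithmetic behind the headline lift can be checked directly from the six data points above. The sketch below recomputes the total lift and the gain per additional 250 environments in each interval; the variable names are illustrative.

```python
# (environment count, average benchmark score in %) from the scaling data above.
scaling = [(0, 18.4), (250, 23.2), (500, 27.8), (1000, 32.1), (1500, 35.6), (2000, 38.5)]

# Total lift from 0 to 2,000 environments, in percentage points.
total_lift = round(scaling[-1][1] - scaling[0][1], 1)

# Gain per additional 250 environments within each interval.
gain_per_250 = [
    round((score_b - score_a) / ((count_b - count_a) / 250), 2)
    for (count_a, score_a), (count_b, score_b) in zip(scaling, scaling[1:])
]
```

The per-interval gains shrink as the count grows, so the curve is still rising at 2,000 environments but with diminishing returns per environment added.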

Self-evolution (Agent-World-14B)

Each round: arena evaluates, diagnosis targets weaknesses, factory generates targeted environments, model trains again.

Round | MCP-Mark | BFCL V4 | τ²-Bench
Base | 29.5% | 52.4% | 45.3%
+1 round | 36.3% (+6.8) | 54.9% (+2.5) | 48.6% (+3.3)
+2 rounds | 38.1% (+1.8) | 55.8% (+0.9) | 50.5% (+1.9)

20 categories · 50 subcategories · 1,978 environments

Coverage across enterprise verticals.

Environments span 20 primary industry categories, each with subcategories and domain-specific complexity layers. Three input sources: MCP servers, tool documentation, and industry PRDs.

01 Travel & Booking
02 Finance & Banking
03 Healthcare Ops
04 E-commerce
05 Legal & Compliance
06 HR & Payroll
07 Search & Retrieval
08 Document & Design
09 Social Media
10 Communication
11 System & Cloud
12 DevOps & CI/CD
13 Education
14 Real Estate
15 Manufacturing
16 Logistics
17 Insurance
18 Government
19 Energy & Utilities
20 Food & Hospitality

Citation

BibTeX

@article{dong2026agent,
  title     = {Agent-World: Scaling Real-World Environment
               Synthesis for Evolving General Agent
               Intelligence},
  author    = {Dong, Guanting and Lu, Junting and
               Huang, Junjie and Zhong, Wanjun and
               others},
  journal   = {arXiv preprint arXiv:2604.18292},
  year      = {2026}
}

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence. Dong et al., 2026. Renmin University of China & ByteDance Seed.

Frontier lab pilot

3 verticals. 150 environments. 4 weeks.

We package the environments, tasks, tools, and verifiers for your existing RL loop. You measure lift on your internal agent evals and decide whether to scale to continuous factory integration.

Scope the pilot
3 verticals, 50 environments each
SQLite-backed state with deterministic reset
Generated Python tools with unit tests
Grounded task sets with hidden answer keys
VCode verification scripts (3–12 assertions)
Gym-compatible rollout interface