RL environment factory for frontier labs
You’ve solved reasoning and code. The next unlock is agents that reliably operate enterprise software — CRMs, ERPs, legal systems, procurement tools. The blocker isn’t your RL algorithms. It’s environment supply.
Frontier labs can build RL environments internally. But there are three constraints you keep hitting — and each one limits how fast your agents learn to operate real enterprise software.
Internal mocks are CRUD over SQLite. Real enterprise software has seasonal pricing, loyalty programs, partial refunds, multi-currency, legacy data formats, and conflicting policies. Agents trained on clean mocks fail in messy production.
Your research team can hand-build 20–50 environments. Agent-World showed that scaling from 100 to 2,000 environments produces a +20.1 point lift on agentic benchmarks. You need a factory, not a services team.
LLM-as-judge rewards are noisy, gameable, and not grounded. Your RL loop needs binary, executable signals: did the right state change happen, in the right system, under the right policy?
Three properties that define useful RL environments — and that manual construction cannot deliver at the scale frontier training demands.
Make every simulation error-heavy and edge-case-dense. Inject partial refunds that break state machines, policy conflicts that require judgment, legacy date formats that crash parsers. A quality gate runs 5 baseline attempts — only tasks scoring 2–4/5 enter your training set. Every environment is a surgical stress test targeting your model’s weaknesses.
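The quality gate described above can be sketched in a few lines. This is a minimal illustration, assuming "scoring 2–4/5" means the baseline solved the task in two to four of its five attempts; the function names and the task data are hypothetical, not part of the factory's actual interface.

```python
from typing import Dict, List


def passes_quality_gate(outcomes: List[bool], low: int = 2, high: int = 4) -> bool:
    """Keep a task only if the baseline solved it between `low` and `high` times."""
    solved = sum(outcomes)
    return low <= solved <= high


def filter_tasks(baseline_runs: Dict[str, List[bool]]) -> List[str]:
    """Return ids of tasks whose five baseline attempts land in the 2-4 band."""
    kept = []
    for task_id, outcomes in baseline_runs.items():
        if len(outcomes) != 5:
            raise ValueError(f"{task_id}: expected 5 attempts, got {len(outcomes)}")
        if passes_quality_gate(outcomes):
            kept.append(task_id)
    return kept


# Hypothetical baseline results: True means the attempt was verified as solved.
runs = {
    "task_001": [True, True, True, True, True],       # 5/5: too easy, dropped
    "task_002": [True, False, True, False, False],    # 2/5: kept
    "task_003": [False, False, False, False, False],  # 0/5: too hard or broken, dropped
    "task_004": [True, True, True, True, False],      # 4/5: kept
}
print(filter_tasks(runs))  # ['task_002', 'task_004']
```

Dropping 0/5 and 1/5 tasks filters out broken or unreachable tasks, and dropping 5/5 tasks removes ones that carry no learning signal.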
No handcrafted mocks. No domain experts in the loop. No manual metrics. The factory compiles MCP specs, tool documentation, and industry data into complete environment families — databases, executable tools with unit tests, dependency graphs, grounded tasks, and VCode verifiers — autonomously and continuously.
Environments aren’t static fixtures. Database state is layered with seasonal pricing, loyalty tiers, conflicting policies, and stale records. Seed the data to match the production chaos you anticipate. Each model action mutates real SQLite state, so the complexity compounds under agent pressure, just like production.
Three commands. From your enterprise documentation to a Docker-based synthetic gym with executable rewards — ready for RL training or SFT data generation.
An anchor is any enterprise artifact: an MCP server spec, a PRD, API documentation, database schemas, or tool definitions. The factory ingests it and mines realistic data structures.
$ blobfish upload \
--prompt "Procurement approvals for vendor onboarding" \
--prd procurement_workflows.pdf \
--docs erp_api_reference/ \
--repo github.com/acme/procurement-ops
✓ Anchor ingested: prompt, PRD, docs, repo
✓ Domain model compiled: enterprise_procurement
✓ Complexity layers: budget caps, delegated approvers, stale vendors
The factory compiles your anchor into complete, executable environments, each a Docker image with SQLite state, Python tools, dependency graphs, grounded tasks, and VCode verifiers.
$ blobfish generate \
--domain enterprise_procurement \
--variants 50 \
--complexity high
Building environments...
├── Generating 14 tools with unit tests ✓
├── Building weighted dependency graph ✓
├── Synthesizing 47 tasks (graph + prog) ✓
├── Attaching VCode verifiers (3-12 each) ✓
├── Running 5-run quality gate ✓
└── Packaging 50 Docker gym images ✓
✓ 50 gym images pushed to registry
Each rollout gets an independent gym instance. The agent calls tools against real SQLite state. VCode computes binary rewards from database diffs. GRPO updates the policy.
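To make the reward-to-update step concrete, here is a minimal sketch of the group-relative advantage that GRPO derives from a group of rollouts on the same task. It assumes binary rewards and normalizes by the population standard deviation. Implementations differ on that detail, and nothing here reflects how the blobfish trainer is actually written.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own group (same task, same policy)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight rollouts of one task, each scored 0 or 1 by the verifier.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
for reward, advantage in zip(rewards, group_relative_advantages(rewards)):
    print(f"reward={reward:.0f} advantage={advantage:+.3f}")
```

A group where every rollout succeeds or every rollout fails yields all-zero advantages, which is one reason a difficulty band on tasks matters for binary rewards.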
$ blobfish train \
--gym enterprise_procurement:latest \
--algo grpo \
--model qwen3-14b \
--rollout-workers 8
Epoch 1/10: 400 rollouts across 50 variants
reward_mean: 0.34 → 0.61
tasks_solved: 142/400
checkpoint: gs://models/procurement-grpo-e1.pt
Record expert trajectories for fine-tuning smaller models without running RL. Export as SFT pairs or DPO triples with rejected trajectories.
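One way to picture the export is the sketch below, which turns verified and failed trajectories into SFT pairs and DPO triples. The record fields (`task_id`, `prompt`, `completion`, `verified`) and the pairing rule (each failed trajectory is matched with the first verified trajectory for the same task) are assumptions for illustration; the actual export schema is not specified here.

```python
import json
from typing import Dict, List, Tuple


def build_sft_and_dpo(trajectories: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Split trajectories into SFT pairs (verified only) and DPO triples."""
    sft = [
        {"prompt": t["prompt"], "completion": t["completion"]}
        for t in trajectories
        if t["verified"]
    ]

    # Assumed pairing rule: first verified trajectory per task is the "chosen" side.
    chosen_by_task: Dict[str, Dict] = {}
    for t in trajectories:
        if t["verified"]:
            chosen_by_task.setdefault(t["task_id"], t)

    dpo = []
    for t in trajectories:
        if not t["verified"] and t["task_id"] in chosen_by_task:
            chosen = chosen_by_task[t["task_id"]]
            dpo.append(
                {
                    "prompt": t["prompt"],
                    "chosen": chosen["completion"],
                    "rejected": t["completion"],
                }
            )
    return sft, dpo


# Hypothetical trajectories; "verified" stands in for the verifier's binary verdict.
trajectories = [
    {"task_id": "task_001", "prompt": "Route PO-1", "completion": "A", "verified": True},
    {"task_id": "task_001", "prompt": "Route PO-1", "completion": "B", "verified": False},
    {"task_id": "task_002", "prompt": "Match invoice", "completion": "C", "verified": False},
]
sft, dpo = build_sft_and_dpo(trajectories)
for record in sft + dpo:
    print(json.dumps(record))  # one JSON object per line, as in a .jsonl file
```

Under this pairing rule a failed trajectory with no verified counterpart for its task (task_002 above) is dropped, so the number of triples can be lower than the number of failures.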
$ blobfish distill \
--gym enterprise_procurement:latest \
--teacher claude-sonnet-4-5 \
--episodes 10000 \
--format sft,dpo
Recording expert trajectories...
├── 10,000 episodes completed
├── 8,247 successful (verified by VCode)
├── 1,753 failed (used for DPO rejected)
└── Exported:
sft_enterprise_procurement.jsonl (8,247 pairs)
dpo_enterprise_procurement.jsonl (1,753 triples)
From raw MCP spec to rollout-ready gym in a single pipeline run. No human in the loop. Then the factory repeats this 1,978 times across 20 industry verticals.
The anchor locks down the meaning of a production system: its data, policies, tools, and failure modes. The factory branches from that source into controlled variants, tasks, and verifiers without hand-authoring every mock.
route_approval() | approval_log +1 | exception pending
Capture the real system once: schema, roles, policies, tools, and known failure modes.
source -> invariant spec
Vary seeds, personas, permissions, stale records, seasonal rules, and edge cases.
policies x personas x seeds
Generate reset state, executable tools, grounded tasks, answer keys, and rollout manifests.
state + tools + tasks
Score database diffs and tool side effects with VCode assertions instead of LLM judgment.
task -> verifier
Start from the artifacts your team already has: PRDs, MCP servers, API docs, GitHub repos, and optional company data. The factory compiles them into mock apps, assets, tasks, and deterministic verifiers that can run as isolated training worlds.
anchor.lock - world.ir - docker build
PRD, MCP, API docs, repo logic, and optional seed data are normalized into a domain anchor.
The factory generates Dockerized services, mock UI assets, state layers, tool contracts, and failure-heavy variants.
Every world includes tasks, answer keys, reset snapshots, and VCode rewards so rollouts have deterministic feedback.
Blobfish compiles domain artifacts into environment packs: databases, tools, tasks, reward code, and diagnostics — then self-evolves against your model’s weaknesses.
Deep research agents crawl real-world data structures: pricing, catalogs, policy docs, workflow examples, hiring posts, personas, and tool documentation. They then layer complexity until the world mirrors production chaos.
Coding agents write real Python tools with unit tests. A weighted dependency graph connects them (strong: 3, weak: 2, independent: 1). Graph walks and programmatic synthesis produce tasks with proven ground truth.
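A graph walk over the weighted dependency graph might look like the sketch below. Only the weights (strong: 3, weak: 2, independent: 1) come from the text above. The edges, the `match_invoice` tool name, and the rule that the next tool is sampled in proportion to edge weight are assumptions made for illustration.

```python
import random
from typing import Dict, List

# Edge weights as stated in the text.
WEIGHTS = {"strong": 3, "weak": 2, "independent": 1}

# Hypothetical acyclic tool graph: tool -> {next tool: dependency strength}.
GRAPH: Dict[str, Dict[str, str]] = {
    "search_records": {
        "create_request": "strong",
        "route_approval": "weak",
        "match_invoice": "independent",
    },
    "create_request": {"route_approval": "strong"},
    "route_approval": {"match_invoice": "weak"},
    "match_invoice": {},
}


def weighted_walk(
    graph: Dict[str, Dict[str, str]], start: str, steps: int, rng: random.Random
) -> List[str]:
    """Walk up to `steps` edges, sampling each next tool in proportion to edge weight."""
    path = [start]
    current = start
    for _ in range(steps):
        neighbors = graph.get(current, {})
        if not neighbors:
            break  # reached a tool with no outgoing dependencies
        candidates = list(neighbors)
        weights = [WEIGHTS[neighbors[tool]] for tool in candidates]
        current = rng.choices(candidates, weights=weights, k=1)[0]
        path.append(current)
    return path


rng = random.Random(0)  # seeded so a generated task skeleton is reproducible
print(weighted_walk(GRAPH, "search_records", steps=3, rng=rng))
```

Each sampled path is a tool sequence that a task can be written around. Strongly coupled tools are chained more often, while independent pairs still appear occasionally.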
VCode scripts check what the agent did, not what it said. Direct database assertions verify rows, ledgers, permissions, tool side effects, and final state. 3–12 binary assertions per task. No LLM judgment.
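The following sketch shows what a verifier of this kind could look like using Python's standard `sqlite3` module. It is not actual VCode: the table names, column names, and IDs are hypothetical, and the reward rule (1 only if every assertion holds) is one plausible reading of "binary".

```python
import sqlite3
from typing import Callable, List

Assertion = Callable[[sqlite3.Connection], bool]


def po_pending_approval(conn: sqlite3.Connection) -> bool:
    """Exactly one purchase order for the vendor is waiting on approval."""
    row = conn.execute(
        "SELECT COUNT(*) FROM purchase_orders "
        "WHERE vendor_id = ? AND status = 'pending_approval'",
        ("V-17",),
    ).fetchone()
    return row[0] == 1


def no_duplicate_orders(conn: sqlite3.Connection) -> bool:
    """The agent did not create extra purchase orders along the way."""
    return conn.execute("SELECT COUNT(*) FROM purchase_orders").fetchone()[0] == 1


def approval_routed_to_delegate(conn: sqlite3.Connection) -> bool:
    """The approval log records the delegated approver for this order."""
    row = conn.execute(
        "SELECT approver FROM approval_log WHERE po_id = ?", ("PO-1",)
    ).fetchone()
    return row is not None and row[0] == "delegate_cfo"


def verify(conn: sqlite3.Connection, assertions: List[Assertion]) -> int:
    """Binary reward: 1 only if every assertion holds on the final database state."""
    return int(all(check(conn) for check in assertions))


CHECKS = [po_pending_approval, no_duplicate_orders, approval_routed_to_delegate]

# Hypothetical final state, built in memory for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE purchase_orders (po_id TEXT PRIMARY KEY, vendor_id TEXT, status TEXT);
    CREATE TABLE approval_log (po_id TEXT, approver TEXT);
    INSERT INTO purchase_orders VALUES ('PO-1', 'V-17', 'pending_approval');
    """
)
reward_before = verify(conn, CHECKS)  # approval not yet routed
conn.execute("INSERT INTO approval_log VALUES ('PO-1', 'delegate_cfo')")
reward_after = verify(conn, CHECKS)
print(reward_before, reward_after)  # 0 1
```

Because every check reads rows rather than the agent's transcript, a trajectory that claims success without writing the approval row still scores 0.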
Arena diagnosis reads all failure traces and outputs ranked weaknesses: brittle parameter passing, missed policy dependencies, wrong IDs, or stale state assumptions. The factory produces targeted environments for the next curriculum wave.
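The ranking step can be illustrated with a simple frequency count. This sketch assumes each failure trace has already been labeled with weakness tags; how the arena assigns those labels is not described here, and the trace structure is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def rank_weaknesses(failure_traces: List[Dict]) -> List[Tuple[str, int]]:
    """Count weakness tags across failed rollouts and return them most-frequent first."""
    counts = Counter(tag for trace in failure_traces for tag in trace["tags"])
    return counts.most_common()


# Hypothetical labeled traces from failed rollouts.
traces = [
    {"task": "task_001", "tags": ["wrong_id", "stale_state"]},
    {"task": "task_002", "tags": ["wrong_id"]},
    {"task": "task_007", "tags": ["wrong_id", "missed_policy_dependency"]},
    {"task": "task_011", "tags": ["stale_state"]},
]
print(rank_weaknesses(traces))
# [('wrong_id', 3), ('stale_state', 2), ('missed_policy_dependency', 1)]
```

The top-ranked tags would then decide which kinds of environments the next curriculum wave emphasizes.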
VCode verifies what the agent did, not whether the answer sounds plausible. Every task ships with direct state assertions and recorded ground truth from a live sandbox run.
observation = env.reset(variant_ref, task_ref)
result = env.call_tool(name, arguments)
snapshot = env.snapshot()
reward = vcode.verify(
    initial_state,
    final_state,
    trajectory,
)
Not a prompt template. Not a mock API. A real SQLite database with messy data, executable Python tools, a dependency graph, grounded tasks, and VCode verification scripts.
enterprise_world/
├── state.db # SQLite: 12 tables, policy-heavy state
├── tools/
│ ├── search_records.py # 94 lines, 8 params
│ ├── create_request.py # 127 lines, validates policy
│ ├── route_approval.py # 86 lines, delegated approver logic
│ └── ... # 14 tools total, all unit-tested
├── graph.json # weighted dependency DAG
├── tasks/
│ ├── task_001.json # "Find vendor, create PO, route approval"
│ ├── task_002.json # "Resolve policy exception and match invoice"
│ └── ... # 47 grounded tasks
├── vcode/
│ ├── verify_001.py # 8 assertions on final DB state
│ └── ... # binary reward per task
└── manifest.json # content-addressed, parent-linked
Agent-World-14B, trained on environments produced by this factory, approaches frontier API models on agentic benchmarks with a 14B-parameter model.
Average score across agentic benchmarks as environment count grows from 0 to 2,000. +20.1 point lift.
Each round: arena evaluates, diagnosis targets weaknesses, factory generates targeted environments, model trains again.
Environments span 20 primary industry categories, each with subcategories and domain-specific complexity layers. Three input sources: MCP servers, tool documentation, and industry PRDs.
@article{dong2026agent,
  title   = {Agent-World: Scaling Real-World Environment
             Synthesis for Evolving General Agent
             Intelligence},
  author  = {Dong, Guanting and Lu, Junting and
             Huang, Junjie and Zhong, Wanjun and
             others},
  journal = {arXiv preprint arXiv:2604.18292},
  year    = {2026}
}
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence. Dong et al., 2026. Renmin University of China & ByteDance Seed.
We package the environments, tasks, tools, and verifiers for your existing RL loop. You measure lift on your internal agent evals and decide whether to scale to continuous factory integration.
Scope the pilot