ForgeEnv
A self-improving RL environment that teaches LLMs to fix HuggingFace training scripts as the ecosystem evolves.
ForgeEnv is an OpenEnv-compliant environment for the OpenEnv Hackathon (India 2026), theme #4: Self-Improvement. Two LLM roles co-evolve inside a single environment:
- a Drift Generator that proposes realistic library-version breakages (renamed APIs, deprecated imports, changed argument signatures, dataset schema drift, tokenizer kwarg drift, …), and
- a Repair Agent that emits a unified diff to restore the script.
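As an illustration of the Repair Agent's output format, the sketch below builds a unified diff with Python's standard `difflib`. The script fragment, the `make_unified_diff` helper, and the `train.py` path are hypothetical; the kwarg rename is used purely as an example of drift, not as a description of what ForgeEnv generates.

```python
import difflib

# Hypothetical broken fragment: the script still uses an old kwarg name.
broken = '''args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="epoch",
)
'''

# Hypothetical repair: switch to the renamed kwarg, touch nothing else.
repaired = broken.replace("evaluation_strategy", "eval_strategy")


def make_unified_diff(before: str, after: str, path: str = "train.py") -> str:
    """Return a unified diff that turns `before` into `after`."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


patch = make_unified_diff(broken, repaired)
print(patch)
```

A one-line change like this yields a single hunk, which is the kind of small patch the minimality component is meant to favor.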
The reward is multi-component (execution and AST checks for training, plus a held-out evaluator reserved for evaluation). This produces a rich gradient and makes reward hacking expensive, following the recommendations in the Hackathon Self-Serve Guide.
Why it matters
LLM agents that write training code today are silently broken by HF library
upgrades: a method such as Trainer.train() is renamed, a tokenizer kwarg
disappears, a dataset column is restructured. Today, humans patch these
breakages. ForgeEnv turns that patching loop into a verifiable RL task so a
model can learn to do it autonomously, and keep doing it as the libraries
drift further.
Live links
| Artifact | URL |
|---|---|
| Environment Space (Docker) | https://huggingface.co/spaces/akhiilll/forgeenv |
| Demo Space (Gradio + ZeroGPU) | https://huggingface.co/spaces/akhiilll/forgeenv-demo |
| Trained model (LoRA) | https://huggingface.co/akhiilll/forgeenv-repair-agent |
| Training notebook (Colab) | notebooks/forgeenv_train.ipynb |
Architecture
┌──────────────────┐
│ Teacher          │  curriculum →
│ (deterministic)  │  {RenameApiCall, DeprecateImport, …}
└──────────────────┘
         │ target_category
         ▼
┌────────────────────────────────────────────────────────────────────┐
│ ForgeEnvironment (OpenEnv)                                         │
│   reset()              → drift_gen obs (script, target_category)   │
│   step(BreakageAction) → repair obs (broken_script, trace)         │
│   step(RepairAction)   → reward, breakdown, held-out scores        │
│                                                                    │
│   ┌───────────────────┐      ┌──────────────────────┐              │
│   │ Drift Generator   │      │ Repair Agent         │              │
│   │ (LLM, GRPO)       │      │ (LLM, GRPO + SFT)    │              │
│   └───────────────────┘      └──────────────────────┘              │
│                                                                    │
│   ┌───────────────────────────────────────────────────────┐        │
│   │ Simulator (AST + heuristic exec) + Visible Verifier   │        │
│   │ + Held-out Evaluator + Library Drift Engine           │        │
│   └───────────────────────────────────────────────────────┘        │
└────────────────────────────────────────────────────────────────────┘
The two-step episode flow (Phase 1 = drift, Phase 2 = repair) is exactly the Challenger / Solver loop from R-Zero, with role-switched prompts à la SPIRAL and Absolute Zero Reasoner.
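The two-phase flow can be sketched with a toy stand-in for the environment. Everything here is hypothetical: `ToyForgeEnv`, the fields on `BreakageAction` and `RepairAction`, and the exact-match reward are illustrative assumptions, not the real ForgeEnv or OpenEnv API (in particular, the real Repair Agent emits a diff and is scored by the multi-component reward).

```python
from dataclasses import dataclass


@dataclass
class BreakageAction:
    """Phase 1 action (toy): replace `old` with `new` in the script."""
    old: str
    new: str


@dataclass
class RepairAction:
    """Phase 2 action (toy): the full repaired script."""
    repaired_script: str


class ToyForgeEnv:
    """Hypothetical two-phase episode: one drift step, then one repair step."""

    def __init__(self, script: str, target_category: str):
        self.script = script
        self.target_category = target_category
        self.phase = "drift"
        self.broken_script = None

    def reset(self) -> dict:
        self.phase = "drift"
        self.broken_script = None
        return {"script": self.script, "target_category": self.target_category}

    def step(self, action) -> dict:
        if self.phase == "drift":
            if not isinstance(action, BreakageAction):
                raise TypeError("drift phase expects a BreakageAction")
            self.broken_script = self.script.replace(action.old, action.new)
            self.phase = "repair"
            return {"broken_script": self.broken_script, "done": False}
        if self.phase == "repair":
            if not isinstance(action, RepairAction):
                raise TypeError("repair phase expects a RepairAction")
            # Toy reward: exact restoration of the original script.
            reward = 1.0 if action.repaired_script == self.script else 0.0
            self.phase = "done"
            return {"reward": reward, "done": True}
        raise RuntimeError("episode finished; call reset()")


env = ToyForgeEnv("trainer.train()\n", "RenameApiCall")
drift_obs = env.reset()
repair_obs = env.step(BreakageAction(old=".train(", new=".fit("))
fixed = repair_obs["broken_script"].replace(".fit(", ".train(")
result = env.step(RepairAction(repaired_script=fixed))
print(repair_obs["broken_script"], result)
```

The same environment object serves both roles; only the expected action type changes between phases.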
Reward design
visible_reward
├─ execution_success (sandboxed run / heuristic simulator)
├─ ast_well_formed (parses + no forbidden globals)
├─ format_compliance (valid unified diff or full-script replacement)
├─ minimality (smaller diffs preferred, anti-rewrite)
└─ no_forbidden_globals (locked-down execution check)
held_out_evaluator (NOT used for training, used for evals only)
├─ executed_cleanly
├─ matches_target_api (semantic correctness)
└─ regression_free (other tests still pass)
The visible reward combines multiple independent components, and a held-out evaluator the trainer never sees scores the result separately, so the agent can't game its way to the top of the curve.
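One way to picture how the visible components might be combined is a weighted sum that also returns a per-component breakdown. This is a minimal sketch under stated assumptions: the equal weights, the `visible_reward` signature, and the line-similarity definition of `minimality` are hypothetical choices for illustration, not the project's actual reward code.

```python
import difflib

# Hypothetical equal weights; the real weighting is not specified here.
VISIBLE_WEIGHTS = {
    "execution_success": 1.0,
    "ast_well_formed": 1.0,
    "format_compliance": 1.0,
    "minimality": 1.0,
    "no_forbidden_globals": 1.0,
}


def minimality(broken: str, repaired: str) -> float:
    """Line-level similarity in [0, 1]; 1.0 means nothing was rewritten."""
    matcher = difflib.SequenceMatcher(
        None, broken.splitlines(), repaired.splitlines()
    )
    return matcher.ratio()


def visible_reward(components: dict) -> tuple:
    """Return (total, breakdown) for component scores in [0, 1]."""
    missing = set(VISIBLE_WEIGHTS) - set(components)
    if missing:
        raise KeyError(f"missing reward components: {sorted(missing)}")
    breakdown = {
        name: weight * components[name]
        for name, weight in VISIBLE_WEIGHTS.items()
    }
    return sum(breakdown.values()), breakdown


scores = {
    "execution_success": 1.0,
    "ast_well_formed": 1.0,
    "format_compliance": 1.0,
    "minimality": minimality("a\nb\nc\nd", "a\nb\nc\nX"),
    "no_forbidden_globals": 1.0,
}
total, breakdown = visible_reward(scores)
print(total, breakdown)
```

Returning the breakdown alongside the total makes it easier to spot a policy that maximizes one component at the expense of the others.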
Results (50 episodes / agent, oracle as upper-bound proxy for trained)
The trained configuration (warm-start SFT + GRPO, reported here through the oracle proxy) outperforms the no-op baseline on both metrics we track:
| Agent | Mean visible reward | Success rate (held-out exec) |
|---|---|---|
| Baseline (no-op) | 0.90 | 50 % |
| Trained (oracle) | 1.51 | 86 % |
Three plots (committed to artifacts/plots/):
- baseline_vs_trained.png: reward distribution, baseline vs trained.
- training_reward_curve.png: reward trajectory across episodes.
- success_by_category.png: per-primitive success rates.
A 43-entry repair_library.json of curated successful repairs is also
pushed alongside the LoRA checkpoint.
Quick start
# 1. install (env-only deps, no torch needed for the env itself)
pip install -e ".[openenv]"   # quotes keep shells like zsh from globbing the brackets
pip install -e ".[dev]"
# 2. run the test suite
pytest -q # 74 tests: full env + roles + reward + training
# 3. spin up the environment locally
uvicorn forgeenv.env.server:app --port 7860
# 4. generate the demo artifacts (plots + repair_library.json + eval JSON)
python scripts/generate_artifacts.py --n_baseline 50 --n_trained 50
# 5. push to HF Spaces
export HF_TOKEN=hf_...
python scripts/deploy_spaces.py --user akhiilll
Training (warm-start SFT + GRPO via TRL + Unsloth) lives entirely in
notebooks/forgeenv_train.ipynb. Open it on Colab with a T4 or A100 and
re-run end-to-end.
Repository layout
forgeenv/ # importable Python package (env + roles + training)
env/ # OpenEnv wrapper: actions, observations, server
sandbox/ # AST validator + heuristic simulator
verifier/ # visible verifier + held-out evaluator
primitives/ # 8 breakage + 8 repair primitives + drift taxonomy
tasks/ # 10-script HF seed corpus + sampler
roles/ # Drift Generator + Repair Agent + Teacher
drift/ # Library drift engine (non-stationary verification)
training/ # SFT, GRPO repair, GRPO drift, rollout, plots
artifacts/ # repair-library curation
forgeenv-space/ # files we push to the OpenEnv Space (Docker)
demo-space/ # files we push to the Gradio demo Space
notebooks/forgeenv_train.ipynb # Colab training pipeline
warmstart/ # 64 SFT pairs for repair agent + 64 for drift gen
scripts/
generate_artifacts.py # plots + eval_results.json + repair_library.json
deploy_spaces.py # one-shot push to HF Spaces
artifacts/ # generated plots + curated repair library
tests/ # 74 pytest tests
Anti-cheat / reward-hacking safeguards
Following the Hackathon Self-Serve Guide explicitly:
- Multiple independent reward functions (5 visible + 3 held-out).
- Held-out evaluator the trainer never sees, used only for plots.
- Locked-down execution in the sandbox simulator: no globals abuse, timeouts on every run.
- AST validator rejects forbidden constructs (network calls, os.system, etc.) before reward is computed.
- Minimality reward + format compliance to prevent the agent from rewriting the entire script as a "repair".
- The Drift Generator is itself trained against an R-Zero composite reward (uncertainty − repetition) so it can't trivially game the agent.
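The AST-validation safeguard can be sketched with the standard `ast` module. The deny-lists and the `find_violations` helper below are hypothetical and deliberately small; a static check like this is only a first filter and would need sandboxed execution behind it, since aliasing or dynamic attribute access can evade it.

```python
import ast

# Hypothetical deny-lists, chosen for illustration only.
FORBIDDEN_MODULES = {"socket", "requests", "urllib", "subprocess"}
FORBIDDEN_CALLS = {("os", "system"), ("os", "popen")}
FORBIDDEN_NAMES = {"eval", "exec", "__import__"}


def find_violations(source: str) -> list:
    """Return human-readable violations; an empty list means it passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in FORBIDDEN_MODULES:
                    violations.append(f"forbidden import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in FORBIDDEN_MODULES:
                violations.append(f"forbidden import: {node.module}")
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
                if (func.value.id, func.attr) in FORBIDDEN_CALLS:
                    violations.append(
                        f"forbidden call: {func.value.id}.{func.attr}"
                    )
            elif isinstance(func, ast.Name) and func.id in FORBIDDEN_NAMES:
                violations.append(f"forbidden call: {func.id}")
    return violations


print(find_violations("import os\nos.system('ls')\n"))
print(find_violations("x = 1 + 1\n"))
```

Running the check before any reward is computed means a patch that trips it can be scored as a failure without ever being executed.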
References
- Huang et al., R-Zero: Self-Evolving Reasoning LLM From Zero Data (2025)
- Zhao et al., Absolute Zero: Reinforced Self-play Reasoning with Zero Data (2025)
- Liu et al., SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning… (2025)
- Ibrahim et al., arXiv:2408.10215, Reward engineering & shaping
- Masud et al., arXiv:2601.19100, Reward engineering for RL in software tasks
- OpenEnv Hackathon Self-Serve Guide (2026)
License
Apache-2.0