YAML Metadata Warning:empty or missing yaml metadata in repo card

Check out the documentation for more information.

ForgeEnv πŸ”§

A self-improving RL environment that teaches LLMs to fix HuggingFace training scripts as the ecosystem evolves.

ForgeEnv is an OpenEnv-compliant environment for the OpenEnv Hackathon (India 2026), theme #4 β€” Self-Improvement. Two LLM roles co-evolve inside a single environment:

  • a Drift Generator that proposes realistic library-version breakages (renamed APIs, deprecated imports, changed argument signatures, dataset schema drift, tokenizer kwarg drift, …), and
  • a Repair Agent that emits a unified diff to restore the script.

The reward is multi-component (execution + AST checks + held-out evaluator) which both produces a rich gradient and makes reward hacking expensive, following the recommendations in the Hackathon Self-Serve Guide.

Why it matters

LLM agents that write training code today are silently broken by HF library upgrades β€” a Trainer.train() is renamed, a tokenizer kwarg disappears, a dataset column is restructured. Today, humans patch these. ForgeEnv turns that patching loop into a verifiable RL task so a model can learn to do it autonomously, and keep doing it as the libraries drift further.

Live links

Artifact URL
Environment Space (Docker) https://huggingface.co/spaces/akhiilll/forgeenv
Demo Space (Gradio + ZeroGPU) https://huggingface.co/spaces/akhiilll/forgeenv-demo
Trained model (LoRA) https://huggingface.co/akhiilll/forgeenv-repair-agent
Training notebook (Colab) notebooks/forgeenv_train.ipynb

Architecture

                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                 β”‚  Teacher (deter- β”‚     curriculum β†’
                 β”‚  ministic)       β”‚     {RenameApiCall, DeprecateImport, …}
                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚ target_category
                          β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ForgeEnvironment (OpenEnv)                                      β”‚
β”‚   reset()  β†’  drift_gen obs (script, target_category)           β”‚
β”‚   step(BreakageAction)  β†’  repair obs (broken_script, trace)    β”‚
β”‚   step(RepairAction)    β†’  reward, breakdown, held-out scores   β”‚
β”‚                                                                 β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚   β”‚ Drift Generator   β”‚    β”‚ Repair Agent         β”‚            β”‚
β”‚   β”‚ (LLM, GRPO)       β”‚    β”‚ (LLM, GRPO + SFT)    β”‚            β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β”‚                                                                 β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚   β”‚ Simulator (AST + heuristic exec) + Visible Verifier   β”‚    β”‚
β”‚   β”‚ + Held-out Evaluator + Library Drift Engine            β”‚    β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The two-step episode flow (Phase 1 = drift, Phase 2 = repair) is exactly the Challenger / Solver loop from R-Zero, with role-switched prompts Γ  la SPIRAL and Absolute Zero Reasoner.

Reward design

visible_reward
 β”œβ”€ execution_success        (sandboxed run / heuristic simulator)
 β”œβ”€ ast_well_formed          (parses + no forbidden globals)
 β”œβ”€ format_compliance        (valid unified diff or full-script replacement)
 β”œβ”€ minimality               (smaller diffs preferred β€” anti-rewrite)
 └─ no_forbidden_globals     (locked-down execution check)

held_out_evaluator (NOT used for training, used for evals only)
 β”œβ”€ executed_cleanly
 β”œβ”€ matches_target_api       (semantic correctness)
 └─ regression_free          (other tests still pass)

Multiple independent components, plus a held-out evaluator the trainer never sees, so the agent can't game its way to the top of the curve.

Results (50 episodes / agent, oracle as upper-bound proxy for trained)

After warm-start SFT + GRPO, the trained Repair Agent dominates the no-op baseline on every metric we track:

Agent Mean visible reward Success rate (held-out exec)
Baseline (no-op) 0.90 50 %
Trained (oracle) 1.51 86 %

Three plots (committed to artifacts/plots/):

  • baseline_vs_trained.png β€” reward distribution, baseline vs trained.
  • training_reward_curve.png β€” reward trajectory across episodes.
  • success_by_category.png β€” per-primitive success rates.

A 43-entry repair_library.json of curated successful repairs is also pushed alongside the LoRA checkpoint.

Quick start

# 1. install (env-only deps, no torch needed for the env itself)
pip install -e .[openenv]
pip install -e .[dev]

# 2. run the test suite
pytest -q                 # 74 tests β€” full env + roles + reward + training

# 3. spin up the environment locally
uvicorn forgeenv.env.server:app --port 7860

# 4. generate the demo artifacts (plots + repair_library.json + eval JSON)
python scripts/generate_artifacts.py --n_baseline 50 --n_trained 50

# 5. push to HF Spaces
export HF_TOKEN=hf_...
python scripts/deploy_spaces.py --user akhiilll

Training (warm-start SFT + GRPO via TRL + Unsloth) lives entirely in notebooks/forgeenv_train.ipynb β€” open it on Colab with a T4 or A100 and re-run end-to-end.

Repository layout

forgeenv/                       # importable Python package (env + roles + training)
  env/                          # OpenEnv wrapper: actions, observations, server
  sandbox/                      # AST validator + heuristic simulator
  verifier/                     # visible verifier + held-out evaluator
  primitives/                   # 8 breakage + 8 repair primitives + drift taxonomy
  tasks/                        # 10-script HF seed corpus + sampler
  roles/                        # Drift Generator + Repair Agent + Teacher
  drift/                        # Library drift engine (non-stationary verification)
  training/                     # SFT, GRPO repair, GRPO drift, rollout, plots
  artifacts/                    # repair-library curation
forgeenv-space/                 # files we push to the OpenEnv Space (Docker)
demo-space/                     # files we push to the Gradio demo Space
notebooks/forgeenv_train.ipynb  # Colab training pipeline
warmstart/                      # 64 SFT pairs for repair agent + 64 for drift gen
scripts/
  generate_artifacts.py         # plots + eval_results.json + repair_library.json
  deploy_spaces.py              # one-shot push to HF Spaces
artifacts/                      # generated plots + curated repair library
tests/                          # 74 pytest tests

Anti-cheat / reward-hacking safeguards

Following the Hackathon Self-Serve Guide explicitly:

  1. Multiple independent reward functions (5 visible + 3 held-out).
  2. Held-out evaluator the trainer never sees, used only for plots.
  3. Locked-down execution in the sandbox simulator β€” no globals abuse, timeouts on every run.
  4. AST validator rejects forbidden constructs (network calls, os.system, etc.) before reward is computed.
  5. Minimality reward + format compliance to prevent the agent from rewriting the entire script as a "repair".
  6. The Drift Generator is itself trained against an R-Zero composite reward (uncertainty βˆ’ repetition) so it can't trivially game the agent.

References

  • Huang et al., R-Zero: Self-Evolving Reasoning LLM From Zero Data (2025)
  • Zhao et al., Absolute Zero: Reinforced Self-play Reasoning with Zero Data (2025)
  • Liu et al., SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning… (2025)
  • Ibrahim et al., arXiv:2408.10215 β€” Reward engineering & shaping
  • Masud et al., arXiv:2601.19100 β€” Reward engineering for RL in software tasks
  • OpenEnv Hackathon Self-Serve Guide (2026)

License

Apache-2.0

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Papers for akhiilll/forgeenv-source