Agents-K1

The knowledge extraction model in Agents-K1 is a 4B-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 with GRPO (Group Relative Policy Optimization) on the IEPile information-extraction corpus. It targets Named Entity Recognition (NER) and Relation Extraction (RE) in English scientific and general-domain text.

The model produces structured JSON extractions with explicit step-by-step reasoning, enabling its use as a building block in downstream knowledge-graph construction, citation linking, and multi-hop QA pipelines.

Highlights

  • +3.3 absolute F1 points (0.5317 → 0.5647) averaged over 10 NER/RE benchmarks vs. the Qwen3-4B-Instruct base model, with gains on every dataset evaluated (including held-out CrossNER domains).
  • Trained with rule-based rewards (format + JSON validity + entity/relation F1), no human preference data required.
  • Outputs follow a strict <think>…</think><answer>…</answer> schema, making reasoning auditable and JSON parsing reliable.

Intended use

Designed as an extraction backbone for:

  • Scientific-literature mining (entities/relations in biomedicine, chemistry, CS, etc.)
  • Knowledge-graph construction
  • Pre-processing for retrieval / multi-hop QA systems

Not intended for general-purpose chat; it has been specialized for structured extraction.

Usage

The model uses the same chat template as Qwen3-4B-Instruct and expects a schema-driven user prompt. The reply will contain a <think> block followed by an <answer> block with a JSON object.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "InternScience/Agents-K1"
tok   = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16", device_map="auto")

system = (
    "You are an expert in information extraction. Given a task instruction "
    "with schema definitions and input text, extract the required information.\n\n"
    "You should think step by step about the extraction task, then provide "
    "your answer in JSON format.\n\n"
    "Format your response as:\n"
    "<think>\nYour step-by-step reasoning...\n</think>\n"
    "<answer>\nYour JSON extraction result here\n</answer>"
)

user = (
    "You are an expert in named entity recognition. Please extract entities "
    "that match the schema definition from the input. Return an empty list if "
    "the entity type does not exist. Please respond in the format of a JSON "
    "dictionary.\n\n"
    'Entity types to extract: ["person", "organization", "location"]\n\n'
    "Input text: Marie Curie worked at the University of Paris.\n\n"
    "Please think step by step and respond in the following format:\n"
    "<think>\nYour reasoning process...\n</think>\n"
    "<answer>\nYour JSON extraction result\n</answer>"
)

messages = [{"role": "system", "content": system},
            {"role": "user",   "content": user}]
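# Render the chat template and tokenize; return_dict=True yields input_ids plus attention_mask.
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_dict=True, return_tensors="pt").to(model.device)
# Greedy decoding keeps the extraction deterministic.
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))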

For RE, replace the "Entity types to extract: [...]" line in the user prompt with "Relation types to extract: [...]" and use a relation-extraction instruction. The output is a JSON dict mapping each relation type to a list of {head, tail} pairs.
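Downstream code usually needs the JSON payload rather than the raw reply. The following is a minimal sketch of one way to recover it, assuming the reply follows the <think>/<answer> format described above. The helper name parse_extraction and the sample reply string are illustrative only; they are not part of the released model or actual model output.

```python
import json
import re

# Hypothetical helper: pull the JSON payload out of a
# "<think>...</think><answer>...</answer>" reply.
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def parse_extraction(reply: str) -> dict:
    """Return the JSON dict inside <answer>, or {} if it is missing or invalid."""
    match = ANSWER_RE.search(reply)
    if match is None:
        return {}
    try:
        parsed = json.loads(match.group(1))
    except json.JSONDecodeError:
        return {}
    # Only a JSON object matches the documented output schema.
    return parsed if isinstance(parsed, dict) else {}


# Illustrative reply text (hand-written for this sketch, not a real generation).
reply = (
    "<think>\nMarie Curie is a person; the University of Paris is an organization.\n</think>\n"
    '<answer>\n{"person": ["Marie Curie"], '
    '"organization": ["University of Paris"], "location": []}\n</answer>'
)
print(parse_extraction(reply))
```

Returning an empty dict on failure keeps a batch pipeline running; callers that need to distinguish "nothing found" from "malformed reply" may prefer to raise instead.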

Training data

Training data comes from IEPile, restricted to:

  • English NER and RE tasks
  • 22 source datasets, mixing scientific (SciERC, GENIA_NER, BC5CDR, BC2GM, BC4CHEMD, AnatEM, NCBI) and general-domain (CoNLL2003, conll04, FabNER, MultiNERD, NYT11, kbp37, …) corpora

| Split      | Size   | Notes                                                       |
|------------|--------|-------------------------------------------------------------|
| Train      | 14,400 | 90/10 split, seed=42; each source capped to balance the mix |
| Validation | 1,600  |                                                             |

70% of samples have non-empty gold labels; 30% are empty-label cases (to prevent the model from defaulting to non-empty outputs).
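The capping and 90/10 split described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' preprocessing script: the function name cap_and_split, the per-source cap value, and the toy data are all hypothetical, and only the seed (42) and the 10% validation fraction come from the table.

```python
import random


def cap_and_split(samples_by_source, cap, val_fraction=0.1, seed=42):
    """Cap each source at `cap` samples, pool them, then split train/validation."""
    rng = random.Random(seed)
    pooled = []
    for source in sorted(samples_by_source):  # sorted for a reproducible order
        items = list(samples_by_source[source])
        rng.shuffle(items)
        pooled.extend(items[:cap])  # large sources are truncated, small ones kept whole
    rng.shuffle(pooled)
    n_val = int(round(len(pooled) * val_fraction))
    return pooled[n_val:], pooled[:n_val]


# Toy data: one large and one small source (hypothetical sizes and cap).
toy = {
    "CoNLL2003": [f"conll-{i}" for i in range(50)],
    "SciERC": [f"scierc-{i}" for i in range(8)],
}
train, val = cap_and_split(toy, cap=12)
print(len(train), len(val))
```

Capping before pooling keeps a few large corpora from dominating the mix, which is the stated purpose of the cap.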

Training procedure

  • Algorithm: GRPO (PPO without a critic), implemented in veRL.
  • Reward ∈ [0, 1]:
    • format reward: 0.1 · 𝟙[has <think>] + 0.1 · 𝟙[has <answer>]
    • JSON validity: 0.1 · 𝟙[valid JSON dict] (or 0.05 for non-dict valid JSON)
    • task F1: 0.7 · F1(pred, gold), using entity-set F1 for NER and triple-set F1 for RE
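The reward components listed above sum to at most 1.0 (0.1 + 0.1 + 0.1 + 0.7). A minimal sketch of how such a rule-based reward might be computed for NER is shown below. Only the weights come from this card; the tag detection, the treatment of an empty prediction against empty gold labels as F1 = 1, and the helper names set_f1, flatten_ner, and reward are assumptions, not the actual veRL reward code.

```python
import json
import re

ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def set_f1(pred: set, gold: set) -> float:
    """F1 between two sets of extracted items."""
    if not pred and not gold:
        return 1.0  # assumption: correctly predicting "nothing" is a perfect score
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def flatten_ner(extraction: dict) -> set:
    """Turn {"type": ["mention", ...]} into a set of (type, mention) pairs."""
    return {
        (etype, mention)
        for etype, mentions in extraction.items()
        if isinstance(mentions, list)
        for mention in mentions
        if isinstance(mention, str)
    }


def reward(reply: str, gold: dict) -> float:
    """Format (0.1 + 0.1) + JSON validity (0.1 or 0.05) + 0.7 * task F1."""
    score = 0.0
    if "<think>" in reply and "</think>" in reply:
        score += 0.1
    match = ANSWER_RE.search(reply)
    if match is None:
        return score  # no <answer> block: nothing further to score
    score += 0.1
    try:
        parsed = json.loads(match.group(1))
    except json.JSONDecodeError:
        return score
    if not isinstance(parsed, dict):
        return score + 0.05  # valid JSON, but not a dict
    score += 0.1
    return score + 0.7 * set_f1(flatten_ner(parsed), flatten_ner(gold))


gold = {"person": ["Marie Curie"], "organization": ["University of Paris"], "location": []}
partial = '<think>...</think><answer>{"person": ["Marie Curie"]}</answer>'
print(round(reward(partial, gold), 3))
```

For RE, the same structure would apply with flatten_ner swapped for a function that yields (relation, head, tail) triples.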

Evaluation

Reported numbers are micro-F1 on each benchmark's official test split, using the same prompt template as training. Gains are base → Agents-K1 (GRPO).

| Dataset                        | Task | n     | Base F1 | Agents-K1 F1 | Δ      |
|--------------------------------|------|-------|---------|--------------|--------|
| CoNLL2003                      | NER  | 3,184 | 0.6547  | 0.7007       | +0.046 |
| NCBI-Disease                   | NER  | 937   | 0.6737  | 0.7340       | +0.060 |
| BC5CDR                         | NER  | 4,788 | 0.7126  | 0.7494       | +0.037 |
| CrossNER AI (held-out)         | NER  | 430   | 0.4862  | 0.5400       | +0.054 |
| CrossNER Literature (held-out) | NER  | 416   | 0.5462  | 0.5736       | +0.027 |
| CrossNER Music (held-out)      | NER  | 457   | 0.5791  | 0.6050       | +0.026 |
| CrossNER Politics (held-out)   | NER  | 650   | 0.6611  | 0.6855       | +0.024 |
| CrossNER Science (held-out)    | NER  | 532   | 0.5928  | 0.6132       | +0.020 |
| SciERC                         | NER  | 397   | 0.1166  | 0.1270       | +0.010 |
| conll04                        | RE   | 287   | 0.2933  | 0.3181       | +0.025 |
| Average                        |      |       | 0.5317  | 0.5647       | +0.033 |

All 10 benchmarks improve, including the 5 CrossNER domains that are not in the training mix, which suggests generalization rather than mere fitting to in-distribution sources.
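Micro-F1, the metric reported above, pools true-positive, predicted, and gold counts over every test example before computing precision and recall, so long documents with many entities weigh more than short ones. The sketch below shows that pooling for sets of (type, mention) pairs. The function name micro_f1 and the toy inputs are hypothetical, and the exact matching rules used in the authors' evaluation are not specified here.

```python
def micro_f1(predictions, references):
    """Micro-averaged F1 over parallel lists of predicted and gold item sets."""
    tp = n_pred = n_gold = 0
    for pred, gold in zip(predictions, references):
        tp += len(pred & gold)   # exact-match items in this example
        n_pred += len(pred)
        n_gold += len(gold)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy example: one partially correct prediction and one correctly empty one.
predictions = [{("person", "Marie Curie")}, set()]
references = [
    {("person", "Marie Curie"), ("organization", "University of Paris")},
    set(),
]
print(round(micro_f1(predictions, references), 4))
```

Note that, unlike a per-sample reward, a correctly empty prediction contributes nothing to the pooled counts here; it neither helps nor hurts the score.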

Limitations

  • Schema-driven prompting required. Free-form questions will likely return malformed JSON; always supply explicit entity / relation type lists.

License

Released under the Apache-2.0 license, following the upstream Qwen3-4B-Instruct-2507 license. Users must also comply with the licenses of the IEPile component datasets when using this model in derivative works.
