MarvelCQ committed
Commit d84ca00 · verified · 1 Parent(s): ba9992c

Update README.md

Files changed (1): README.md (+1 -1)
README.md CHANGED
@@ -13,7 +13,7 @@ license: mit
  </p>
  </div>
 
- ### Instruction
+ ### Introduction
  We propose DianJin-R1, a novel framework that enhances financial reasoning in LLMs through reasoning-augmented supervision and reinforcement learning. Central to our approach is DianJin-R1-Data, a high-quality dataset constructed from CFLUE, FinQA, and a proprietary compliance corpus (Chinese Compliance Check, CCC), combining diverse financial reasoning scenarios with verified annotations. We adopt a structured training paradigm in which models generate reasoning steps and final answers using supervised fine-tuning. To further improve reasoning quality, we use Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that incorporates dual reward signals for output structure and answer accuracy. \
  \
  We open-source our models, DianJin-R1-7B and DianJin-R1-32B, based on Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct, which are trained in two stages: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL).
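To make the "dual reward signals" concrete, below is a minimal, hypothetical Python sketch: one reward checks that a completion keeps its reasoning inside a `<think>...</think>` block before the final answer, the other checks the extracted answer against the verified reference, and the two are combined into the scalar reward GRPO sees for each sampled completion. The tag names, exact-match check, and equal weights are illustrative assumptions and are not taken from the DianJin-R1 code.

```python
# Hypothetical dual-reward sketch (structure + accuracy) for GRPO-style training.
# Tag format, answer extraction, and weights are assumptions for illustration only.
import re


def format_reward(completion: str) -> float:
    """1.0 if the completion has a <think>...</think> block followed by an answer."""
    pattern = r"^<think>.*?</think>\s*.+$"
    return 1.0 if re.match(pattern, completion.strip(), flags=re.DOTALL) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the text after </think> exactly matches the verified reference answer."""
    answer = completion.split("</think>")[-1].strip()
    return 1.0 if answer == reference.strip() else 0.0


def dual_reward(completion: str, reference: str,
                w_format: float = 0.5, w_accuracy: float = 0.5) -> float:
    """Combined scalar reward assigned to one sampled completion."""
    return w_format * format_reward(completion) + w_accuracy * accuracy_reward(completion, reference)


# Example: score a group of sampled completions for one prompt; GRPO then
# normalizes these rewards within the group to compute relative advantages.
samples = [
    "<think>The coupon is 5% of 1000, i.e. 50.</think> 50",
    "The answer is 50",  # correct value but missing the required reasoning structure
]
print([dual_reward(s, "50") for s in samples])  # -> [1.0, 0.0]
```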