Update README.md
README.md
@@ -13,7 +13,7 @@ license: mit
</p>
</div>
-### 
+### Introduction
We propose DianJin-R1, a novel framework that enhances financial reasoning in LLMs through reasoning-augmented supervision and reinforcement learning. Central to our approach is DianJin-R1-Data, a high-quality dataset constructed from CFLUE, FinQA, and a proprietary compliance corpus (Chinese Compliance Check, CCC), combining diverse financial reasoning scenarios with verified annotations. We adopt a structured training paradigm in which models are fine-tuned to generate explicit reasoning steps before producing final answers. To further improve reasoning quality, we use Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that incorporates dual reward signals for output structure and answer accuracy.

We open-source our models, DianJin-R1-7B and DianJin-R1-32B, based on Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct, which are trained in two stages: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL).
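
As a rough illustration of the dual reward signals mentioned above, here is a minimal Python sketch that scores a sampled completion on output structure and answer accuracy before feeding a combined reward to GRPO. The `<think>`/`<answer>` tag format, the exact-match accuracy check, the equal weighting, and all function names are illustrative assumptions, not the project's actual implementation.

```python
# Minimal sketch (not the repository's training code) of the two reward signals:
# one for output structure, one for answer accuracy.
import re


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed <think>...</think><answer>...</answer> structure."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), flags=re.DOTALL) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted final answer matches the verified reference answer."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    prediction = match.group(1).strip() if match else ""
    return 1.0 if prediction == reference.strip() else 0.0


def grpo_reward(completion: str, reference: str) -> float:
    """Combined scalar reward assigned to each sampled completion in a GRPO group."""
    return format_reward(completion) + accuracy_reward(completion, reference)
```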