Upload folder using huggingface_hub
- .gitattributes +5 -0
- README.md +76 -0
- added_tokens.json +1021 -0
- chat_template.jinja +46 -0
- config.json +74 -0
- configuration_paddleocr_vl.py +191 -0
- examples/01.png +3 -0
- examples/02.png +3 -0
- examples/03.png +0 -0
- examples/04.png +3 -0
- examples/05.png +3 -0
- generation_config.json +7 -0
- image_processing.py +563 -0
- model.safetensors +3 -0
- modeling_paddleocr_vl.py +0 -0
- preprocessor_config.json +33 -0
- processing_paddleocr_vl.py +293 -0
- processor_config.json +6 -0
- special_tokens_map.json +58 -0
- tokenizer.json +3 -0
- tokenizer.model +3 -0
- tokenizer_config.json +0 -0
.gitattributes
CHANGED
@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+examples/01.png filter=lfs diff=lfs merge=lfs -text
+examples/02.png filter=lfs diff=lfs merge=lfs -text
+examples/04.png filter=lfs diff=lfs merge=lfs -text
+examples/05.png filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,76 @@
---
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- PaddleOCR
- OCR
- Manga
base_model: PaddlePaddle/PaddleOCR-VL
language:
- ja
- multilingual
library_name: PaddleOCR
datasets:
- hal-utokyo/Manga109-s
---

# PaddleOCR-VL-For-Manga

## Model Description

PaddleOCR-VL-For-Manga is an OCR model specialized for Japanese manga text recognition. It is fine-tuned from [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) and achieves substantially higher accuracy on manga speech bubbles and stylized fonts.

The model was fine-tuned on a combination of the [Manga109-s dataset](http://www.manga109.org/) and 1.5 million synthetic samples. It showcases how supervised fine-tuning (SFT) can turn a powerful, general-purpose base such as [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL), which supports 109 languages, into a highly accurate, domain-specific VLM for OCR.

This project serves as a practical guide for developers looking to build their own custom OCR solutions.
## Performance

The model achieves **70% full-sentence accuracy** on a held-out test set of Manga109-s crops (a 10% split of the dataset). On the same test set, the original PaddleOCR-VL achieves 27% full-sentence accuracy.

Common errors involve visually similar characters that are often used interchangeably, such as:

- `！？` vs. `!?` (full-width vs. half-width punctuation)
- `ＯＫ` vs. `OK` (full-width vs. half-width letters)
- `１２０５` vs. `1205` (full-width vs. half-width numbers)
- “人” (U+4EBA) vs. “⼈” (U+2F08) (standard CJK Unified Ideograph vs. CJK Radical)

The prevalence of these character types highlights a limitation of standard metrics such as Character Error Rate (CER): they may not fully capture the model's practical accuracy, because they penalize semantically equivalent variants that are common in stylized text.
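To make this caveat concrete, here is a small illustrative sketch (not the card's actual evaluation script) that computes both full-sentence accuracy and CER, with optional Unicode NFKC folding. NFKC maps full-width forms such as `１２０５` to `1205` and the CJK radical `⼈` (U+2F08) to the unified ideograph `人` (U+4EBA), so folding before scoring treats these variants as equivalent.

```python
# Minimal sketch: full-sentence accuracy vs. CER, with optional NFKC folding.
# Illustrative only; the model card does not specify its exact evaluation code.
import unicodedata

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def evaluate(preds, refs, normalize=False):
    if normalize:
        # NFKC folds full-width "１２０５"/"ＯＫ" to "1205"/"OK" and the
        # CJK radical U+2F08 to the unified ideograph U+4EBA.
        preds = [unicodedata.normalize("NFKC", p) for p in preds]
        refs = [unicodedata.normalize("NFKC", r) for r in refs]
    exact = sum(p == r for p, r in zip(preds, refs)) / len(refs)
    cer = sum(levenshtein(p, r) for p, r in zip(preds, refs)) / sum(len(r) for r in refs)
    return exact, cer
```

Whether folding is appropriate depends on whether downstream consumers of the transcriptions treat the variants as equivalent.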
## Examples

| # | Image | Prediction |
|---|---|---|
| 1 | ![example 1](examples/01.png) | 心拍呼吸正常値<br>お人よし度過剰値...<br>間違いなく<br>パパッ...!<br>生存確認っ...! |
| 2 | ![example 2](examples/02.png) | あとは『メルニィ<br>宇宙鉄道』とか<br>『TipTap』とか<br>全部その人が<br>考えたらしい |
| 3 | ![example 3](examples/03.png) | ★コミックス20巻1月4日(土)発売〟TVアニメ1月11日(土)放送開始!! |
| 4 | ![example 4](examples/04.png) | 我々魔女協会が<br>長年追い続ける<br>最大の敵<br>ウロロが「王の魔法」なら<br>あれは世界を削り変える<br>「神の魔法」 |
| 5 | ![example 5](examples/05.png) | 天弓の動きについてくだけじゃ勝てねぇ…! |

## How to Use

You can run this model with [transformers](https://github.com/huggingface/transformers), [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR), or any other library that supports [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL); the model architecture and weight layout are identical to the base model. A minimal transformers sketch follows.
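A minimal inference sketch with transformers is shown below, assuming a placeholder repo id and the base model's `OCR:` recognition prompt. `trust_remote_code=True` is required because the processor and model classes ship with the repository (`processing_paddleocr_vl.py`, `modeling_paddleocr_vl.py`); the exact processor call signature may differ from this sketch.

```python
# Minimal inference sketch; repo id and prompt wording are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "your-namespace/PaddleOCR-VL-For-Manga"  # placeholder repo id
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.bfloat16
)

image = Image.open("examples/01.png").convert("RGB")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "OCR:"}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding the transcription.
new_tokens = generated[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```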
If your application involves documents with structured layouts, you can pair the fine-tuned OCR model with [PP-DocLayoutV2](https://huggingface.co/PaddlePaddle/PaddleOCR-VL/tree/main/PP-DocLayoutV2/) for layout analysis. Bear in mind, however, that manga reading order and layout differ considerably from those of standard documents.

## Training Details

- **Base Model**: [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL)
- **Dataset**:
  - [Manga109-s](http://www.manga109.org/): 0.1 million randomly sampled text-region crops (not full pages) were used for training (a 90% split); the remaining 10% of crops were held out for testing.
  - Synthetic data: 1.5 million generated samples.
- **Training Frameworks**: [transformers](https://github.com/huggingface/transformers) and [trl](https://github.com/huggingface/trl); a skeletal example follows this list.
- **Alternatives for SFT**:
  - [ERNIEKit](https://github.com/PaddlePaddle/ERNIE)
  - [ms-swift](https://github.com/modelscope/swift)
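A skeletal trl setup might look like the following; the dataset wiring, collation, and hyperparameters are placeholders, not the recipe actually used for this model.

```python
# Skeletal SFT sketch with trl; everything here is a placeholder recipe.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from trl import SFTConfig, SFTTrainer

base = "PaddlePaddle/PaddleOCR-VL"
processor = AutoProcessor.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base, trust_remote_code=True, torch_dtype=torch.bfloat16
)

# train_dataset must pair each text-region crop with its transcription in a
# conversational format; preparing it (and a matching collator) is up to you.
trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="paddleocr-vl-manga-sft", bf16=True),
    train_dataset=train_dataset,  # placeholder; see comment above
    processing_class=processor,
)
trainer.train()
```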
## Acknowledgements

- [Manga109-s](http://www.manga109.org/), the dataset that provided the manga text-region crops used for training and evaluation.
- [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL), the base vision-language model from which this model is fine-tuned.
- [manga-ocr](https://github.com/kha-white/manga-ocr), used in this project for data processing and synthetic data generation; it also inspired practical workflows and evaluation considerations for manga OCR.

## License

This model is licensed under the **Apache 2.0** license.
added_tokens.json
ADDED
@@ -0,0 +1,1021 @@
{
  "<ecel>": 101308,
  "<fcel>": 101309,
  "<lcel>": 101311,
  "<nl>": 101313,
  "<ucel>": 101312,
  "<xcel>": 101310,
  "<|AUDIO_PLACEHOLDER|>": 100296,
  "<|CROP_COL_SEP|>": 101301,
  "<|CROP_ROW_SEP|>": 101302,
  "<|IMAGE_END|>": 101306,
  "<|IMAGE_PLACEHOLDER|>": 100295,
  "<|IMAGE_SEP|>": 101303,
  "<|IMAGE_START|>": 101305,
  "<|LOC_0|>": 100297,
  "<|LOC_1000|>": 101297,
  "<|LOC_100|>": 100397,
  "<|LOC_101|>": 100398,
  "<|LOC_102|>": 100399,
  "<|LOC_103|>": 100400,
  "<|LOC_104|>": 100401,
  "<|LOC_105|>": 100402,
  "<|LOC_106|>": 100403,
  "<|LOC_107|>": 100404,
  "<|LOC_108|>": 100405,
  "<|LOC_109|>": 100406,
  "<|LOC_10|>": 100307,
  "<|LOC_110|>": 100407,
  "<|LOC_111|>": 100408,
  "<|LOC_112|>": 100409,
  "<|LOC_113|>": 100410,
  "<|LOC_114|>": 100411,
  "<|LOC_115|>": 100412,
  "<|LOC_116|>": 100413,
  "<|LOC_117|>": 100414,
  "<|LOC_118|>": 100415,
  "<|LOC_119|>": 100416,
  "<|LOC_11|>": 100308,
  "<|LOC_120|>": 100417,
  "<|LOC_121|>": 100418,
  "<|LOC_122|>": 100419,
  "<|LOC_123|>": 100420,
  "<|LOC_124|>": 100421,
  "<|LOC_125|>": 100422,
  "<|LOC_126|>": 100423,
  "<|LOC_127|>": 100424,
  "<|LOC_128|>": 100425,
  "<|LOC_129|>": 100426,
  "<|LOC_12|>": 100309,
  "<|LOC_130|>": 100427,
  "<|LOC_131|>": 100428,
  "<|LOC_132|>": 100429,
  "<|LOC_133|>": 100430,
  "<|LOC_134|>": 100431,
  "<|LOC_135|>": 100432,
  "<|LOC_136|>": 100433,
  "<|LOC_137|>": 100434,
  "<|LOC_138|>": 100435,
  "<|LOC_139|>": 100436,
  "<|LOC_13|>": 100310,
  "<|LOC_140|>": 100437,
  "<|LOC_141|>": 100438,
  "<|LOC_142|>": 100439,
  "<|LOC_143|>": 100440,
  "<|LOC_144|>": 100441,
  "<|LOC_145|>": 100442,
  "<|LOC_146|>": 100443,
  "<|LOC_147|>": 100444,
  "<|LOC_148|>": 100445,
  "<|LOC_149|>": 100446,
  "<|LOC_14|>": 100311,
  "<|LOC_150|>": 100447,
  "<|LOC_151|>": 100448,
  "<|LOC_152|>": 100449,
  "<|LOC_153|>": 100450,
  "<|LOC_154|>": 100451,
  "<|LOC_155|>": 100452,
  "<|LOC_156|>": 100453,
  "<|LOC_157|>": 100454,
  "<|LOC_158|>": 100455,
  "<|LOC_159|>": 100456,
  "<|LOC_15|>": 100312,
  "<|LOC_160|>": 100457,
  "<|LOC_161|>": 100458,
  "<|LOC_162|>": 100459,
  "<|LOC_163|>": 100460,
  "<|LOC_164|>": 100461,
  "<|LOC_165|>": 100462,
  "<|LOC_166|>": 100463,
  "<|LOC_167|>": 100464,
  "<|LOC_168|>": 100465,
  "<|LOC_169|>": 100466,
  "<|LOC_16|>": 100313,
  "<|LOC_170|>": 100467,
  "<|LOC_171|>": 100468,
  "<|LOC_172|>": 100469,
  "<|LOC_173|>": 100470,
  "<|LOC_174|>": 100471,
  "<|LOC_175|>": 100472,
  "<|LOC_176|>": 100473,
  "<|LOC_177|>": 100474,
  "<|LOC_178|>": 100475,
  "<|LOC_179|>": 100476,
  "<|LOC_17|>": 100314,
  "<|LOC_180|>": 100477,
  "<|LOC_181|>": 100478,
  "<|LOC_182|>": 100479,
  "<|LOC_183|>": 100480,
  "<|LOC_184|>": 100481,
  "<|LOC_185|>": 100482,
  "<|LOC_186|>": 100483,
  "<|LOC_187|>": 100484,
  "<|LOC_188|>": 100485,
  "<|LOC_189|>": 100486,
  "<|LOC_18|>": 100315,
  "<|LOC_190|>": 100487,
  "<|LOC_191|>": 100488,
  "<|LOC_192|>": 100489,
  "<|LOC_193|>": 100490,
  "<|LOC_194|>": 100491,
  "<|LOC_195|>": 100492,
  "<|LOC_196|>": 100493,
  "<|LOC_197|>": 100494,
  "<|LOC_198|>": 100495,
  "<|LOC_199|>": 100496,
  "<|LOC_19|>": 100316,
  "<|LOC_1|>": 100298,
  "<|LOC_200|>": 100497,
  "<|LOC_201|>": 100498,
  "<|LOC_202|>": 100499,
  "<|LOC_203|>": 100500,
  "<|LOC_204|>": 100501,
  "<|LOC_205|>": 100502,
  "<|LOC_206|>": 100503,
  "<|LOC_207|>": 100504,
  "<|LOC_208|>": 100505,
  "<|LOC_209|>": 100506,
  "<|LOC_20|>": 100317,
  "<|LOC_210|>": 100507,
  "<|LOC_211|>": 100508,
  "<|LOC_212|>": 100509,
  "<|LOC_213|>": 100510,
  "<|LOC_214|>": 100511,
  "<|LOC_215|>": 100512,
  "<|LOC_216|>": 100513,
  "<|LOC_217|>": 100514,
  "<|LOC_218|>": 100515,
  "<|LOC_219|>": 100516,
  "<|LOC_21|>": 100318,
  "<|LOC_220|>": 100517,
  "<|LOC_221|>": 100518,
  "<|LOC_222|>": 100519,
  "<|LOC_223|>": 100520,
  "<|LOC_224|>": 100521,
  "<|LOC_225|>": 100522,
  "<|LOC_226|>": 100523,
  "<|LOC_227|>": 100524,
  "<|LOC_228|>": 100525,
  "<|LOC_229|>": 100526,
  "<|LOC_22|>": 100319,
  "<|LOC_230|>": 100527,
  "<|LOC_231|>": 100528,
  "<|LOC_232|>": 100529,
  "<|LOC_233|>": 100530,
  "<|LOC_234|>": 100531,
  "<|LOC_235|>": 100532,
  "<|LOC_236|>": 100533,
  "<|LOC_237|>": 100534,
  "<|LOC_238|>": 100535,
  "<|LOC_239|>": 100536,
  "<|LOC_23|>": 100320,
  "<|LOC_240|>": 100537,
  "<|LOC_241|>": 100538,
  "<|LOC_242|>": 100539,
  "<|LOC_243|>": 100540,
  "<|LOC_244|>": 100541,
  "<|LOC_245|>": 100542,
  "<|LOC_246|>": 100543,
  "<|LOC_247|>": 100544,
  "<|LOC_248|>": 100545,
  "<|LOC_249|>": 100546,
  "<|LOC_24|>": 100321,
  "<|LOC_250|>": 100547,
  "<|LOC_251|>": 100548,
  "<|LOC_252|>": 100549,
  "<|LOC_253|>": 100550,
  "<|LOC_254|>": 100551,
  "<|LOC_255|>": 100552,
  "<|LOC_256|>": 100553,
  "<|LOC_257|>": 100554,
  "<|LOC_258|>": 100555,
  "<|LOC_259|>": 100556,
  "<|LOC_25|>": 100322,
  "<|LOC_260|>": 100557,
  "<|LOC_261|>": 100558,
  "<|LOC_262|>": 100559,
  "<|LOC_263|>": 100560,
  "<|LOC_264|>": 100561,
  "<|LOC_265|>": 100562,
  "<|LOC_266|>": 100563,
  "<|LOC_267|>": 100564,
  "<|LOC_268|>": 100565,
  "<|LOC_269|>": 100566,
  "<|LOC_26|>": 100323,
  "<|LOC_270|>": 100567,
  "<|LOC_271|>": 100568,
  "<|LOC_272|>": 100569,
  "<|LOC_273|>": 100570,
  "<|LOC_274|>": 100571,
  "<|LOC_275|>": 100572,
  "<|LOC_276|>": 100573,
  "<|LOC_277|>": 100574,
  "<|LOC_278|>": 100575,
  "<|LOC_279|>": 100576,
  "<|LOC_27|>": 100324,
  "<|LOC_280|>": 100577,
  "<|LOC_281|>": 100578,
  "<|LOC_282|>": 100579,
  "<|LOC_283|>": 100580,
  "<|LOC_284|>": 100581,
  "<|LOC_285|>": 100582,
  "<|LOC_286|>": 100583,
  "<|LOC_287|>": 100584,
  "<|LOC_288|>": 100585,
  "<|LOC_289|>": 100586,
  "<|LOC_28|>": 100325,
  "<|LOC_290|>": 100587,
  "<|LOC_291|>": 100588,
  "<|LOC_292|>": 100589,
  "<|LOC_293|>": 100590,
  "<|LOC_294|>": 100591,
  "<|LOC_295|>": 100592,
  "<|LOC_296|>": 100593,
  "<|LOC_297|>": 100594,
  "<|LOC_298|>": 100595,
  "<|LOC_299|>": 100596,
  "<|LOC_29|>": 100326,
  "<|LOC_2|>": 100299,
  "<|LOC_300|>": 100597,
  "<|LOC_301|>": 100598,
  "<|LOC_302|>": 100599,
  "<|LOC_303|>": 100600,
  "<|LOC_304|>": 100601,
  "<|LOC_305|>": 100602,
  "<|LOC_306|>": 100603,
  "<|LOC_307|>": 100604,
  "<|LOC_308|>": 100605,
  "<|LOC_309|>": 100606,
  "<|LOC_30|>": 100327,
  "<|LOC_310|>": 100607,
  "<|LOC_311|>": 100608,
  "<|LOC_312|>": 100609,
  "<|LOC_313|>": 100610,
  "<|LOC_314|>": 100611,
  "<|LOC_315|>": 100612,
  "<|LOC_316|>": 100613,
  "<|LOC_317|>": 100614,
  "<|LOC_318|>": 100615,
  "<|LOC_319|>": 100616,
  "<|LOC_31|>": 100328,
  "<|LOC_320|>": 100617,
  "<|LOC_321|>": 100618,
  "<|LOC_322|>": 100619,
  "<|LOC_323|>": 100620,
  "<|LOC_324|>": 100621,
  "<|LOC_325|>": 100622,
  "<|LOC_326|>": 100623,
  "<|LOC_327|>": 100624,
  "<|LOC_328|>": 100625,
  "<|LOC_329|>": 100626,
  "<|LOC_32|>": 100329,
  "<|LOC_330|>": 100627,
  "<|LOC_331|>": 100628,
  "<|LOC_332|>": 100629,
  "<|LOC_333|>": 100630,
  "<|LOC_334|>": 100631,
  "<|LOC_335|>": 100632,
  "<|LOC_336|>": 100633,
  "<|LOC_337|>": 100634,
  "<|LOC_338|>": 100635,
  "<|LOC_339|>": 100636,
  "<|LOC_33|>": 100330,
  "<|LOC_340|>": 100637,
  "<|LOC_341|>": 100638,
  "<|LOC_342|>": 100639,
  "<|LOC_343|>": 100640,
  "<|LOC_344|>": 100641,
  "<|LOC_345|>": 100642,
  "<|LOC_346|>": 100643,
  "<|LOC_347|>": 100644,
  "<|LOC_348|>": 100645,
  "<|LOC_349|>": 100646,
  "<|LOC_34|>": 100331,
  "<|LOC_350|>": 100647,
  "<|LOC_351|>": 100648,
  "<|LOC_352|>": 100649,
  "<|LOC_353|>": 100650,
  "<|LOC_354|>": 100651,
  "<|LOC_355|>": 100652,
  "<|LOC_356|>": 100653,
  "<|LOC_357|>": 100654,
  "<|LOC_358|>": 100655,
  "<|LOC_359|>": 100656,
  "<|LOC_35|>": 100332,
  "<|LOC_360|>": 100657,
  "<|LOC_361|>": 100658,
  "<|LOC_362|>": 100659,
  "<|LOC_363|>": 100660,
  "<|LOC_364|>": 100661,
  "<|LOC_365|>": 100662,
  "<|LOC_366|>": 100663,
  "<|LOC_367|>": 100664,
  "<|LOC_368|>": 100665,
  "<|LOC_369|>": 100666,
  "<|LOC_36|>": 100333,
  "<|LOC_370|>": 100667,
  "<|LOC_371|>": 100668,
  "<|LOC_372|>": 100669,
  "<|LOC_373|>": 100670,
  "<|LOC_374|>": 100671,
  "<|LOC_375|>": 100672,
  "<|LOC_376|>": 100673,
  "<|LOC_377|>": 100674,
  "<|LOC_378|>": 100675,
  "<|LOC_379|>": 100676,
  "<|LOC_37|>": 100334,
  "<|LOC_380|>": 100677,
  "<|LOC_381|>": 100678,
  "<|LOC_382|>": 100679,
  "<|LOC_383|>": 100680,
  "<|LOC_384|>": 100681,
  "<|LOC_385|>": 100682,
  "<|LOC_386|>": 100683,
  "<|LOC_387|>": 100684,
  "<|LOC_388|>": 100685,
  "<|LOC_389|>": 100686,
  "<|LOC_38|>": 100335,
  "<|LOC_390|>": 100687,
  "<|LOC_391|>": 100688,
  "<|LOC_392|>": 100689,
  "<|LOC_393|>": 100690,
  "<|LOC_394|>": 100691,
  "<|LOC_395|>": 100692,
  "<|LOC_396|>": 100693,
  "<|LOC_397|>": 100694,
  "<|LOC_398|>": 100695,
  "<|LOC_399|>": 100696,
  "<|LOC_39|>": 100336,
  "<|LOC_3|>": 100300,
  "<|LOC_400|>": 100697,
  "<|LOC_401|>": 100698,
  "<|LOC_402|>": 100699,
  "<|LOC_403|>": 100700,
  "<|LOC_404|>": 100701,
  "<|LOC_405|>": 100702,
  "<|LOC_406|>": 100703,
  "<|LOC_407|>": 100704,
  "<|LOC_408|>": 100705,
  "<|LOC_409|>": 100706,
  "<|LOC_40|>": 100337,
  "<|LOC_410|>": 100707,
  "<|LOC_411|>": 100708,
  "<|LOC_412|>": 100709,
  "<|LOC_413|>": 100710,
  "<|LOC_414|>": 100711,
  "<|LOC_415|>": 100712,
  "<|LOC_416|>": 100713,
  "<|LOC_417|>": 100714,
  "<|LOC_418|>": 100715,
  "<|LOC_419|>": 100716,
  "<|LOC_41|>": 100338,
  "<|LOC_420|>": 100717,
  "<|LOC_421|>": 100718,
  "<|LOC_422|>": 100719,
  "<|LOC_423|>": 100720,
  "<|LOC_424|>": 100721,
  "<|LOC_425|>": 100722,
  "<|LOC_426|>": 100723,
  "<|LOC_427|>": 100724,
  "<|LOC_428|>": 100725,
  "<|LOC_429|>": 100726,
  "<|LOC_42|>": 100339,
  "<|LOC_430|>": 100727,
  "<|LOC_431|>": 100728,
  "<|LOC_432|>": 100729,
  "<|LOC_433|>": 100730,
  "<|LOC_434|>": 100731,
  "<|LOC_435|>": 100732,
  "<|LOC_436|>": 100733,
  "<|LOC_437|>": 100734,
  "<|LOC_438|>": 100735,
  "<|LOC_439|>": 100736,
  "<|LOC_43|>": 100340,
  "<|LOC_440|>": 100737,
  "<|LOC_441|>": 100738,
  "<|LOC_442|>": 100739,
  "<|LOC_443|>": 100740,
  "<|LOC_444|>": 100741,
  "<|LOC_445|>": 100742,
  "<|LOC_446|>": 100743,
  "<|LOC_447|>": 100744,
  "<|LOC_448|>": 100745,
  "<|LOC_449|>": 100746,
  "<|LOC_44|>": 100341,
  "<|LOC_450|>": 100747,
  "<|LOC_451|>": 100748,
  "<|LOC_452|>": 100749,
  "<|LOC_453|>": 100750,
  "<|LOC_454|>": 100751,
  "<|LOC_455|>": 100752,
  "<|LOC_456|>": 100753,
  "<|LOC_457|>": 100754,
  "<|LOC_458|>": 100755,
  "<|LOC_459|>": 100756,
  "<|LOC_45|>": 100342,
  "<|LOC_460|>": 100757,
  "<|LOC_461|>": 100758,
  "<|LOC_462|>": 100759,
  "<|LOC_463|>": 100760,
  "<|LOC_464|>": 100761,
  "<|LOC_465|>": 100762,
  "<|LOC_466|>": 100763,
  "<|LOC_467|>": 100764,
  "<|LOC_468|>": 100765,
  "<|LOC_469|>": 100766,
  "<|LOC_46|>": 100343,
  "<|LOC_470|>": 100767,
  "<|LOC_471|>": 100768,
  "<|LOC_472|>": 100769,
  "<|LOC_473|>": 100770,
  "<|LOC_474|>": 100771,
  "<|LOC_475|>": 100772,
  "<|LOC_476|>": 100773,
  "<|LOC_477|>": 100774,
  "<|LOC_478|>": 100775,
  "<|LOC_479|>": 100776,
  "<|LOC_47|>": 100344,
  "<|LOC_480|>": 100777,
  "<|LOC_481|>": 100778,
  "<|LOC_482|>": 100779,
  "<|LOC_483|>": 100780,
  "<|LOC_484|>": 100781,
  "<|LOC_485|>": 100782,
  "<|LOC_486|>": 100783,
  "<|LOC_487|>": 100784,
  "<|LOC_488|>": 100785,
  "<|LOC_489|>": 100786,
  "<|LOC_48|>": 100345,
  "<|LOC_490|>": 100787,
  "<|LOC_491|>": 100788,
  "<|LOC_492|>": 100789,
  "<|LOC_493|>": 100790,
  "<|LOC_494|>": 100791,
  "<|LOC_495|>": 100792,
  "<|LOC_496|>": 100793,
  "<|LOC_497|>": 100794,
  "<|LOC_498|>": 100795,
  "<|LOC_499|>": 100796,
  "<|LOC_49|>": 100346,
  "<|LOC_4|>": 100301,
  "<|LOC_500|>": 100797,
  "<|LOC_501|>": 100798,
  "<|LOC_502|>": 100799,
  "<|LOC_503|>": 100800,
  "<|LOC_504|>": 100801,
  "<|LOC_505|>": 100802,
  "<|LOC_506|>": 100803,
  "<|LOC_507|>": 100804,
  "<|LOC_508|>": 100805,
  "<|LOC_509|>": 100806,
  "<|LOC_50|>": 100347,
  "<|LOC_510|>": 100807,
  "<|LOC_511|>": 100808,
  "<|LOC_512|>": 100809,
  "<|LOC_513|>": 100810,
  "<|LOC_514|>": 100811,
  "<|LOC_515|>": 100812,
  "<|LOC_516|>": 100813,
  "<|LOC_517|>": 100814,
  "<|LOC_518|>": 100815,
  "<|LOC_519|>": 100816,
  "<|LOC_51|>": 100348,
  "<|LOC_520|>": 100817,
  "<|LOC_521|>": 100818,
  "<|LOC_522|>": 100819,
  "<|LOC_523|>": 100820,
  "<|LOC_524|>": 100821,
  "<|LOC_525|>": 100822,
  "<|LOC_526|>": 100823,
  "<|LOC_527|>": 100824,
  "<|LOC_528|>": 100825,
  "<|LOC_529|>": 100826,
  "<|LOC_52|>": 100349,
  "<|LOC_530|>": 100827,
  "<|LOC_531|>": 100828,
  "<|LOC_532|>": 100829,
  "<|LOC_533|>": 100830,
  "<|LOC_534|>": 100831,
  "<|LOC_535|>": 100832,
  "<|LOC_536|>": 100833,
  "<|LOC_537|>": 100834,
  "<|LOC_538|>": 100835,
  "<|LOC_539|>": 100836,
  "<|LOC_53|>": 100350,
  "<|LOC_540|>": 100837,
  "<|LOC_541|>": 100838,
  "<|LOC_542|>": 100839,
  "<|LOC_543|>": 100840,
  "<|LOC_544|>": 100841,
  "<|LOC_545|>": 100842,
  "<|LOC_546|>": 100843,
  "<|LOC_547|>": 100844,
  "<|LOC_548|>": 100845,
  "<|LOC_549|>": 100846,
  "<|LOC_54|>": 100351,
  "<|LOC_550|>": 100847,
  "<|LOC_551|>": 100848,
  "<|LOC_552|>": 100849,
  "<|LOC_553|>": 100850,
  "<|LOC_554|>": 100851,
  "<|LOC_555|>": 100852,
  "<|LOC_556|>": 100853,
  "<|LOC_557|>": 100854,
  "<|LOC_558|>": 100855,
  "<|LOC_559|>": 100856,
  "<|LOC_55|>": 100352,
  "<|LOC_560|>": 100857,
  "<|LOC_561|>": 100858,
  "<|LOC_562|>": 100859,
  "<|LOC_563|>": 100860,
  "<|LOC_564|>": 100861,
  "<|LOC_565|>": 100862,
  "<|LOC_566|>": 100863,
  "<|LOC_567|>": 100864,
  "<|LOC_568|>": 100865,
  "<|LOC_569|>": 100866,
  "<|LOC_56|>": 100353,
  "<|LOC_570|>": 100867,
  "<|LOC_571|>": 100868,
  "<|LOC_572|>": 100869,
  "<|LOC_573|>": 100870,
  "<|LOC_574|>": 100871,
  "<|LOC_575|>": 100872,
  "<|LOC_576|>": 100873,
  "<|LOC_577|>": 100874,
  "<|LOC_578|>": 100875,
  "<|LOC_579|>": 100876,
  "<|LOC_57|>": 100354,
  "<|LOC_580|>": 100877,
  "<|LOC_581|>": 100878,
  "<|LOC_582|>": 100879,
  "<|LOC_583|>": 100880,
  "<|LOC_584|>": 100881,
  "<|LOC_585|>": 100882,
  "<|LOC_586|>": 100883,
  "<|LOC_587|>": 100884,
  "<|LOC_588|>": 100885,
  "<|LOC_589|>": 100886,
  "<|LOC_58|>": 100355,
  "<|LOC_590|>": 100887,
  "<|LOC_591|>": 100888,
  "<|LOC_592|>": 100889,
  "<|LOC_593|>": 100890,
  "<|LOC_594|>": 100891,
  "<|LOC_595|>": 100892,
  "<|LOC_596|>": 100893,
  "<|LOC_597|>": 100894,
  "<|LOC_598|>": 100895,
  "<|LOC_599|>": 100896,
  "<|LOC_59|>": 100356,
  "<|LOC_5|>": 100302,
  "<|LOC_600|>": 100897,
  "<|LOC_601|>": 100898,
  "<|LOC_602|>": 100899,
  "<|LOC_603|>": 100900,
  "<|LOC_604|>": 100901,
  "<|LOC_605|>": 100902,
  "<|LOC_606|>": 100903,
  "<|LOC_607|>": 100904,
  "<|LOC_608|>": 100905,
  "<|LOC_609|>": 100906,
  "<|LOC_60|>": 100357,
  "<|LOC_610|>": 100907,
  "<|LOC_611|>": 100908,
  "<|LOC_612|>": 100909,
  "<|LOC_613|>": 100910,
  "<|LOC_614|>": 100911,
  "<|LOC_615|>": 100912,
  "<|LOC_616|>": 100913,
  "<|LOC_617|>": 100914,
  "<|LOC_618|>": 100915,
  "<|LOC_619|>": 100916,
  "<|LOC_61|>": 100358,
  "<|LOC_620|>": 100917,
  "<|LOC_621|>": 100918,
  "<|LOC_622|>": 100919,
  "<|LOC_623|>": 100920,
  "<|LOC_624|>": 100921,
  "<|LOC_625|>": 100922,
  "<|LOC_626|>": 100923,
  "<|LOC_627|>": 100924,
  "<|LOC_628|>": 100925,
  "<|LOC_629|>": 100926,
  "<|LOC_62|>": 100359,
  "<|LOC_630|>": 100927,
  "<|LOC_631|>": 100928,
  "<|LOC_632|>": 100929,
  "<|LOC_633|>": 100930,
  "<|LOC_634|>": 100931,
  "<|LOC_635|>": 100932,
  "<|LOC_636|>": 100933,
  "<|LOC_637|>": 100934,
  "<|LOC_638|>": 100935,
  "<|LOC_639|>": 100936,
  "<|LOC_63|>": 100360,
  "<|LOC_640|>": 100937,
  "<|LOC_641|>": 100938,
  "<|LOC_642|>": 100939,
  "<|LOC_643|>": 100940,
  "<|LOC_644|>": 100941,
  "<|LOC_645|>": 100942,
  "<|LOC_646|>": 100943,
  "<|LOC_647|>": 100944,
  "<|LOC_648|>": 100945,
  "<|LOC_649|>": 100946,
  "<|LOC_64|>": 100361,
  "<|LOC_650|>": 100947,
  "<|LOC_651|>": 100948,
  "<|LOC_652|>": 100949,
  "<|LOC_653|>": 100950,
  "<|LOC_654|>": 100951,
  "<|LOC_655|>": 100952,
  "<|LOC_656|>": 100953,
  "<|LOC_657|>": 100954,
  "<|LOC_658|>": 100955,
  "<|LOC_659|>": 100956,
  "<|LOC_65|>": 100362,
  "<|LOC_660|>": 100957,
  "<|LOC_661|>": 100958,
  "<|LOC_662|>": 100959,
  "<|LOC_663|>": 100960,
  "<|LOC_664|>": 100961,
  "<|LOC_665|>": 100962,
  "<|LOC_666|>": 100963,
  "<|LOC_667|>": 100964,
  "<|LOC_668|>": 100965,
  "<|LOC_669|>": 100966,
  "<|LOC_66|>": 100363,
  "<|LOC_670|>": 100967,
  "<|LOC_671|>": 100968,
  "<|LOC_672|>": 100969,
  "<|LOC_673|>": 100970,
  "<|LOC_674|>": 100971,
  "<|LOC_675|>": 100972,
  "<|LOC_676|>": 100973,
  "<|LOC_677|>": 100974,
  "<|LOC_678|>": 100975,
  "<|LOC_679|>": 100976,
  "<|LOC_67|>": 100364,
  "<|LOC_680|>": 100977,
  "<|LOC_681|>": 100978,
  "<|LOC_682|>": 100979,
  "<|LOC_683|>": 100980,
  "<|LOC_684|>": 100981,
  "<|LOC_685|>": 100982,
  "<|LOC_686|>": 100983,
  "<|LOC_687|>": 100984,
  "<|LOC_688|>": 100985,
  "<|LOC_689|>": 100986,
  "<|LOC_68|>": 100365,
  "<|LOC_690|>": 100987,
  "<|LOC_691|>": 100988,
  "<|LOC_692|>": 100989,
  "<|LOC_693|>": 100990,
  "<|LOC_694|>": 100991,
  "<|LOC_695|>": 100992,
  "<|LOC_696|>": 100993,
  "<|LOC_697|>": 100994,
  "<|LOC_698|>": 100995,
  "<|LOC_699|>": 100996,
  "<|LOC_69|>": 100366,
  "<|LOC_6|>": 100303,
  "<|LOC_700|>": 100997,
  "<|LOC_701|>": 100998,
  "<|LOC_702|>": 100999,
  "<|LOC_703|>": 101000,
  "<|LOC_704|>": 101001,
  "<|LOC_705|>": 101002,
  "<|LOC_706|>": 101003,
  "<|LOC_707|>": 101004,
  "<|LOC_708|>": 101005,
  "<|LOC_709|>": 101006,
  "<|LOC_70|>": 100367,
  "<|LOC_710|>": 101007,
  "<|LOC_711|>": 101008,
  "<|LOC_712|>": 101009,
  "<|LOC_713|>": 101010,
  "<|LOC_714|>": 101011,
  "<|LOC_715|>": 101012,
  "<|LOC_716|>": 101013,
  "<|LOC_717|>": 101014,
  "<|LOC_718|>": 101015,
  "<|LOC_719|>": 101016,
  "<|LOC_71|>": 100368,
  "<|LOC_720|>": 101017,
  "<|LOC_721|>": 101018,
  "<|LOC_722|>": 101019,
  "<|LOC_723|>": 101020,
  "<|LOC_724|>": 101021,
  "<|LOC_725|>": 101022,
  "<|LOC_726|>": 101023,
  "<|LOC_727|>": 101024,
  "<|LOC_728|>": 101025,
  "<|LOC_729|>": 101026,
  "<|LOC_72|>": 100369,
  "<|LOC_730|>": 101027,
  "<|LOC_731|>": 101028,
  "<|LOC_732|>": 101029,
  "<|LOC_733|>": 101030,
  "<|LOC_734|>": 101031,
  "<|LOC_735|>": 101032,
  "<|LOC_736|>": 101033,
  "<|LOC_737|>": 101034,
  "<|LOC_738|>": 101035,
  "<|LOC_739|>": 101036,
  "<|LOC_73|>": 100370,
  "<|LOC_740|>": 101037,
  "<|LOC_741|>": 101038,
  "<|LOC_742|>": 101039,
  "<|LOC_743|>": 101040,
  "<|LOC_744|>": 101041,
  "<|LOC_745|>": 101042,
  "<|LOC_746|>": 101043,
  "<|LOC_747|>": 101044,
  "<|LOC_748|>": 101045,
  "<|LOC_749|>": 101046,
  "<|LOC_74|>": 100371,
  "<|LOC_750|>": 101047,
  "<|LOC_751|>": 101048,
  "<|LOC_752|>": 101049,
  "<|LOC_753|>": 101050,
  "<|LOC_754|>": 101051,
  "<|LOC_755|>": 101052,
  "<|LOC_756|>": 101053,
  "<|LOC_757|>": 101054,
  "<|LOC_758|>": 101055,
  "<|LOC_759|>": 101056,
  "<|LOC_75|>": 100372,
  "<|LOC_760|>": 101057,
  "<|LOC_761|>": 101058,
  "<|LOC_762|>": 101059,
  "<|LOC_763|>": 101060,
  "<|LOC_764|>": 101061,
  "<|LOC_765|>": 101062,
  "<|LOC_766|>": 101063,
  "<|LOC_767|>": 101064,
  "<|LOC_768|>": 101065,
  "<|LOC_769|>": 101066,
  "<|LOC_76|>": 100373,
  "<|LOC_770|>": 101067,
  "<|LOC_771|>": 101068,
  "<|LOC_772|>": 101069,
  "<|LOC_773|>": 101070,
  "<|LOC_774|>": 101071,
  "<|LOC_775|>": 101072,
  "<|LOC_776|>": 101073,
  "<|LOC_777|>": 101074,
  "<|LOC_778|>": 101075,
  "<|LOC_779|>": 101076,
  "<|LOC_77|>": 100374,
  "<|LOC_780|>": 101077,
  "<|LOC_781|>": 101078,
  "<|LOC_782|>": 101079,
  "<|LOC_783|>": 101080,
  "<|LOC_784|>": 101081,
  "<|LOC_785|>": 101082,
  "<|LOC_786|>": 101083,
  "<|LOC_787|>": 101084,
  "<|LOC_788|>": 101085,
  "<|LOC_789|>": 101086,
  "<|LOC_78|>": 100375,
  "<|LOC_790|>": 101087,
  "<|LOC_791|>": 101088,
  "<|LOC_792|>": 101089,
  "<|LOC_793|>": 101090,
  "<|LOC_794|>": 101091,
  "<|LOC_795|>": 101092,
  "<|LOC_796|>": 101093,
  "<|LOC_797|>": 101094,
  "<|LOC_798|>": 101095,
  "<|LOC_799|>": 101096,
  "<|LOC_79|>": 100376,
  "<|LOC_7|>": 100304,
  "<|LOC_800|>": 101097,
  "<|LOC_801|>": 101098,
  "<|LOC_802|>": 101099,
  "<|LOC_803|>": 101100,
  "<|LOC_804|>": 101101,
  "<|LOC_805|>": 101102,
  "<|LOC_806|>": 101103,
  "<|LOC_807|>": 101104,
  "<|LOC_808|>": 101105,
  "<|LOC_809|>": 101106,
  "<|LOC_80|>": 100377,
  "<|LOC_810|>": 101107,
  "<|LOC_811|>": 101108,
  "<|LOC_812|>": 101109,
  "<|LOC_813|>": 101110,
  "<|LOC_814|>": 101111,
  "<|LOC_815|>": 101112,
  "<|LOC_816|>": 101113,
  "<|LOC_817|>": 101114,
  "<|LOC_818|>": 101115,
  "<|LOC_819|>": 101116,
  "<|LOC_81|>": 100378,
  "<|LOC_820|>": 101117,
  "<|LOC_821|>": 101118,
  "<|LOC_822|>": 101119,
  "<|LOC_823|>": 101120,
  "<|LOC_824|>": 101121,
  "<|LOC_825|>": 101122,
  "<|LOC_826|>": 101123,
  "<|LOC_827|>": 101124,
  "<|LOC_828|>": 101125,
  "<|LOC_829|>": 101126,
  "<|LOC_82|>": 100379,
  "<|LOC_830|>": 101127,
  "<|LOC_831|>": 101128,
  "<|LOC_832|>": 101129,
  "<|LOC_833|>": 101130,
  "<|LOC_834|>": 101131,
  "<|LOC_835|>": 101132,
  "<|LOC_836|>": 101133,
  "<|LOC_837|>": 101134,
  "<|LOC_838|>": 101135,
  "<|LOC_839|>": 101136,
  "<|LOC_83|>": 100380,
  "<|LOC_840|>": 101137,
  "<|LOC_841|>": 101138,
  "<|LOC_842|>": 101139,
  "<|LOC_843|>": 101140,
  "<|LOC_844|>": 101141,
  "<|LOC_845|>": 101142,
  "<|LOC_846|>": 101143,
  "<|LOC_847|>": 101144,
  "<|LOC_848|>": 101145,
  "<|LOC_849|>": 101146,
  "<|LOC_84|>": 100381,
  "<|LOC_850|>": 101147,
  "<|LOC_851|>": 101148,
  "<|LOC_852|>": 101149,
  "<|LOC_853|>": 101150,
  "<|LOC_854|>": 101151,
  "<|LOC_855|>": 101152,
  "<|LOC_856|>": 101153,
  "<|LOC_857|>": 101154,
  "<|LOC_858|>": 101155,
  "<|LOC_859|>": 101156,
  "<|LOC_85|>": 100382,
  "<|LOC_860|>": 101157,
  "<|LOC_861|>": 101158,
  "<|LOC_862|>": 101159,
  "<|LOC_863|>": 101160,
  "<|LOC_864|>": 101161,
  "<|LOC_865|>": 101162,
  "<|LOC_866|>": 101163,
  "<|LOC_867|>": 101164,
  "<|LOC_868|>": 101165,
  "<|LOC_869|>": 101166,
  "<|LOC_86|>": 100383,
  "<|LOC_870|>": 101167,
  "<|LOC_871|>": 101168,
  "<|LOC_872|>": 101169,
  "<|LOC_873|>": 101170,
  "<|LOC_874|>": 101171,
  "<|LOC_875|>": 101172,
  "<|LOC_876|>": 101173,
  "<|LOC_877|>": 101174,
  "<|LOC_878|>": 101175,
  "<|LOC_879|>": 101176,
  "<|LOC_87|>": 100384,
  "<|LOC_880|>": 101177,
  "<|LOC_881|>": 101178,
  "<|LOC_882|>": 101179,
  "<|LOC_883|>": 101180,
  "<|LOC_884|>": 101181,
  "<|LOC_885|>": 101182,
  "<|LOC_886|>": 101183,
  "<|LOC_887|>": 101184,
  "<|LOC_888|>": 101185,
  "<|LOC_889|>": 101186,
  "<|LOC_88|>": 100385,
  "<|LOC_890|>": 101187,
  "<|LOC_891|>": 101188,
  "<|LOC_892|>": 101189,
  "<|LOC_893|>": 101190,
  "<|LOC_894|>": 101191,
  "<|LOC_895|>": 101192,
  "<|LOC_896|>": 101193,
  "<|LOC_897|>": 101194,
  "<|LOC_898|>": 101195,
  "<|LOC_899|>": 101196,
  "<|LOC_89|>": 100386,
  "<|LOC_8|>": 100305,
  "<|LOC_900|>": 101197,
  "<|LOC_901|>": 101198,
  "<|LOC_902|>": 101199,
  "<|LOC_903|>": 101200,
  "<|LOC_904|>": 101201,
  "<|LOC_905|>": 101202,
  "<|LOC_906|>": 101203,
  "<|LOC_907|>": 101204,
  "<|LOC_908|>": 101205,
  "<|LOC_909|>": 101206,
  "<|LOC_90|>": 100387,
  "<|LOC_910|>": 101207,
  "<|LOC_911|>": 101208,
  "<|LOC_912|>": 101209,
  "<|LOC_913|>": 101210,
  "<|LOC_914|>": 101211,
  "<|LOC_915|>": 101212,
  "<|LOC_916|>": 101213,
  "<|LOC_917|>": 101214,
  "<|LOC_918|>": 101215,
  "<|LOC_919|>": 101216,
  "<|LOC_91|>": 100388,
  "<|LOC_920|>": 101217,
  "<|LOC_921|>": 101218,
  "<|LOC_922|>": 101219,
  "<|LOC_923|>": 101220,
  "<|LOC_924|>": 101221,
  "<|LOC_925|>": 101222,
  "<|LOC_926|>": 101223,
  "<|LOC_927|>": 101224,
  "<|LOC_928|>": 101225,
  "<|LOC_929|>": 101226,
  "<|LOC_92|>": 100389,
  "<|LOC_930|>": 101227,
  "<|LOC_931|>": 101228,
  "<|LOC_932|>": 101229,
  "<|LOC_933|>": 101230,
  "<|LOC_934|>": 101231,
  "<|LOC_935|>": 101232,
  "<|LOC_936|>": 101233,
  "<|LOC_937|>": 101234,
  "<|LOC_938|>": 101235,
  "<|LOC_939|>": 101236,
  "<|LOC_93|>": 100390,
  "<|LOC_940|>": 101237,
  "<|LOC_941|>": 101238,
  "<|LOC_942|>": 101239,
  "<|LOC_943|>": 101240,
  "<|LOC_944|>": 101241,
  "<|LOC_945|>": 101242,
  "<|LOC_946|>": 101243,
  "<|LOC_947|>": 101244,
  "<|LOC_948|>": 101245,
  "<|LOC_949|>": 101246,
  "<|LOC_94|>": 100391,
  "<|LOC_950|>": 101247,
  "<|LOC_951|>": 101248,
  "<|LOC_952|>": 101249,
  "<|LOC_953|>": 101250,
  "<|LOC_954|>": 101251,
  "<|LOC_955|>": 101252,
  "<|LOC_956|>": 101253,
  "<|LOC_957|>": 101254,
  "<|LOC_958|>": 101255,
  "<|LOC_959|>": 101256,
  "<|LOC_95|>": 100392,
  "<|LOC_960|>": 101257,
  "<|LOC_961|>": 101258,
  "<|LOC_962|>": 101259,
  "<|LOC_963|>": 101260,
  "<|LOC_964|>": 101261,
  "<|LOC_965|>": 101262,
  "<|LOC_966|>": 101263,
  "<|LOC_967|>": 101264,
  "<|LOC_968|>": 101265,
  "<|LOC_969|>": 101266,
  "<|LOC_96|>": 100393,
  "<|LOC_970|>": 101267,
  "<|LOC_971|>": 101268,
  "<|LOC_972|>": 101269,
  "<|LOC_973|>": 101270,
  "<|LOC_974|>": 101271,
  "<|LOC_975|>": 101272,
  "<|LOC_976|>": 101273,
  "<|LOC_977|>": 101274,
  "<|LOC_978|>": 101275,
  "<|LOC_979|>": 101276,
  "<|LOC_97|>": 100394,
  "<|LOC_980|>": 101277,
  "<|LOC_981|>": 101278,
  "<|LOC_982|>": 101279,
  "<|LOC_983|>": 101280,
  "<|LOC_984|>": 101281,
  "<|LOC_985|>": 101282,
  "<|LOC_986|>": 101283,
  "<|LOC_987|>": 101284,
  "<|LOC_988|>": 101285,
  "<|LOC_989|>": 101286,
  "<|LOC_98|>": 100395,
  "<|LOC_990|>": 101287,
  "<|LOC_991|>": 101288,
  "<|LOC_992|>": 101289,
  "<|LOC_993|>": 101290,
  "<|LOC_994|>": 101291,
  "<|LOC_995|>": 101292,
  "<|LOC_996|>": 101293,
  "<|LOC_997|>": 101294,
  "<|LOC_998|>": 101295,
  "<|LOC_999|>": 101296,
  "<|LOC_99|>": 100396,
  "<|LOC_9|>": 100306,
  "<|LOC_BEGIN|>": 101298,
  "<|LOC_END|>": 101299,
  "<|LOC_SEP|>": 101300,
  "<|image_pad|>": 101304,
  "<|video_pad|>": 101307
}
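The ids 100297 (`<|LOC_0|>`) through 101297 (`<|LOC_1000|>`) form one contiguous block, which looks like the common scheme of quantizing box coordinates into 1001 bins delimited by `<|LOC_BEGIN|>`/`<|LOC_END|>`/`<|LOC_SEP|>`. Assuming that convention (the files here do not document it), a decoded id maps back to a fractional position:

```python
# Hedged sketch: convert a LOC token id back to a fractional coordinate,
# ASSUMING the 0..1000 bins encode positions scaled to [0, 1]. This
# convention is inferred from the vocabulary, not documented in the repo.
LOC_0_ID, LOC_1000_ID = 100297, 101297  # from added_tokens.json

def loc_id_to_fraction(token_id: int) -> float:
    if not LOC_0_ID <= token_id <= LOC_1000_ID:
        raise ValueError(f"{token_id} is not a <|LOC_k|> token id")
    return (token_id - LOC_0_ID) / 1000.0
```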
chat_template.jinja
ADDED
@@ -0,0 +1,46 @@
{%- if not add_generation_prompt is defined -%}
{%- set add_generation_prompt = true -%}
{%- endif -%}
{%- if not cls_token is defined -%}
{%- set cls_token = "<|begin_of_sentence|>" -%}
{%- endif -%}
{%- if not eos_token is defined -%}
{%- set eos_token = "</s>" -%}
{%- endif -%}
{%- if not image_token is defined -%}
{%- set image_token = "<|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>" -%}
{%- endif -%}
{{- cls_token -}}
{%- for message in messages -%}
{%- if message["role"] == "user" -%}
{{- "User: " -}}
{%- for content in message["content"] -%}
{%- if content["type"] == "image" -%}
{{ image_token }}
{%- endif -%}
{%- endfor -%}
{%- for content in message["content"] -%}
{%- if content["type"] == "text" -%}
{{ content["text"] }}
{%- endif -%}
{%- endfor -%}
{{ "\n" -}}
{%- elif message["role"] == "assistant" -%}
{{- "Assistant: " -}}
{%- for content in message["content"] -%}
{%- if content["type"] == "text" -%}
{{ content["text"] }}
{%- endif -%}
{%- endfor -%}
{{ eos_token -}}
{%- elif message["role"] == "system" -%}
{%- for content in message["content"] -%}
{%- if content["type"] == "text" -%}
{{ content["text"] + "\n" }}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- "Assistant: " -}}
{%- endif -%}
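As a worked example, rendering the template above with plain jinja2 on a single-image user turn (file path assumed) shows the exact prompt string the model sees:

```python
# Render the chat template above to inspect the exact prompt string.
from jinja2 import Template

template = Template(open("chat_template.jinja").read())
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "OCR:"}]}]
print(repr(template.render(messages=messages)))
# '<|begin_of_sentence|>User: <|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>OCR:\nAssistant: '
```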
config.json
ADDED
@@ -0,0 +1,74 @@
{
  "architectures": [
    "PaddleOCRVLForConditionalGeneration"
  ],
  "attention_probs_dropout_prob": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_paddleocr_vl.PaddleOCRVLConfig",
    "AutoModel": "modeling_paddleocr_vl.PaddleOCRVLForConditionalGeneration",
    "AutoModelForCausalLM": "modeling_paddleocr_vl.PaddleOCRVLForConditionalGeneration"
  },
  "compression_ratio": 1.0,
  "dtype": "bfloat16",
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 1024,
  "ignored_index": -100,
  "image_token_id": 100295,
  "intermediate_size": 3072,
  "max_position_embeddings": 131072,
  "max_sequence_length": null,
  "model_type": "paddleocr_vl",
  "num_attention_heads": 16,
  "num_hidden_layers": 18,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-05,
  "rope_is_neox_style": true,
  "rope_scaling": {
    "mrope_section": [
      16,
      24,
      24
    ],
    "rope_type": "default",
    "type": "default"
  },
  "rope_theta": 500000,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.1",
  "use_3d_rope": true,
  "use_bias": false,
  "use_cache": false,
  "use_flash_attention": false,
  "video_token_id": 101307,
  "vision_config": {
    "architectures": [
      "SiglipVisionModel"
    ],
    "attention_dropout": 0.0,
    "auto_map": {
      "AutoConfig": "configuration_paddleocr_vl.PaddleOCRVLConfig",
      "AutoModel": "modeling_paddleocr_vl.SiglipVisionModel"
    },
    "dtype": "bfloat16",
    "hidden_act": "gelu_pytorch_tanh",
    "hidden_size": 1152,
    "image_size": 384,
    "intermediate_size": 4304,
    "layer_norm_eps": 1e-06,
    "model_type": "paddleocr_vl",
    "num_attention_heads": 16,
    "num_channels": 3,
    "num_hidden_layers": 27,
    "pad_token_id": 0,
    "patch_size": 14,
    "spatial_merge_size": 2,
    "temporal_patch_size": 2,
    "tokens_per_second": 2
  },
  "vision_start_token_id": 101305,
  "vocab_size": 103424,
  "weight_share_add_bias": true
}
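One detail worth checking in this config: `mrope_section` is `[16, 24, 24]`, which sums to 64, exactly `head_dim / 2`, consistent with the usual multimodal-RoPE layout that splits the rotary frequency pairs across temporal, height, and width position ids (a plausible reading; the file itself does not explain the field):

```python
# Sanity check of the mRoPE split in config.json: the three sections
# (temporal, height, width in the usual layout) must cover all
# head_dim / 2 rotary frequency pairs.
mrope_section = [16, 24, 24]
head_dim = 128
assert sum(mrope_section) == head_dim // 2  # 64 frequency pairs
```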
configuration_paddleocr_vl.py
ADDED
@@ -0,0 +1,191 @@
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from transformers.configuration_utils import PretrainedConfig
from transformers.modeling_rope_utils import rope_config_validation


class PaddleOCRVisionConfig(PretrainedConfig):
    model_type = "paddleocr_vl"
    base_config_key = "vision_config"

    def __init__(
        self,
        hidden_size=768,
        intermediate_size=3072,
        num_hidden_layers=12,
        num_attention_heads=12,
        num_channels=3,
        image_size=224,
        patch_size=14,
        hidden_act="gelu_pytorch_tanh",
        layer_norm_eps=1e-6,
        attention_dropout=0.0,
        spatial_merge_size=2,
        temporal_patch_size=2,
        tokens_per_second=2,
        **kwargs,
    ):
        super().__init__(**kwargs)

        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.num_channels = num_channels
        self.patch_size = patch_size
        self.image_size = image_size
        self.attention_dropout = attention_dropout
        self.layer_norm_eps = layer_norm_eps
        self.hidden_act = hidden_act
        self.spatial_merge_size = spatial_merge_size
        self.temporal_patch_size = temporal_patch_size
        self.tokens_per_second = tokens_per_second


class PaddleOCRVLConfig(PretrainedConfig):
    """
    Configuration class.

    This class stores the configuration of an Ernie model, defining the model architecture.
    It inherits from PretrainedConfig and can be used to control model outputs.
    """

    model_type = "paddleocr_vl"
    keys_to_ignore_at_inference = ["past_key_values"]
    sub_configs = {"vision_config": PaddleOCRVisionConfig}

    # Default tensor parallel plan for base model `Qwen3`
    base_model_tp_plan = {
        "layers.*.self_attn.q_proj": "colwise",
        "layers.*.self_attn.k_proj": "colwise",
        "layers.*.self_attn.v_proj": "colwise",
        "layers.*.self_attn.o_proj": "rowwise",
        "layers.*.mlp.gate_proj": "colwise",
        "layers.*.mlp.up_proj": "colwise",
        "layers.*.mlp.down_proj": "rowwise",
    }
    base_model_pp_plan = {
        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
        "norm": (["hidden_states"], ["hidden_states"]),
    }

    def __init__(
        self,
        vocab_size=32000,
        hidden_size=768,
        intermediate_size=11008,
        max_position_embeddings=32768,
        num_hidden_layers=2,
        num_attention_heads=2,
        image_token_id=101304,
        video_token_id=101305,
        vision_start_token_id=101306,
        rms_norm_eps=1e-6,
        use_cache=False,
        use_flash_attention=False,
        pad_token_id=0,
        bos_token_id=1,
        eos_token_id=2,
        head_dim=128,
        hidden_act="silu",
        use_bias=False,
        rope_theta=10000,
        weight_share_add_bias=True,
        ignored_index=-100,
        attention_probs_dropout_prob=0.0,
        hidden_dropout_prob=0.0,
        compression_ratio: float = 1.0,
        num_key_value_heads=None,
        max_sequence_length=None,
        tie_word_embeddings=False,
        vision_config=None,
        rope_scaling=None,
        **kwargs,
    ):
        """
        Initialize configuration with default or specified parameters.

        Args:
            vocab_size (int): Size of the vocabulary (number of unique tokens)
            hidden_size (int): Dimensionality of the encoder layers and the pooler layer
            intermediate_size (int): Dimensionality of the "intermediate" (feed-forward) layer
            max_position_embeddings (int): Maximum sequence length the model can handle
            num_hidden_layers (int): Number of hidden layers in the Transformer encoder
            num_attention_heads (int): Number of attention heads for each attention layer
            rms_norm_eps (float): The epsilon used by the RMS normalization layers
            use_cache (bool): Whether to use caching for faster generation (decoding)
            use_flash_attention (bool): Whether to use FlashAttention for optimized attention computation
            pad_token_id (int): Token ID used for padding sequences
| 10 |
+
# distributed under the License is distributed on an "AS IS" BASIS,
|
| 11 |
+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
| 12 |
+
# See the License for the specific language governing permissions and
|
| 13 |
+
# limitations under the License.
|
| 14 |
+
|
| 15 |
+
from transformers.configuration_utils import PretrainedConfig
|
| 16 |
+
from transformers.modeling_rope_utils import rope_config_validation
|
| 17 |
+
|
| 18 |
+
class PaddleOCRVisionConfig(PretrainedConfig):
|
| 19 |
+
model_type = "paddleocr_vl"
|
| 20 |
+
base_config_key = "vision_config"
|
| 21 |
+
|
| 22 |
+
def __init__(
|
| 23 |
+
self,
|
| 24 |
+
hidden_size=768,
|
| 25 |
+
intermediate_size=3072,
|
| 26 |
+
num_hidden_layers=12,
|
| 27 |
+
num_attention_heads=12,
|
| 28 |
+
num_channels=3,
|
| 29 |
+
image_size=224,
|
| 30 |
+
patch_size=14,
|
| 31 |
+
hidden_act="gelu_pytorch_tanh",
|
| 32 |
+
layer_norm_eps=1e-6,
|
| 33 |
+
attention_dropout=0.0,
|
| 34 |
+
spatial_merge_size=2,
|
| 35 |
+
temporal_patch_size=2,
|
| 36 |
+
tokens_per_second=2,
|
| 37 |
+
**kwargs,
|
| 38 |
+
):
|
| 39 |
+
super().__init__(**kwargs)
|
| 40 |
+
|
| 41 |
+
self.hidden_size = hidden_size
|
| 42 |
+
self.intermediate_size = intermediate_size
|
| 43 |
+
self.num_hidden_layers = num_hidden_layers
|
| 44 |
+
self.num_attention_heads = num_attention_heads
|
| 45 |
+
self.num_channels = num_channels
|
| 46 |
+
self.patch_size = patch_size
|
| 47 |
+
self.image_size = image_size
|
| 48 |
+
self.attention_dropout = attention_dropout
|
| 49 |
+
self.layer_norm_eps = layer_norm_eps
|
| 50 |
+
self.hidden_act = hidden_act
|
| 51 |
+
self.spatial_merge_size = spatial_merge_size
|
| 52 |
+
self.temporal_patch_size = temporal_patch_size
|
| 53 |
+
self.tokens_per_second = tokens_per_second
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
class PaddleOCRVLConfig(PretrainedConfig):
|
| 58 |
+
"""
|
| 59 |
+
Configuration class.
|
| 60 |
+
|
| 61 |
+
This class stores the configuration of an Ernie model, defining the model architecture.
|
| 62 |
+
It inherits from PretrainedConfig and can be used to control model outputs.
|
| 63 |
+
"""
|
| 64 |
+
|
| 65 |
+
model_type = "paddleocr_vl"
|
| 66 |
+
keys_to_ignore_at_inference = ["past_key_values"]
|
| 67 |
+
sub_configs = {"vision_config": PaddleOCRVisionConfig}
|
| 68 |
+
|
| 69 |
+
# Default tensor parallel plan for base model `Qwen3`
|
| 70 |
+
base_model_tp_plan = {
|
| 71 |
+
"layers.*.self_attn.q_proj": "colwise",
|
| 72 |
+
"layers.*.self_attn.k_proj": "colwise",
|
| 73 |
+
"layers.*.self_attn.v_proj": "colwise",
|
| 74 |
+
"layers.*.self_attn.o_proj": "rowwise",
|
| 75 |
+
"layers.*.mlp.gate_proj": "colwise",
|
| 76 |
+
"layers.*.mlp.up_proj": "colwise",
|
| 77 |
+
"layers.*.mlp.down_proj": "rowwise",
|
| 78 |
+
}
|
| 79 |
+
base_model_pp_plan = {
|
| 80 |
+
"embed_tokens": (["input_ids"], ["inputs_embeds"]),
|
| 81 |
+
"layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
|
| 82 |
+
"norm": (["hidden_states"], ["hidden_states"]),
|
| 83 |
+
}
|
| 84 |
+
|
| 85 |
+
def __init__(
|
| 86 |
+
self,
|
| 87 |
+
vocab_size=32000,
|
| 88 |
+
hidden_size=768,
|
| 89 |
+
intermediate_size=11008,
|
| 90 |
+
max_position_embeddings=32768,
|
| 91 |
+
num_hidden_layers=2,
|
| 92 |
+
num_attention_heads=2,
|
| 93 |
+
image_token_id=101304,
|
| 94 |
+
video_token_id=101305,
|
| 95 |
+
vision_start_token_id=101306,
|
| 96 |
+
rms_norm_eps=1e-6,
|
| 97 |
+
use_cache=False,
|
| 98 |
+
use_flash_attention=False,
|
| 99 |
+
pad_token_id=0,
|
| 100 |
+
bos_token_id=1,
|
| 101 |
+
eos_token_id=2,
|
| 102 |
+
head_dim=128,
|
| 103 |
+
hidden_act="silu",
|
| 104 |
+
use_bias=False,
|
| 105 |
+
rope_theta=10000,
|
| 106 |
+
weight_share_add_bias=True,
|
| 107 |
+
ignored_index=-100,
|
| 108 |
+
attention_probs_dropout_prob=0.0,
|
| 109 |
+
hidden_dropout_prob=0.0,
|
| 110 |
+
compression_ratio: float = 1.0,
|
| 111 |
+
num_key_value_heads=None,
|
| 112 |
+
max_sequence_length=None,
|
| 113 |
+
tie_word_embeddings=False,
|
| 114 |
+
vision_config=None,
|
| 115 |
+
rope_scaling=None,
|
| 116 |
+
**kwargs,
|
| 117 |
+
):
|
| 118 |
+
"""
|
| 119 |
+
Initialize configuration with default or specified parameters.
|
| 120 |
+
|
| 121 |
+
Args:
|
| 122 |
+
vocab_size (int): Size of the vocabulary (number of unique tokens)
|
| 123 |
+
hidden_size (int): Dimensionality of the encoder layers and the pooler layer
|
| 124 |
+
intermediate_size (int): Dimensionality of the "intermediate" (feed-forward) layer
|
| 125 |
+
max_position_embeddings (int): Maximum sequence length the model can handle
|
| 126 |
+
num_hidden_layers (int): Number of hidden layers in the Transformer encoder
|
| 127 |
+
num_attention_heads (int): Number of attention heads for each attention layer
|
| 128 |
+
rms_norm_eps (float): The epsilon used by the RMS normalization layers
|
| 129 |
+
use_cache (bool): Whether to use caching for faster generation (decoding)
|
| 130 |
+
use_flash_attention (bool): Whether to use FlashAttention for optimized attention computation
|
| 131 |
+
pad_token_id (int): Token ID used for padding sequences
|
| 132 |
+
bos_token_id (int): Token ID used for beginning-of-sequence
|
| 133 |
+
eos_token_id (int): Token ID used for end-of-sequence
|
| 134 |
+
use_bias (bool): Whether to use bias terms in linear layers
|
| 135 |
+
rope_theta (float): The base period of the RoPE embeddings
|
| 136 |
+
weight_share_add_bias (bool): Whether to share bias weights in certain layers
|
| 137 |
+
ignored_index (int): Target value that is ignored during loss computation
|
| 138 |
+
attention_probs_dropout_prob (float): Dropout probability for attention weights
|
| 139 |
+
hidden_dropout_prob (float): Dropout probability for hidden layers
|
| 140 |
+
compression_ratio (float): Ratio for KV cache compression (1.0 = no compression)
|
| 141 |
+
num_key_value_heads (int): Number of key/value heads (for Grouped Query Attention)
|
| 142 |
+
max_sequence_length (int): Maximum sequence length for positional embeddings
|
| 143 |
+
**kwargs: Additional keyword arguments passed to parent class
|
| 144 |
+
"""
|
| 145 |
+
|
| 146 |
+
# Set default for tied embeddings if not specified.
|
| 147 |
+
super().__init__(
|
| 148 |
+
pad_token_id=pad_token_id,
|
| 149 |
+
bos_token_id=bos_token_id,
|
| 150 |
+
eos_token_id=eos_token_id,
|
| 151 |
+
**kwargs,
|
| 152 |
+
)
|
| 153 |
+
if isinstance(vision_config, dict):
|
| 154 |
+
self.vision_config = self.sub_configs["vision_config"](**vision_config)
|
| 155 |
+
elif vision_config is None:
|
| 156 |
+
self.vision_config = self.sub_configs["vision_config"]()
|
| 157 |
+
self.vocab_size = vocab_size
|
| 158 |
+
self.hidden_size = hidden_size
|
| 159 |
+
self.intermediate_size = intermediate_size
|
| 160 |
+
self.max_position_embeddings = max_position_embeddings
|
| 161 |
+
self.num_hidden_layers = num_hidden_layers
|
| 162 |
+
self.num_attention_heads = num_attention_heads
|
| 163 |
+
self.rms_norm_eps = rms_norm_eps
|
| 164 |
+
self.use_cache = use_cache
|
| 165 |
+
self.use_flash_attention = use_flash_attention
|
| 166 |
+
self.pad_token_id = pad_token_id
|
| 167 |
+
self.bos_token_id = bos_token_id
|
| 168 |
+
self.eos_token_id = eos_token_id
|
| 169 |
+
self.image_token_id = image_token_id
|
| 170 |
+
self.video_token_id = video_token_id
|
| 171 |
+
self.vision_start_token_id = vision_start_token_id
|
| 172 |
+
self.head_dim = head_dim
|
| 173 |
+
self.hidden_act=hidden_act
|
| 174 |
+
self.sliding_window = None
|
| 175 |
+
self.hidden_size = hidden_size
|
| 176 |
+
self.use_bias = use_bias
|
| 177 |
+
self.weight_share_add_bias = weight_share_add_bias
|
| 178 |
+
self.rope_theta = rope_theta
|
| 179 |
+
self.ignored_index = ignored_index
|
| 180 |
+
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
| 181 |
+
self.hidden_dropout_prob = hidden_dropout_prob
|
| 182 |
+
self.compression_ratio = compression_ratio
|
| 183 |
+
self.num_key_value_heads = num_key_value_heads
|
| 184 |
+
self.max_sequence_length = max_sequence_length
|
| 185 |
+
self.rope_scaling = rope_scaling
|
| 186 |
+
if self.rope_scaling is not None and "type" in self.rope_scaling:
|
| 187 |
+
if self.rope_scaling["type"] == "mrope":
|
| 188 |
+
self.rope_scaling["type"] = "default"
|
| 189 |
+
self.rope_scaling["rope_type"] = self.rope_scaling["type"]
|
| 190 |
+
rope_config_validation(self, ignore_keys={"mrope_section"})
|
| 191 |
+
super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
|
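
One detail worth noting in `PaddleOCRVLConfig.__init__` is that a plain dict passed as `vision_config` is promoted to a `PaddleOCRVisionConfig` via the `sub_configs` mapping. A minimal sketch, assuming the file above is importable from the working directory (the layer and size values are illustrative, not taken from this repo):

```python
# A minimal sketch of the vision_config promotion; values are illustrative.
from configuration_paddleocr_vl import PaddleOCRVLConfig

cfg = PaddleOCRVLConfig(
    vocab_size=103424,
    vision_config={"hidden_size": 1152, "num_hidden_layers": 27},
)
print(type(cfg.vision_config).__name__)     # PaddleOCRVisionConfig
print(cfg.vision_config.num_hidden_layers)  # 27
```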
examples/01.png
ADDED (Git LFS)

examples/02.png
ADDED (Git LFS)

examples/03.png
ADDED

examples/04.png
ADDED (Git LFS)

examples/05.png
ADDED (Git LFS)
generation_config.json
ADDED
@@ -0,0 +1,7 @@
{
  "_from_model_config": true,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "transformers_version": "4.57.1",
  "use_cache": false
}
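
These decoding defaults (EOS id 2, pad id 0, no KV cache) can be read independently of the model weights. A quick sketch, again assuming a local clone path:

```python
# A minimal sketch: reading the generation defaults above (path is an assumption).
from transformers import GenerationConfig

gen_cfg = GenerationConfig.from_pretrained("./PaddleOCR-VL-For-Manga")
print(gen_cfg.eos_token_id, gen_cfg.pad_token_id)  # 2 0
```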
image_processing.py
ADDED
@@ -0,0 +1,563 @@
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Image processor class for PaddleOCR-VL."""

import math
from typing import Dict, List, Optional, Union

import numpy as np
import torch
from torchvision.transforms import functional as TF
from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
from transformers.image_transforms import (
    convert_to_rgb,
    resize,
    to_channel_dimension_format,
)
from transformers.image_utils import (
    OPENAI_CLIP_MEAN,
    OPENAI_CLIP_STD,
    ChannelDimension,
    PILImageResampling,
    get_image_size,
    infer_channel_dimension_format,
    is_scaled_image,
    is_valid_image,
    make_list_of_images,
    to_numpy_array,
    valid_images,
    validate_preprocess_arguments,
)
from transformers.utils import TensorType, is_vision_available, logging


logger = logging.get_logger(__name__)


if is_vision_available():
    from PIL import Image

ImageInput = Union[
    "PIL.Image.Image",
    np.ndarray,
    "torch.Tensor",
    List["PIL.Image.Image"],
    List[np.ndarray],
    List["torch.Tensor"],
]  # noqa


VideoInput = Union[
    List["PIL.Image.Image"],
    "np.ndarray",
    "torch.Tensor",
    List["np.ndarray"],
    List["torch.Tensor"],
    List[List["PIL.Image.Image"]],
    List[List["np.ndarray"]],
    List[List["torch.Tensor"]],
]  # noqa


def make_batched_images(images) -> List[List[ImageInput]]:
    """
    Accepts images in list or nested list format, and makes a list of images for preprocessing.

    Args:
        images (`Union[List[List[ImageInput]], List[ImageInput], ImageInput]`):
            The input image.

    Returns:
        list: A list of images.
    """
    if (
        isinstance(images, (list, tuple))
        and isinstance(images[0], (list, tuple))
        and is_valid_image(images[0][0])
    ):
        return [img for img_list in images for img in img_list]

    elif isinstance(images, (list, tuple)) and is_valid_image(images[0]):
        return images

    elif is_valid_image(images):
        return [images]

    raise ValueError(f"Could not make batched images from {images}")


def adjust_size(size, patch_size):
    num_patches = size // patch_size
    if num_patches % 2 != 0:  # if the patch count is odd, drop one
        num_patches -= 1
    return num_patches * patch_size


def make_batched_videos(videos) -> List[VideoInput]:
    if (
        isinstance(videos, (list, tuple))
        and isinstance(videos[0], (list, tuple))
        and is_valid_image(videos[0][0])
    ):
        return videos

    elif isinstance(videos, (list, tuple)) and is_valid_image(videos[0]):
        if isinstance(videos[0], Image.Image):
            return [videos]
        elif len(videos[0].shape) == 4:
            return [list(video) for video in videos]

    elif is_valid_image(videos) and len(videos.shape) == 4:
        return [list(videos)]

    raise ValueError(f"Could not make batched video from {videos}")


def smart_resize(
    height: int,
    width: int,
    factor: int = 28,
    min_pixels: int = 28 * 28 * 130,
    max_pixels: int = 28 * 28 * 1280,
):
    """Rescales the image so that the following conditions are met:

    1. Both dimensions (height and width) are divisible by 'factor'.

    2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].

    3. The aspect ratio of the image is maintained as closely as possible.
    """

    if height < factor:
        width = round((width * factor) / height)
        height = factor

    if width < factor:
        height = round((height * factor) / width)
        width = factor

    if max(height, width) / min(height, width) > 200:
        raise ValueError(
            f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
        )
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


class SiglipImageProcessor(BaseImageProcessor):
    r"""
    Constructs a Siglip image processor that dynamically resizes images based on the original images.

    Args:
        do_resize (`bool`, *optional*, defaults to `True`):
            Whether to resize the image's (height, width) dimensions.
        resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
            Resampling filter to use when resizing the image.
        do_rescale (`bool`, *optional*, defaults to `True`):
            Whether to rescale the image by the specified scale `rescale_factor`.
        rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
            Scale factor to use if rescaling the image.
        do_normalize (`bool`, *optional*, defaults to `True`):
            Whether to normalize the image.
        image_mean (`float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`):
            Mean to use if normalizing the image. This is a float or list of floats for each channel in the image.
        image_std (`float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`):
            Standard deviation to use if normalizing the image. This is a float or list of floats for each channel in the image.
        do_convert_rgb (`bool`, *optional*, defaults to `True`):
            Whether to convert the image to RGB.
        min_pixels (`int`, *optional*, defaults to `28 * 28 * 130`):
            The min pixels of the image to resize the image.
        max_pixels (`int`, *optional*, defaults to `28 * 28 * 1280`):
            The max pixels of the image to resize the image.
        patch_size (`int`, *optional*, defaults to 14):
            The spatial patch size of the vision encoder.
        temporal_patch_size (`int`, *optional*, defaults to 1):
            The temporal patch size of the vision encoder.
        merge_size (`int`, *optional*, defaults to 2):
            The merge size of the vision encoder to llm encoder.
    """

    model_input_names = [
        "pixel_values",
        "image_grid_thw",
        "pixel_values_videos",
        "video_grid_thw",
    ]

    def __init__(
        self,
        do_resize: bool = True,
        resample: PILImageResampling = PILImageResampling.BICUBIC,
        do_rescale: bool = True,
        rescale_factor: Union[int, float] = 1 / 255,
        do_normalize: bool = True,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_convert_rgb: bool = True,
        min_pixels: int = 28 * 28 * 130,
        max_pixels: int = 28 * 28 * 1280,
        patch_size: int = 14,
        temporal_patch_size: int = 1,
        merge_size: int = 2,
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        self.do_resize = do_resize
        self.resample = resample
        self.do_rescale = do_rescale
        self.rescale_factor = rescale_factor
        self.do_normalize = do_normalize
        self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
        self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
        self.min_pixels = min_pixels
        self.max_pixels = max_pixels
        self.patch_size = patch_size
        self.temporal_patch_size = temporal_patch_size
        self.merge_size = merge_size
        self.size = {"min_pixels": min_pixels, "max_pixels": max_pixels}  # not used
        self.do_convert_rgb = do_convert_rgb

    def mvit_rescale(self, image: Image.Image, merge_size: int = 2) -> Image.Image:
        try:
            w, h = image.size
        except Exception:
            raise ValueError(str((type(image), image)))
        patch_size = self.patch_size

        if (w // patch_size) * (h // patch_size) > self.in_token_limit:
            scale = math.sqrt(
                self.in_token_limit / ((w // patch_size) * (h // patch_size))
            )
            new_w, new_h = int(w * scale), int(h * scale)

            image = image.resize((new_w, new_h), Image.Resampling.BICUBIC)
        if self.pad_input:
            new_w, new_h = image.size
            pad_size_h = merge_size * patch_size
            pad_size_w = merge_size * patch_size

            pad_h = (pad_size_h - new_h % pad_size_h) % pad_size_h
            pad_w = (pad_size_w - new_w % pad_size_w) % pad_size_w

            image = TF.pad(image, (0, 0, pad_w, pad_h))
        else:
            new_w, new_h = image.size
            new_w = new_w - new_w % patch_size
            new_h = new_h - new_h % patch_size

            new_w = adjust_size(new_w, patch_size)
            new_h = adjust_size(new_h, patch_size)

            image = TF.center_crop(image, (new_h, new_w))

        w, h = image.size
        if w // patch_size >= 512 or h // patch_size >= 512:
            new_h = min(patch_size * 510, h)
            new_w = min(patch_size * 510, w)
            image = TF.center_crop(image, (new_h, new_w))
            # raise ValueError("Exceed pos emb")
        return image

    def _preprocess(
        self,
        images: Union[ImageInput, VideoInput],
        do_resize: bool = None,
        resample: PILImageResampling = None,
        do_rescale: bool = None,
        rescale_factor: float = None,
        do_normalize: bool = None,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_convert_rgb: bool = None,
        data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ):
        """
        Preprocess an image or batch of images. Copy of the `preprocess` method from `CLIPImageProcessor`.

        Args:
            images (`ImageInput`):
                Image or batch of images to preprocess. Expects pixel values ranging from 0 to 255. If pixel values range from 0 to 1, set `do_rescale=False`.
            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                Whether to resize the image.
            resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
                Resampling filter to use if resizing the image. This can be one of the `PILImageResampling` enums.
            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
                Whether to rescale the image.
            rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
                Scale factor to use if rescaling the image.
            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
                Whether to normalize the image.
            image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
                Mean to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
            image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
                Standard deviation to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
            do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
                Whether to convert the image to RGB.
            data_format (`ChannelDimension`, *optional*, defaults to `ChannelDimension.FIRST`):
                The channel dimension format for the output image. Can be one of:
                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - Unset: Use the channel dimension format of the input image.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. Can be one of:
                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
        """
        images = make_list_of_images(images)

        if do_convert_rgb:
            images = [convert_to_rgb(image) for image in images]

        # All transformations expect numpy arrays.
        images = [to_numpy_array(image) for image in images]

        if is_scaled_image(images[0]) and do_rescale:
            logger.warning_once(
                "It looks like you are trying to rescale already rescaled images. If the input"
                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
            )
        if input_data_format is None:
            # We assume that all images have the same channel dimension format.
            input_data_format = infer_channel_dimension_format(images[0])

        height, width = get_image_size(images[0], channel_dim=input_data_format)
        resized_height, resized_width = height, width
        processed_images = []

        for image in images:
            if do_resize:
                resized_height, resized_width = smart_resize(
                    height,
                    width,
                    factor=self.patch_size * self.merge_size,
                    min_pixels=self.min_pixels,
                    max_pixels=self.max_pixels,
                )
                image = resize(
                    image,
                    size=(resized_height, resized_width),
                    resample=resample,
                    input_data_format=input_data_format,
                )

            if do_rescale:
                image = self.rescale(
                    image, scale=rescale_factor, input_data_format=input_data_format
                )

            if do_normalize:
                image = self.normalize(
                    image=image,
                    mean=image_mean,
                    std=image_std,
                    input_data_format=input_data_format,
                )
            image = to_channel_dimension_format(
                image, data_format, input_channel_dim=input_data_format
            )
            processed_images.append(image)

        patches = np.array(processed_images)
        if data_format == ChannelDimension.LAST:
            patches = patches.transpose(0, 3, 1, 2)
        if patches.shape[0] == 1:
            patches = np.tile(patches, (self.temporal_patch_size, 1, 1, 1))
        channel = patches.shape[1]
        grid_t = patches.shape[0] // self.temporal_patch_size
        grid_h, grid_w = (
            resized_height // self.patch_size,
            resized_width // self.patch_size,
        )
        patches = patches.reshape(
            grid_t,
            self.temporal_patch_size,
            channel,
            grid_h,
            self.patch_size,
            grid_w,
            self.patch_size,
        )
        patches = patches.transpose(0, 3, 5, 2, 1, 4, 6)
        assert self.temporal_patch_size == 1
        flatten_patches = patches.reshape(
            grid_t * grid_h * grid_w, channel, self.patch_size, self.patch_size
        )
        return flatten_patches, (grid_t, grid_h, grid_w)

    def preprocess(
        self,
        images: ImageInput,
        videos: VideoInput = None,
        do_resize: bool = None,
        size: Dict[str, int] = None,
        resample: PILImageResampling = None,
        do_rescale: bool = None,
        rescale_factor: float = None,
        do_normalize: bool = None,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_convert_rgb: bool = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ):
        """
        Args:
            images (`ImageInput`):
                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
            videos (`VideoInput`):
                Video to preprocess. Expects a single or batch of videos with pixel values ranging from 0 to 255. If
                passing in videos with pixel values between 0 and 1, set `do_rescale=False`.
            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                Whether to resize the image.
            size (`Dict[str, int]`, *optional*, defaults to `self.size`):
                Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with
                the longest edge resized to keep the input aspect ratio.
            resample (`int`, *optional*, defaults to `self.resample`):
                Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
                has an effect if `do_resize` is set to `True`.
            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
                Whether to rescale the image.
            rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
                Rescale factor to rescale the image by if `do_rescale` is set to `True`.
            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
                Whether to normalize the image.
            image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
                Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
            image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
                Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
                `True`.
            do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
                Whether to convert the image to RGB.
            return_tensors (`str` or `TensorType`, *optional*):
                The type of tensors to return. Can be one of:
                - Unset: Return a list of `np.ndarray`.
                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
                The channel dimension format for the output image. Can be one of:
                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - Unset: Use the channel dimension format of the input image.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. If unset, the channel dimension format is inferred
                from the input image. Can be one of:
                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
        """
        do_resize = do_resize if do_resize is not None else self.do_resize
        size = size if size is not None else self.size
        resample = resample if resample is not None else self.resample
        do_rescale = do_rescale if do_rescale is not None else self.do_rescale
        rescale_factor = (
            rescale_factor if rescale_factor is not None else self.rescale_factor
        )
        do_normalize = do_normalize if do_normalize is not None else self.do_normalize
        image_mean = image_mean if image_mean is not None else self.image_mean
        image_std = image_std if image_std is not None else self.image_std
        do_convert_rgb = (
            do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
        )

        if images is not None:
            images = make_batched_images(images)
        if videos is not None:
            videos = make_batched_videos(videos)

        if images is not None and not valid_images(images):
            raise ValueError(
                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
                "torch.Tensor, tf.Tensor or jax.ndarray."
            )

        validate_preprocess_arguments(
            rescale_factor=rescale_factor,
            do_normalize=do_normalize,
            image_mean=image_mean,
            image_std=image_std,
            do_resize=do_resize,
            size=size,
            resample=resample,
        )

        if images is not None:
            pixel_values, vision_grid_thws = [], []
            for image in images:
                patches, image_grid_thw = self._preprocess(
                    image,
                    do_resize=do_resize,
                    resample=resample,
                    do_rescale=do_rescale,
                    rescale_factor=rescale_factor,
                    do_normalize=do_normalize,
                    image_mean=image_mean,
                    image_std=image_std,
                    data_format=data_format,
                    do_convert_rgb=do_convert_rgb,
                    input_data_format=input_data_format,
                )
                pixel_values.extend(patches)
                vision_grid_thws.append(image_grid_thw)
            pixel_values = np.array(pixel_values)
            vision_grid_thws = np.array(vision_grid_thws)
            data = {"pixel_values": pixel_values, "image_grid_thw": vision_grid_thws}

        if videos is not None:
            pixel_values, vision_grid_thws = [], []
            for images in videos:
                patches, video_grid_thw = self._preprocess(
                    images,
                    do_resize=do_resize,
                    resample=resample,
                    do_rescale=do_rescale,
                    rescale_factor=rescale_factor,
                    do_normalize=do_normalize,
                    image_mean=image_mean,
                    image_std=image_std,
                    data_format=data_format,
                    do_convert_rgb=do_convert_rgb,
                    input_data_format=input_data_format,
                )
                pixel_values.extend(patches)
                vision_grid_thws.append(video_grid_thw)
            pixel_values = np.array(pixel_values)
            vision_grid_thws = np.array(vision_grid_thws)
            data = {
                "pixel_values_videos": pixel_values,
                "video_grid_thw": vision_grid_thws,
            }

        return BatchFeature(data=data, tensor_type=return_tensors)
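
The pipeline above comes down to two steps: `smart_resize` snaps each image to multiples of `patch_size * merge_size = 28` while keeping the pixel count inside `[min_pixels, max_pixels]`, then `_preprocess` flattens the result into a sequence of 14x14 patches plus a `(t, h, w)` grid. A minimal sketch of calling it end to end (the clone path is an assumption; `examples/01.png` is one of the images uploaded in this commit):

```python
# A minimal sketch of the image preprocessing above; the path is an assumption.
from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained(
    "./PaddleOCR-VL-For-Manga", trust_remote_code=True
)
image = Image.open("examples/01.png").convert("RGB")
out = processor(images=image, return_tensors="pt")

# pixel_values has shape (grid_t * grid_h * grid_w, 3, 14, 14);
# image_grid_thw holds the (t, h, w) patch grid per image, with t == 1 here.
print(out["pixel_values"].shape)
print(out["image_grid_thw"])
```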
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:71fcee0e3618582d4c8acc705242aa79b471b6134e7023bf3820642ba638b602
size 1917255968
modeling_paddleocr_vl.py
ADDED
The diff for this file is too large to render; see the raw diff.
preprocessor_config.json
ADDED
@@ -0,0 +1,33 @@
{
  "auto_map": {
    "AutoImageProcessor": "image_processing.SiglipImageProcessor",
    "AutoProcessor": "processing_paddleocr_vl.PaddleOCRVLProcessor"
  },
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_processor_type": "SiglipImageProcessor",
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "max_pixels": 2822400,
  "merge_size": 2,
  "min_pixels": 147384,
  "patch_size": 14,
  "processor_class": "PaddleOCRVLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "max_pixels": 2822400,
    "min_pixels": 147384
  },
  "temporal_patch_size": 1
}
processing_paddleocr_vl.py
ADDED
@@ -0,0 +1,293 @@
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import List, Union

import numpy as np
import torch
from transformers.feature_extraction_utils import BatchFeature
from transformers.processing_utils import (
    ProcessingKwargs,
    ProcessorMixin,
    Unpack,
    VideosKwargs,
)
from transformers.tokenization_utils_base import PreTokenizedInput, TextInput


ImageInput = Union[
    "PIL.Image.Image",
    np.ndarray,
    "torch.Tensor",
    List["PIL.Image.Image"],
    List[np.ndarray],
    List["torch.Tensor"],
]  # noqa


VideoInput = Union[
    List["PIL.Image.Image"],
    "np.ndarray",
    "torch.Tensor",
    List["np.ndarray"],
    List["torch.Tensor"],
    List[List["PIL.Image.Image"]],
    List[List["np.ndarray"]],
    List[List["torch.Tensor"]],
]  # noqa


class PaddleOCRVLVideosProcessorKwargs(VideosKwargs, total=False):
    fps: Union[List[float], float]


class PaddleOCRVLProcessorKwargs(ProcessingKwargs, total=False):
    videos_kwargs: PaddleOCRVLVideosProcessorKwargs
    _defaults = {
        "text_kwargs": {
            "padding": False,
        },
        "videos_kwargs": {"fps": 2.0},
    }


class PaddleOCRVLProcessor(ProcessorMixin):
    r"""
    [`PaddleOCRVLProcessor`] offers all the functionalities of [`SiglipImageProcessor`] and [`Qwen2TokenizerFast`]. See the
    [`~PaddleOCRVLProcessor.__call__`] and [`~PaddleOCRVLProcessor.decode`] for more information.

    Args:
        image_processor ([`SiglipImageProcessor`], *optional*):
            The image processor is a required input.
        tokenizer ([`Qwen2TokenizerFast`], *optional*):
            The tokenizer is a required input.
        chat_template (`str`, *optional*): A Jinja template which will be used to convert lists of messages
            in a chat into a tokenizable string.
    """

    attributes = ["image_processor", "tokenizer"]
    valid_kwargs = [
        "chat_template",
        "image_std",
        "min_pixels",
        "image_mean",
        "merge_size",
        "image_processor_type",
        "temporal_patch_size",
        "patch_size",
        "max_pixels",
    ]

    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"

    def __init__(
        self, image_processor=None, tokenizer=None, chat_template=None, **kwargs
    ):
        self.image_token = (
            "<|IMAGE_PLACEHOLDER|>"
            if not hasattr(tokenizer, "image_token")
            else tokenizer.image_token
        )
        self.video_token = (
            "<|video_pad|>"
            if not hasattr(tokenizer, "video_token")
            else tokenizer.video_token
        )
        super().__init__(image_processor, tokenizer, chat_template=chat_template)

    def __call__(
        self,
        images: ImageInput = None,
        text: Union[
            TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]
        ] = None,
        videos: VideoInput = None,
        **kwargs: Unpack[PaddleOCRVLProcessorKwargs],
    ) -> BatchFeature:
        """
        Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the `text`
        and `kwargs` arguments to Qwen2TokenizerFast's [`~Qwen2TokenizerFast.__call__`] if `text` is not `None` to encode
        the text. To prepare the vision inputs, this method forwards the `vision_infos` and `kwargs` arguments to
        SiglipImageProcessor's [`~SiglipImageProcessor.__call__`] if `vision_infos` is not `None`.

        Args:
            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
                tensor. Both channels-first and channels-last formats are supported.
            text (`str`, `List[str]`, `List[List[str]]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            videos (`np.ndarray`, `torch.Tensor`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The image or batch of videos to be prepared. Each video can be a 4D NumPy array or PyTorch
                tensor, or a nested list of 3D frames. Both channels-first and channels-last formats are supported.
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Acceptable values are:
                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return NumPy `np.ndarray` objects.
                - `'jax'`: Return JAX `jnp.ndarray` objects.

        Returns:
            [`BatchFeature`]: A [`BatchFeature`] with the following fields:

            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
              `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
            - **pixel_values_videos** -- Pixel values of videos to be fed to a model. Returned when `videos` is not `None`.
            - **image_grid_thw** -- List of image 3D grid in LLM. Returned when `images` is not `None`.
            - **video_grid_thw** -- List of video 3D grid in LLM. Returned when `videos` is not `None`.
            - **second_per_grid_ts** -- List of video seconds per time grid. Returned when `videos` is not `None`.
        """
        output_kwargs = self._merge_kwargs(
            PaddleOCRVLProcessorKwargs,
            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
            **kwargs,
        )

        if images is not None:
            image_inputs = self.image_processor(images=images, return_tensors="pt")
            image_grid_thw = image_inputs["image_grid_thw"]

        else:
            image_inputs = {}
            image_grid_thw = None

        if videos is not None:
            # TODO: add video processing
            videos_inputs = self.image_processor(
                images=None, videos=videos, **output_kwargs["images_kwargs"]
            )
            video_grid_thw = videos_inputs["video_grid_thw"]

            fps = output_kwargs["videos_kwargs"].pop("fps", 2.0)
            if isinstance(fps, (int, float)):
                second_per_grid_ts = [
                    self.image_processor.temporal_patch_size / fps
                ] * len(video_grid_thw)
            elif hasattr(fps, "__len__") and len(fps) == len(video_grid_thw):
                second_per_grid_ts = [
                    self.image_processor.temporal_patch_size / tmp for tmp in fps
                ]
            else:
                raise ValueError(
                    f"The length of fps ({len(fps) if hasattr(fps, '__len__') else fps}) must be equal to the length of video_grid_thw ({len(video_grid_thw)}) or fps should be a single number."
                )
            videos_inputs.update(
                {"second_per_grid_ts": torch.tensor(second_per_grid_ts)}
            )

        else:
            videos_inputs = {}
            video_grid_thw = None

        if not isinstance(text, list):
            text = [text]

        if image_grid_thw is not None:
            index = 0
            for i in range(len(text)):
                while self.image_token in text[i]:
                    text[i] = text[i].replace(
                        self.image_token,
                        "<|placeholder|>"
                        * (
                            image_grid_thw[index].prod()
                            // self.image_processor.merge_size
                            // self.image_processor.merge_size
                        ),
                        1,
                    )
                    index += 1
                text[i] = text[i].replace("<|placeholder|>", self.image_token)

        if video_grid_thw is not None:
            index = 0
            for i in range(len(text)):
                while self.video_token in text[i]:
                    text[i] = text[i].replace(
                        self.video_token,
                        "<|placeholder|>"
                        * (
                            video_grid_thw[index].prod()
                            // self.image_processor.merge_size
                            // self.image_processor.merge_size
                        ),
                        1,
                    )
                    index += 1
                text[i] = text[i].replace("<|placeholder|>", self.video_token)

        text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])

        return BatchFeature(data={**text_inputs, **image_inputs, **videos_inputs})

    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to Qwen2TokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to Qwen2TokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
        the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    def post_process_image_text_to_text(
        self,
        generated_outputs,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
        **kwargs,
    ):
        """
        Post-process the output of the model to decode the text.

        Args:
            generated_outputs (`torch.Tensor` or `np.ndarray`):
                The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)`
                or `(sequence_length,)`.
            skip_special_tokens (`bool`, *optional*, defaults to `True`):
                Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `batch_decode` method.
            clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
                Whether or not to clean up the tokenization spaces. Argument passed to the tokenizer's `batch_decode` method.
            **kwargs:
                Additional arguments to be passed to the tokenizer's `batch_decode` method.

        Returns:
            `List[str]`: The decoded text.
        """
        return self.tokenizer.batch_decode(
            generated_outputs,
            skip_special_tokens=skip_special_tokens,
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            **kwargs,
        )

    @property
    def model_input_names(self):
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        names_from_processor = list(
            dict.fromkeys(tokenizer_input_names + image_processor_input_names)
        )
        return names_from_processor + ["second_per_grid_ts"]


__all__ = ["PaddleOCRVLProcessor"]
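
The processor's placeholder expansion is the key step: each `<|IMAGE_PLACEHOLDER|>` in the prompt is replaced by `grid_t * grid_h * grid_w // merge_size**2` copies of the image token, so the text token count matches the merged vision features. A minimal end-to-end sketch (the clone path and raw prompt layout are assumptions; the repo's `chat_template.jinja` defines the canonical prompt format):

```python
# A minimal sketch of the full processor; path and raw prompt are assumptions.
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "./PaddleOCR-VL-For-Manga", trust_remote_code=True
)
image = Image.open("examples/01.png").convert("RGB")
prompt = "<|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>OCR:"
batch = processor(images=image, text=prompt, return_tensors="pt")

# input_ids now contains one image token per merged vision patch.
print(batch["input_ids"].shape)
print(batch["pixel_values"].shape)
```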
processor_config.json
ADDED
@@ -0,0 +1,6 @@
{
  "auto_map": {
    "AutoProcessor": "processing_paddleocr_vl.PaddleOCRVLProcessor"
  },
  "processor_class": "PaddleOCRVLProcessor"
}
special_tokens_map.json
ADDED
@@ -0,0 +1,58 @@
{
  "additional_special_tokens": [
    "<|IMAGE_PLACEHOLDER|>",
    "<|image_pad|>",
    "<|IMAGE_START|>",
    "<|IMAGE_END|>",
    "<|video_pad|>"
  ],
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<|begin_of_sentence|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask:1>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "<|end_of_sentence|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
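
The map above registers the placeholder and delimiter tokens as additional special tokens, so each tokenizes to a single id and can be skipped during decoding. A quick check (the path is an assumption):

```python
# A minimal sketch: verifying the special tokens above resolve to single ids.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./PaddleOCR-VL-For-Manga")
for t in ["<|IMAGE_PLACEHOLDER|>", "<|IMAGE_START|>", "<|IMAGE_END|>"]:
    print(t, tok.convert_tokens_to_ids(t))
print(repr(tok.eos_token), repr(tok.pad_token))  # '</s>' '<unk>'
```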
tokenizer.json
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f90f04fd8e5eb6dfa380f37d10c87392de8438dccb6768a2486b5a96ee76dba6
size 11187679
tokenizer.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:34ef7db83df785924fb83d7b887b6e822a031c56e15cff40aaf9b982988180df
size 1614363
tokenizer_config.json
ADDED
The diff for this file is too large to render; see the raw diff.