---
license: apache-2.0
language:
- de
- en
base_model: Qwen/Qwen2-VL-2B-Instruct
tags:
- ocr
- german
- vision
- document-understanding
- invoice
- qwen2-vl
pipeline_tag: image-text-to-text
library_name: transformers
datasets:
- neuralabs/german-synth-ocr
---

# German-OCR

High-performance German document OCR using fine-tuned Qwen2-VL-2B and Qwen2.5-VL-3B vision-language models

## Model Description

German-OCR is trained specifically to extract text from German documents, including invoices, receipts, forms, and other business documents. It outputs structured text in Markdown format.

- **Base Model**: Qwen/Qwen2-VL-2B-Instruct
- **Fine-tuning**: QLoRA (4-bit quantization)
- **Training Data**: German invoices and business documents
- **Output Format**: Structured Markdown text

## Model Variants

| Model | Size | Base | HuggingFace |
|-------|------|------|-------------|
| german-ocr | 4.4 GB | Qwen2-VL-2B | [Keyven/german-ocr](https://huggingface.co/Keyven/german-ocr) |
| german-ocr-3b | 7.5 GB | Qwen2.5-VL-3B | [Keyven/german-ocr-3b](https://huggingface.co/Keyven/german-ocr-3b) |

## Usage

### Option 1: Python Package (Recommended)

```bash
pip install german-ocr
```

```python
from german_ocr import GermanOCR

# Using Ollama (fast, local)
ocr = GermanOCR(backend="ollama")
result = ocr.extract("document.png")
print(result)

# Using Transformers (more accurate)
ocr = GermanOCR(backend="transformers")
result = ocr.extract("document.png")
print(result)
```

### Option 2: Ollama

> [!WARNING]
> **In development** - vision adapter compatibility is still being worked on. For stable use, the [HuggingFace version](https://huggingface.co/Keyven/german-ocr) is recommended.

```bash
ollama run Keyvan/german-ocr "Extrahiere den Text: image.png"
```

### Option 3: Transformers

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image

# Load the fine-tuned model and its processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Keyven/german-ocr",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Keyven/german-ocr")

# Build a chat message containing the document image and the extraction prompt
image = Image.open("document.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Extrahiere den Text aus diesem Dokument."}
    ]
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
).to(model.device)

# Generate and decode only the newly generated tokens
output_ids = model.generate(**inputs, max_new_tokens=512)
result = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True
)[0]
print(result)
```

## Performance

| Metric | Value |
|--------|-------|
| Base Model | Qwen2-VL-2B-Instruct |
| Model Size | 4.4 GB |
| VRAM (4-bit) | 1.5 GB |
| Inference Time | ~15 s (GPU) |

The Transformers example above loads the model at its default precision; a 4-bit loading sketch is given in the appendix at the end of this card.

## Training

- **Method**: QLoRA (4-bit quantization)
- **Epochs**: 3
- **Learning Rate**: 2e-4
- **LoRA Rank**: 64
- **Target Modules**: All linear layers

A QLoRA configuration sketch matching these hyperparameters is also given in the appendix at the end of this card.

## Limitations

- Optimized for German documents
- Best results with clear, high-resolution images
- May struggle with handwritten text

## License

Apache 2.0

## Author

**Keyvan Hardani**

- Website: [keyvan.ai](https://keyvan.ai)
- LinkedIn: [linkedin.com/in/keyvanhardani](https://www.linkedin.com/in/keyvanhardani/)
- GitHub: [@Keyvanhardani](https://github.com/Keyvanhardani)

## Links

- [GitHub](https://github.com/Keyvanhardani/german-ocr)
- [Ollama](https://ollama.com/Keyvan/german-ocr)
- [HuggingFace](https://huggingface.co/Keyven/german-ocr)
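
## Appendix: Configuration Sketches

The VRAM figure in the Performance table assumes 4-bit quantized loading, which the Transformers example above does not show. The following is a minimal sketch of how the checkpoint could be loaded in 4-bit with a `BitsAndBytesConfig`; it assumes the `bitsandbytes` package and a CUDA GPU, and is an illustration rather than part of the official package.

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2VLForConditionalGeneration

# 4-bit NF4 quantization settings (assumed; the card only states "4-bit")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load the checkpoint with 4-bit weights to reduce VRAM usage
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Keyven/german-ocr",
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Keyven/german-ocr")
```

The rest of the inference code is identical to the Transformers example in the Usage section.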
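
The hyperparameters listed under Training correspond roughly to the following QLoRA setup using the `peft` library. The LoRA alpha and dropout values are assumptions (they are not stated on this card), and the actual training script is not published here.

```python
from peft import LoraConfig

# LoRA settings from the card: rank 64, all linear layers as targets.
# lora_alpha and lora_dropout are assumptions and are not stated on the card.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,          # assumption
    lora_dropout=0.05,       # assumption
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# The remaining stated hyperparameters (3 epochs, learning rate 2e-4) would be
# passed to the trainer, e.g. transformers.TrainingArguments(num_train_epochs=3,
# learning_rate=2e-4, ...), together with the 4-bit base model loaded as above.
```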