[Issue] Repetitive text with endless loop

#89
by abhi22 - opened

While testing the model on multilingual document images, it starts generating repetitive text in an endless loop for unknown languages.

Is there a workaround for this problem, such that the model could be instructed to extract only text in a specific language (for example, 'en') and discard all other content, rather than unnecessarily trying to interpret it?

abhi22 changed discussion status to closed
abhi22 changed discussion status to open
  1. Tried changing the prompt; it didn't work.
  2. Applied a patch to detect repetitive tokens based on a threshold. It helped, but it also stops inference before the model produces the complete OCR of the image.
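For anyone trying the second workaround, here is a minimal sketch of such a repetition check (the n-gram window and repeat threshold are arbitrary assumptions; tune them for your documents). It can be wired into a `transformers` `StoppingCriteria` to abort generation early:

```python
def looks_repetitive(token_ids, ngram=8, min_repeats=3):
    """Return True if the last `ngram` tokens have repeated back-to-back
    at least `min_repeats` times at the tail of the sequence."""
    if len(token_ids) < ngram * min_repeats:
        return False
    tail = token_ids[-ngram:]
    # Walk backwards one ngram-sized chunk at a time and compare to the tail.
    for k in range(2, min_repeats + 1):
        chunk = token_ids[-k * ngram : -k * ngram + ngram]
        if chunk != tail:
            return False
    return True
```

The trade-off mentioned above still applies: stopping on repetition truncates the output, so legitimate repeated content (e.g. table rows) can trigger false positives if the window is too small.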

I have the same problem when testing these samples with Ollama.
Specifically, form_01.png, table_10.png and tablet_12.png cause it to loop forever with `ollama run deepseek-ocr "/path/to/image\n<|grounding|>Convert the document to markdown."`

Have you tried adding <|grounding|>? It fixed the infinite looping for me

Let me try out <|grounding|> and update the results here.

The DeepSeek OCR model often returns only these <|ref|>/<|det|> region tags, but the pages actually contain text that is never returned. The performance is far from what I expected. What could be wrong? I am using the API version from https://cloud.siliconflow.cn/.

Page 32: ██████████▏ | 31/258 [03:12<17:41, 4.67s/it]

<|ref|>text<|/ref|><|det|>[[57, 110, 491, 151]]<|/det|>

<|ref|>text<|/ref|><|det|>[[57, 154, 491, 195]]<|/det|>

<|ref|>text<|/ref|><|det|>[[57, 198, 491, 239]]<|/det|>

<|ref|>text<|/ref|><|det|>[[57, 243, 491, 283]]<|/det|>

<|ref|>title<|/ref|><|det|>[[58, 262, 118, 280]]<|/det|>

<|ref|>text<|/ref|><|det|>[[81, 284, 295, 302]]<|/det|>

<|ref|>text<|/ref|><|det|>[[57, 306, 491, 346]]<|/det|>

<|ref|>table<|/ref|><|det|>[[91, 350, 455, 627]]<|/det|>

<|ref|>title<|/ref|><|det|>[[82, 630, 151, 646]]<|/det|>

<|ref|>text<|/ref|><|det|>[[58, 651, 491, 712]]<|/det|>

<|ref|>title<|/ref|><|det|>[[82, 716, 191, 733]]<|/det|>

<|ref|>text<|/ref|><|det|>[[57, 737, 491, 777]]<|/det|>

<|ref|>text<|/ref|><|det|>[[82, 781, 219, 798]]<|/det|>

<|ref|>text<|/ref|><|det|>[[82, 802, 237, 819]]<|/det|>

<|ref|>text<|/ref|><|det|>[[57, 823, 491, 863]]<|/det|>

<|ref|>text<|/ref|><|det|>[[507, 110, 937, 150]]<|/det|>

<|ref|>text<|/ref|><|det|>[[527, 154, 923, 172]]<|/det|>

<|ref|>text<|/ref|><|det|>[[529, 176, 636, 192]]<|/det|>

<|ref|>text<|/ref|><|det|>[[529, 198, 743, 215]]<|/det|>

<|ref|>text<|/ref|><|det|>[[529, 220, 808, 237]]<|/det|>

<|ref|>text<|/ref|><|det|>[[527, 242, 814, 259]]<|/det|>

<|ref|>text<|/ref|><|det|>[[507, 264, 937, 325]]<|/det|>

<|ref|>title<|/ref|><|det|>[[529, 330, 588, 346]]<|/det|>

<|ref|>text<|/ref|><|det|>[[507, 352, 937, 387]]<|/det|>

<|ref|>text<|/ref|><|det|>[[507, 392, 937, 453]]<|/det|>

<|ref|>text<|/ref|><|det|>[[507, 458, 937, 540]]<|/det|>

<|ref|>text<|/ref|><|det|>[[507, 545, 937, 586]]<|/det|>

<|ref|>text<|/ref|><|det|>[[507, 590, 937, 672]]<|/det|>

<|ref|>text<|/ref|><|det|>[[507, 676, 937, 738]]<|/det|>

<|ref|>title<|/ref|><|det|>[[508, 703, 546, 719]]<|/det|>

<|ref|>text<|/ref|><|det|>[[529, 724, 730, 741]]<|/det|>

<|ref|>text<|/ref|><|det|>[[527, 746, 885, 764]]<|/det|>

<|ref|>text<|/ref|><|det|>[[527, 768, 887, 785]]<|/det|>

<|ref|>text<|/ref|><|det|>[[507, 790, 937, 830]]<|/det|>
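As an aside, the grounding dump above can at least be turned into structured (label, bbox) records with a short script. A sketch (the tag format is copied verbatim from the output above; it assumes one box per <|det|> tag, which is all this dump contains):

```python
import re

# Matches e.g. <|ref|>text<|/ref|><|det|>[[57, 110, 491, 151]]<|/det|>
DET_RE = re.compile(
    r"<\|ref\|>(?P<label>.*?)<\|/ref\|>"
    r"<\|det\|>\[\[(?P<box>[\d,\s]+)\]\]<\|/det\|>"
)

def parse_grounding(text):
    """Extract (label, [x1, y1, x2, y2]) pairs from grounding output."""
    out = []
    for m in DET_RE.finditer(text):
        coords = [int(v) for v in m.group("box").split(",")]
        out.append((m.group("label"), coords))
    return out
```

That recovers the layout regions, but of course not the missing text content itself.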

After more testing, it happens both with Hugging Face transformers and Ollama, so the problem is in the model itself.


Same issue here. In my case it was KV cache quantization: small models suffer badly with quantized caches. In bf16 it works like a charm.

Did that also fix the looping bug?

I tried <|grounding|>, and it doesn't fix the issue. With grounding it returns empty text, and without grounding it gets stuck emitting repetitive tokens, especially when the language is not English (the document contains multilingual data).

After testing different parameter settings and a patch for max_tokens, I was able to work around the issue. It may not be a real fix, since the problem is in the model itself. Here's the configuration I used:

PATCH:
model.generation_config.max_new_tokens = 2048

MODEL INFERENCE CONFIGS:
"base_size": 640
"image_size": 640
"crop_mode": True
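For clarity, the same workaround expressed as code (a sketch: `model` stands for the loaded DeepSeek-OCR model, and the inference keyword names mirror the settings listed above, so verify them against your version of the model card):

```python
# 640px tiled ("crop mode") inference settings from the post above.
infer_kwargs = {
    "base_size": 640,   # canvas size the page is resized onto
    "image_size": 640,  # resolution of each crop/tile
    "crop_mode": True,  # tile large pages instead of one global view
}

def apply_workaround(model):
    # Hard cap on new tokens: a runaway repetition loop then terminates.
    # The output is truncated on broken pages, but inference always finishes.
    model.generation_config.max_new_tokens = 2048
    return infer_kwargs
```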

yep

Sadly for Ollama, the default KV cache type is already f16. Setting it by hand with OLLAMA_KV_CACHE_TYPE=f16 doesn't change anything; it still loops forever.

Ah, I see. I'm using vLLM.
