Output contains many \n
Hi, first of all I want to say this is a great model; I've had great success with it so far. Thank you for your contribution!
Second, I sometimes have an issue with pages that are mostly empty or contain little text: the model outputs long runs of \n\n\n\n\n\n\n ... characters, which makes processing slower overall.
Is there a way to mitigate this?
Currently I am converting all PDF pages to images at dpi=300 and sending those images one at a time to vLLM.
I'm happy to provide more info if needed.
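Roughly, the conversion step looks like this (illustrated with pdf2image; the exact library isn't important and the file name is a placeholder):

from pdf2image import convert_from_path  # requires poppler to be installed

# Render each PDF page to a 300 DPI PNG; these are the images sent to vLLM.
pages = convert_from_path("document.pdf", dpi=300)
for i, page in enumerate(pages):
    page.save(f"page_{i}.png", "PNG")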
For anyone having the same issue, setting a stop sequence did help (see the snippet below).
I am still curious to know if there are any other workarounds.
import base64
from openai import OpenAI

# Client pointed at the vLLM server (adjust base_url/api_key for your deployment).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Base64-encode the rendered page image (file name is just a placeholder).
with open("page_0.png", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="LightOnOcr",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_base64}"},
                },
            ],
        }
    ],
    max_tokens=4096,
    temperature=0.2,
    top_p=0.9,
    # Stop once the model emits ten consecutive newlines.
    stop=["\n\n\n\n\n\n\n\n\n\n"],
)
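With the stop string in place, generation ends as soon as the model emits a run of ten consecutive newlines, so mostly empty pages no longer consume the full max_tokens budget.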
Hi,
Thank you for your feedback on LightOnOCR!
We're working on improving the handling of empty pages for the next version.
For DPI, we used a constant value of 200 throughout training, so that might align better at inference.
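Concretely, if you are rasterizing with something like pdf2image, that would just mean rendering at 200 DPI:

from pdf2image import convert_from_path

# Render at 200 DPI to match the resolution used during training.
pages = convert_from_path("document.pdf", dpi=200)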
Could you also share an example image that's causing issues, if possible?