Output contains many \n
Hi, first of all I want to say this is a great model; I've had great success with it so far. Thank you for your contribution!
Second, I sometimes have an issue with pages that are mostly empty or contain little text: the model outputs long runs of \n\n\n\n\n\n\n ... characters, which makes processing slower overall.
Is there a way to mitigate this?
Currently I am converting all PDF pages to images at dpi=300 and sending those images one at a time to vLLM.
I'm happy to provide more info if needed.
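Roughly, the conversion step looks like this (illustrated with pdf2image; the exact library isn't important and the file name is a placeholder):

from pdf2image import convert_from_path  # requires poppler to be installed

# Render each PDF page to a 300 DPI PNG; these are the images sent to vLLM.
pages = convert_from_path("document.pdf", dpi=300)
for i, page in enumerate(pages):
    page.save(f"page_{i}.png", "PNG")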
For anyone having the same issue, setting a stop sequence did help (see the snippet below).
I am still curious to know if there are any other workarounds.
import base64
from openai import OpenAI

# Client pointed at the vLLM server (adjust base_url/api_key for your deployment).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Base64-encode the rendered page image (file name is just a placeholder).
with open("page_0.png", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="LightOnOcr",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_base64}"},
                },
            ],
        }
    ],
    max_tokens=4096,
    temperature=0.2,
    top_p=0.9,
    # Stop once the model emits ten consecutive newlines.
    stop=["\n\n\n\n\n\n\n\n\n\n"],
)
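With the stop string in place, generation ends as soon as the model emits a run of ten consecutive newlines, so mostly empty pages no longer consume the full max_tokens budget.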
Hi,
Thank you for your feedback on LightOnOCR!
We're working on improving the handling of empty pages for the next version.
For DPI, we used a constant value of 200 throughout training, so that might align better at inference.
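Concretely, if you are rasterizing with something like pdf2image, that would just mean rendering at 200 DPI:

from pdf2image import convert_from_path

# Render at 200 DPI to match the resolution used during training.
pages = convert_from_path("document.pdf", dpi=200)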
Could you also share an example image that's causing issues, if possible?